make NNlibCUDA an extension #492

Merged · 11 commits · Jun 14, 2023
Changes from 1 commit
21 changes: 7 additions & 14 deletions .buildkite/pipeline.yml
@@ -1,30 +1,22 @@
steps:
- label: "GPU julia v1.6"
- label: "CUDA - Julia v1.9"
plugins:
- JuliaCI/julia#v1:
version: "1.6"
Member

Without this, we have no way to tell if changes to NNlib break NNlibCUDA on Julia <1.9. So either we add something like I mentioned in #495 (comment) or we create a separate Reverse CI step/pipeline for NNlibCUDA.

Member Author

I don't understand the issue here. Changes to NNlib won't affect Julia < 1.9 users, since NNlib requires Julia >= 1.9 from now on.

Member Author

For backports, if there are ever going to be any, we can test locally.

Member (@ToucheSir, Jun 14, 2023)

I missed that Project.toml also bumped Julia compat. To be clear then, merging this would mean we're stopping feature development for Julia <1.9 and now maintaining two backport branches (one for NNlib and one for NNlibCUDA)? I recall there being mixed success with backport branches in FluxML before; are there any lessons learned from that so that this doesn't run into the same issues? cc @FluxML/committers

Edit: I misspoke about the two backport branches. Depending on how we want to handle extensions in Flux, this may require three backport branches, right?

Member Author

> To be clear then, merging this would mean we're stopping feature development for Julia <1.9 and now maintaining two backport branches (one for NNlib and one for NNlibCUDA)?

We are stopping development for Julia < 1.9, and we will maintain backport branches in case we feel the need and care enough about backporting something, which I consider an unlikely event. In expectation, the benefits far outweigh the drawbacks.

Member

1 vote for moving to 1.9 soon!

Member

Ok, in that case I'll cast my vote for this too. The rest of the PR LGTM.

version: "1.9"
- JuliaCI/julia-test#v1: ~
- JuliaCI/julia-coverage#v1:
codecov: true
dirs:
- src
# commands:
# - julia --project=test -e """
# Pkg.develop(url = \"https://github.com/FluxML/NNlibCUDA.jl\")
# Pkg.instantiate()
# Pkg.build()
# Pkg.status()
# Pkg.test()
# Pkg.test(\"NNlibCUDA\")
# """
- ext
agents:
queue: "juliagpu"
cuda: "*"
env:
NNLIB_TEST_CUDA: true
timeout_in_minutes: 60

- label: "GPU julia v1"
- label: "CUDA - Julia v1"
plugins:
- JuliaCI/julia#v1:
version: "1"
@@ -33,6 +25,7 @@ steps:
codecov: true
dirs:
- src
- ext
agents:
queue: "juliagpu"
cuda: "*"
@@ -55,10 +48,10 @@ steps:
if: build.pull_request.labels includes "benchmark"
timeout_in_minutes: 30

- label: "AMDGPU - Julia 1.9"
- label: "AMDGPU - Julia v1.9"
plugins:
- JuliaCI/julia#v1:
version: 1.9-nightly
version: "1.9"
- JuliaCI/julia-test#v1:
- JuliaCI/julia-coverage#v1:
codecov: true
1 change: 0 additions & 1 deletion .gitignore
@@ -12,7 +12,6 @@ deps.jl
*.log
.vscode/
/Manifest.toml
lib/NNlibCUDA/Manifest.toml
benchmark/Manifest.toml
benchmark/*.json
benchmark/report.md
11 changes: 9 additions & 2 deletions Project.toml
@@ -1,11 +1,12 @@
name = "NNlib"
uuid = "872c559c-99b0-510c-b3b7-b6c96a88d5cd"
version = "0.8.20"
version = "0.9.0"

[deps]
Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
Atomix = "a9b6321e-bd34-4604-b9c9-b65b8de01458"
ChainRulesCore = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4"
cuDNN = "02a925ec-e4fe-4b08-9a7e-0d78e3d38ccd"
Member Author

I had to make cuDNN a strong dependency because it seems that there is no way to make using CUDA also trigger the loading of cuDNN from the extension. This is not ideal, but the alternatives seem worse:

  • have the extension triggered by using CUDA, cuDNN
  • keep NNlibCUDA as a separate package
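For reference, here is a minimal Project.toml sketch of the first alternative, assuming the standard multi-trigger extension syntax; it is not what this commit does (the commit keeps cuDNN under [deps]):

[weakdeps]
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
cuDNN = "02a925ec-e4fe-4b08-9a7e-0d78e3d38ccd"

[extensions]
NNlibCUDAExt = ["CUDA", "cuDNN"]

With that declaration, NNlibCUDAExt only loads once both CUDA and cuDNN have been imported in the session.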

Member Author

Is there some other option?

Member

Making it a strong dep is a non-starter because cuDNN directly depends on CUDA.jl. I would say we go with option 1, but there are a couple variations on it we could also consider:

  1. Make only cuDNN a weak dep and access CUDA.jl through cuDNN.CUDA.
  2. Create separate extensions for CUDA.jl and cuDNN. Then someone can choose to only load the former if they don't need e.g. conv functionality. This doesn't make anything easier so we should probably consider it after an extension is in place.
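As an illustration of variation 1, the extension entry point might look roughly like the sketch below; the cuDNN.CUDA shortcut is exactly the assumption under discussion, and the import line simply mirrors the existing ctc.jl file:

module NNlibCUDAExt

using NNlib
using cuDNN
const CUDA = cuDNN.CUDA  # reach CUDA.jl through cuDNN, so only cuDNN needs to be a weak dependency

import NNlib: ctc_loss, ctc_alpha, ∇ctc_loss  # CUDA-backed methods would extend these for CUDA.CuArray

end # module

The trade-off is that the user-facing trigger then becomes using cuDNN rather than using CUDA.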

Member Author

I also think we should go with using CUDA, cuDNN. This will also carry over to Flux's usage.
I'm annoyed by portability issues for scripts. Ideally, it should be possible to run the same script on any machine, without any edits, using the appropriate device. Something like:

using Flux

gpu_backend = ... # Some hardware detection utility or Preferences.jl magic?

if gpu_backend == "cuda"
    using CUDA, cuDNN
elseif gpu_backend == "amdgpu"
    using AMDGPU
elseif gpu_backend == "metal"
    using Metal
end

Even better, the whole loading should be done conditionally by Flux itself, but I don't think this can be done without hard dependencies.

Anyways, since this is a crucial decision, let's also ask @mcabbott and @darsnack if they are ok with having the CUDA extension here be triggered by using CUDA, cuDNN.

Member

I really don't like using Flux, CUDA, cuDNN; it seems a huge shame to have to load two packages, one of which is a super-obscure internal thing. I mean, I don't even know where it lives: https://www.google.com/search?q=cuDNN.jl leads me only to one package abandoned 5 years ago.

It's a huge shame to give up on using Flux, CUDA as the interface. I understand that the default use of new package extensions does not allow us to then load cuDNN. I wonder if we should seriously consider either hacks for now (can Requires or something like it load cuDNN?) or finding out if upstream can be changed e.g. to unify cuDNN into CUDA.
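For concreteness, a very rough sketch of the Requires-style hack floated above; this is purely illustrative, not something the PR implements, and whether auto-loading cuDNN like this is even acceptable is the open question:

module NNlibRequiresSketch  # hypothetical stand-in for where such a hack would live

using Requires

function __init__()
    # React to `using CUDA` and pull in cuDNN on the user's behalf. cuDNN must
    # still be installed and resolvable in the active environment for this to work.
    @require CUDA="052768ef-5323-5732-b1bb-66c8b64840ba" begin
        @eval Main import cuDNN
    end
end

end # module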

Contributor

Some packages might want to expose GPU-accelerated functionality without users having to depend on either CUDA or CUDNN. With Preferences, the user environment would then need to include CUDA (i.e. in the Project.toml) in order to set the CUDNN preference.
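To make that constraint concrete, a hypothetical Preferences.jl flow might look like this (the "cudnn" key is invented for illustration; nothing here is an existing NNlib or CUDA API):

using Preferences, CUDA  # CUDA must be listed in the active project, under [deps] or [extras]

set_preferences!(CUDA, "cudnn" => true)  # written to LocalPreferences.toml
# After restarting Julia, a package could read the flag with:
#   load_preference(CUDA, "cudnn", false)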

Member

Does JuliaPackaging/Preferences.jl#24 work for package extension dependencies? The example seems to imply that packages can set preferences for their dependencies.

Contributor

I'm not sure. In any case, that example also shows how the active project needs to have A as a hard dependency, either in [deps] or in [extras], which is the point I was making above.

That said, although I'm happy to consider alternative ways of loading CUDNN-like functionality, I don't see it happening soon. Without a first-class language feature and using Preferences.jl, it would require users to import CUDA.jl to enable the CUDNN features, which IMO is mostly the same as having them do using cuDNN. And even with a first-class feature where, say, packages could express in their Project.toml which features they request of a package, it doesn't seem clear how that would interact with package extensions (what if a user has CUDA.jl but not CUDA.jl+CUDNN, etc).

Member

By sheer coincidence, I noticed JuliaPackaging/Preferences.jl#53 was reported today.

Contributor

FWIW, I'm currently too busy working on other things to consider reworking the CUDA.jl/cuDNN.jl situation again (especially now that it just stabilized a bit after introducing JLLs), but I'm not opposed to changes. So if anybody would want to explore a different mechanism for shipping CUDA libraries, feel free to open an issue or PR.

GPUArraysCore = "46192b85-c4d5-4398-a991-12ede77f4527"
KernelAbstractions = "63c18a36-062a-441e-b654-da1e3ab1ce7c"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
@@ -16,19 +17,25 @@ Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"

[weakdeps]
AMDGPU = "21141c5a-9bdb-4563-92ae-f87d6854732e"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"

[extensions]
NNlibAMDGPUExt = "AMDGPU"
NNlibCUDAExt = "CUDA"

[compat]
AMDGPU = "0.4.8"
Adapt = "2, 3.2"
Atomix = "0.1"
ChainRulesCore = "1.13"
CUDA = "4"
cuDNN = "1"
GPUArraysCore = "0.1"
KernelAbstractions = "0.9.2"
Requires = "0.5, 1.0"
julia = "1.6"
julia = "1.9"

[extras]
AMDGPU = "21141c5a-9bdb-4563-92ae-f87d6854732e"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"

2 changes: 1 addition & 1 deletion README.md
@@ -16,4 +16,4 @@ This package provides a library of functions useful for neural networks, such as

For use with automatic differentiation, this package defines gradients using [ChainRules.jl](https://github.com/JuliaDiff/ChainRules.jl). These will be seen by various packages including [Zygote.jl](https://github.com/FluxML/Zygote.jl).

To use these functions with [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) you will need [NNlibCUDA.jl](https://github.com/FluxML/NNlibCUDA.jl) as well.
GPU support is provided whenever the corresponding package (e.g. [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) or [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl)) is loaded.
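In practice, as of this commit, a minimal CUDA session looks like the sketch below (assuming a working GPU setup; cuDNN is pulled in automatically because it is a hard dependency here):

using NNlib
using CUDA   # loading CUDA triggers NNlibCUDAExt

x = CUDA.rand(Float32, 4, 3)
softmax(x)   # runs on the GPU, with the extension supplying CUDA-specific methods where needed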
3 changes: 1 addition & 2 deletions docs/src/index.md
@@ -4,5 +4,4 @@

For use with automatic differentiation, this package defines gradients using [ChainRules.jl](https://github.com/JuliaDiff/ChainRules.jl). These will be seen by various packages including [Zygote.jl](https://github.com/FluxML/Zygote.jl).

To use these functions with [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) you will need [NNlibCUDA.jl](https://github.com/FluxML/NNlibCUDA.jl) as well.
For [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl) you will need to load it and NNlib in the same Julia session.
To use these functions with [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) or [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl) you will need to load them and NNlib in the same Julia session.
20 changes: 0 additions & 20 deletions ext/NNlibCUDA/.buildkite/pipeline.yml

This file was deleted.

26 changes: 0 additions & 26 deletions ext/NNlibCUDA/.github/workflows/compathelper.yml

This file was deleted.

15 changes: 0 additions & 15 deletions ext/NNlibCUDA/.github/workflows/tagbot.yml

This file was deleted.

1 change: 0 additions & 1 deletion ext/NNlibCUDA/.gitignore

This file was deleted.

23 changes: 0 additions & 23 deletions ext/NNlibCUDA/LICENSE.md

This file was deleted.

29 changes: 0 additions & 29 deletions ext/NNlibCUDA/Project.toml

This file was deleted.

5 changes: 0 additions & 5 deletions ext/NNlibCUDA/README.md

This file was deleted.

27 changes: 0 additions & 27 deletions ext/NNlibCUDA/test/batchnorm.jl

This file was deleted.

@@ -1,4 +1,4 @@
module NNlibCUDA
module NNlibCUDAExt

using NNlib
using CUDA, cuDNN
File renamed without changes.
File renamed without changes.
2 changes: 1 addition & 1 deletion ext/NNlibCUDA/src/ctc.jl → ext/NNlibCUDAExt/ctc.jl
@@ -1,4 +1,4 @@
# CTC loss moved from Flux.jl to NNlib + NNlibCUDA
# CTC loss moved from Flux.jl to NNlib

import NNlib: ctc_loss, ctc_alpha, ∇ctc_loss

File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
2 changes: 1 addition & 1 deletion src/activations.jl
@@ -1,6 +1,6 @@
## Activation functions
#
# Some activation functions have wrapper functions for GPU in NNlibCUDAExt.
# Some of activation functions have its wrapper function for GPU in NNlibCUDAExt.jl.
# https://github.com/JuliaGPU/CuArrays.jl/issues/614

ACTIVATIONS = [
2 changes: 1 addition & 1 deletion src/ctc.jl
@@ -1,4 +1,4 @@
# CTC loss moved from Flux.jl to NNlib + NNlibCUDA
# CTC loss moved from Flux.jl to NNlib

## CPU implementation

20 changes: 0 additions & 20 deletions src/deprecations.jl
@@ -1,23 +1,3 @@

### Deprecated while v0.7 was latest

function ∇softmax(Δ, x; dims = 1)
    # This 2-arg version recomputes the forward pass, which is slow.
    # Removed from use in 0.7, but only prints a warning during 0.8:
    Base.depwarn("`∇softmax(Δ, x)` without `y = softmax(x)` argument is deprecated, as this is inefficient, please use `∇softmax_data(dy, y)`", :∇softmax)
    ∇softmax(Δ, x, softmax(x; dims); dims)
end
∇softmax!(Δ, x; dims = 1) = Δ .= ∇softmax(Δ, x; dims)
∇softmax!(out, Δ, x; dims = 1) = out .= ∇softmax(Δ, x; dims)

function ∇logsoftmax(Δ, x; dims = 1)
    Base.depwarn("`∇logsoftmax(Δ, x)` without `y = logsoftmax(x)` argument is deprecated, please use `∇logsoftmax_data(dy, y)`", :∇logsoftmax)
    ∇logsoftmax(Δ, x, logsoftmax(x; dims); dims)
end
∇logsoftmax!(Δ, x; dims = 1) = Δ .= ∇logsoftmax(Δ, x; dims)
∇logsoftmax!(out, Δ, x; dims = 1) = out .= ∇logsoftmax(Δ, x; dims)


### Deprecated while v0.8 was latest

export ∇softmax,
2 changes: 1 addition & 1 deletion src/dropout.jl
@@ -158,5 +158,5 @@ _rng_from_array(::AbstractArray) = Random.default_rng()
@non_differentiable _rng_from_array(::Any)

# This exists because `rand!(default_rng(), CUDA.rand(3))` ignores the RNG,
# and Flux would prefer an error. NNlibCUDA will overload it to produce that.
# and Flux would prefer an error. NNlibCUDAExt will overload it to produce that.
_rng_compat_array(::AbstractRNG, ::AbstractArray) = nothing
9 changes: 0 additions & 9 deletions src/upsample.jl
@@ -380,15 +380,6 @@ function ∇upsample_linear_kernel!(
return dx
end

# Compatibility layer for old versions of NNlibCUDA.
# TODO Can be removed from NNlib 0.9.
upsample_linear_wcn!(y, x) = upsample_linear_kernel!(y, x)
upsample_bilinear_whcn!(y, x) = upsample_linear_kernel!(y, x)
upsample_trilinear_whdcn!(y, x) = upsample_linear_kernel!(y, x)
∇upsample_linear_wcn!(y, x) = ∇upsample_linear_kernel!(y, x)
∇upsample_bilinear_whcn!(y, x) = ∇upsample_linear_kernel!(y, x)
∇upsample_trilinear_whdcn!(y, x) = ∇upsample_linear_kernel!(y, x)

# Linear (CPU): parallelization along channel x batch dimensions.

@kernel function _upsample_linear_kernel!(::CPU, y::T, x::T, rwidth, align::Val{A}) where {
2 changes: 0 additions & 2 deletions test/Project.toml
@@ -1,13 +1,11 @@
[deps]
Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
ChainRulesTestUtils = "cdddcdb0-9152-4a09-a978-84456f9df70a"
KernelAbstractions = "63c18a36-062a-441e-b654-da1e3ab1ce7c"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
FiniteDifferences = "26cc04aa-876d-5657-8c51-4c34ba976000"
ForwardDiff = "f6369f11-7733-5829-9624-2563aa707210"
Logging = "56ddb016-857b-54e1-b83d-db4d58db5568"
NNlibCUDA = "a00861dc-f156-4864-bf3c-e6376f28a68d"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
StableRNGs = "860ef19b-820b-49d6-a774-d7a799459cd3"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
@@ -14,8 +14,8 @@ end
end
end

# Broadcasting over complex CuArray works without NNlibCUDA, this test checks that
# NNlibCUDA does not cause such operations to take a fast path which does not support
# Broadcasting over complex CuArray works without NNlibCUDAExt, this test checks that
# NNlibCUDAExt does not cause such operations to take a fast path which does not support
# complex numbers (e.g. cuDNN)
@testset "complex" begin
f(x) = tanh.(x)
File renamed without changes.
File renamed without changes.
27 changes: 27 additions & 0 deletions test/ext_cuda/batchnorm.jl
@@ -0,0 +1,27 @@
@testset "Batchnorm" begin
v = CUDA.rand(Float32, 2)
m = CUDA.rand(Float32, 2, 5)

@testset for training in (true, false), track_stats in (true, false)
kws = (training=training, track_stats=track_stats)

# Normal
batchnorm(v, v, m, v, v, 1.0; kws...)
∇batchnorm(v, v, m, m, v, v, 1.0; kws...)

# No affine
batchnorm(nothing, nothing, m, v, v, 1.0; kws...)
∇batchnorm(nothing, nothing, m, m, v, v, 1.0; kws...)

# No tracking
batchnorm(v, v, m, nothing, nothing, 1.0; kws...)
∇batchnorm(v, v, m, m, nothing, nothing, 1.0; kws...)

# Both or neither tracked or affine params must be set
for (α, β) in ((v, nothing), (nothing, v))
@test_throws MethodError batchnorm(α, β, m, v, v, 1.0; kws...)
@test_throws MethodError ∇batchnorm(α, β, m, m, v, v, 1.0; kws...)
@test_throws ArgumentError batchnorm(v, v, m, α, β, 1.0; kws...)
end
end
end
File renamed without changes.