Use the correct CUDNN scaling parameter type. #454
Conversation
ALL_LOSSES was removed some time back and the code moved to a sub-package. Is it called directly when testing CUDA?
Ah, no, I see what's up with that: I'm running the CUDA tests in isolation, but it appears the Flux tests are stateful. Anyway, unrelated to this PR.
Started the tests now.
Codecov Report
@@            Coverage Diff             @@
##           master     #454      +/-   ##
==========================================
- Coverage   80.77%   80.75%   -0.02%
==========================================
  Files         166      166
  Lines        9086     9090       +4
==========================================
+ Hits         7339     7341       +2
- Misses       1747     1749       +2
I see failures in the curnn tests as well as some movement tests.
That's not going to show much: neither 1.3 nor nightly is supported by CUDA.jl. Strange that you see failures; everything passes locally. Could you post some details?
Ah, my bad, I forgot to push the Project.toml and GitLab config. It's on this branch: https://github.com/FluxML/Flux.jl/tree/test_cudnn
You shouldn't be running with an explicit …
Also, lots of Flux failures on CUDA.jl#master too: #455 (comment). Didn't you recently validate the master branch with Flux, or did I misunderstand?
Wow, I kinda gave up on this issue when I saw some other issue about float16 support, which made it seem like there was some fundamental difference in how Julia did float16. Sorry for the stupid question, but would this be enough to get the equivalent of …?

Fwiw, I get similar errors when I test Flux master with CUDA master and with this branch. The tests labeled "Conv GPU grad tests" pass.

"Flux master + CUDA master"

Test Summary: | Pass Fail Error Broken Total
CUDA | 77 21 1 34 133
  CUDA | 9 9
  onecold gpu | 2 2
  restructure gpu | 1 1
  GPU functors | 2 2
  Losses | 29 1 30
  GPU grad tests | 24 1 25
  Basic GPU Movement | 2 2
  Conv GPU grad tests | 6 1 7
  Pooling GPU grad tests | 2 2
  AdaptivePooling GPU grad tests | 2 2
  Dropout GPU grad tests | 1 1 2
  Normalising GPU grad tests | 3 1 4
  LayerNorm GPU grad test | 1 1 2
  BatchNorm GPU grad test | 2 2
  InstanceNorm GPU grad tests | 1 1
  GroupNorm GPU grad tests | 1 1
  Stateless GPU grad tests | 1 1
  CUDNN BatchNorm | 8 8
  R = RNN | 1 2 3
  R = GRU | 1 2 3
  R = LSTM | 1 2 3
  RNN | 6 20 24 50
  R = RNN, batch_size = 1 | 1 3 4 8
  R = RNN, batch_size = 5 | 1 3 4 8
  R = GRU, batch_size = 1 | 1 3 4 8
  R = GRU, batch_size = 5 | 1 3 4 8
  R = LSTM, batch_size = 1 | 1 4 4 9
  R = LSTM, batch_size = 5 | 1 4 4 9

"Flux master + CUDA master"

(cutest) pkg> add CUDA#tb/cudnn_scalar_type
  Updating git-repo `https://github.com/JuliaGPU/CUDA.jl.git`
  Resolving package versions...
  Updating `E:\Programs\julia\.julia\dev\cutest\Project.toml`
  [052768ef] ~ CUDA v1.3.0 `https://github.com/JuliaGPU/CUDA.jl.git#master` ⇒ v1.3.0 `https://github.com/JuliaGPU/CUDA.jl.git#tb/cudnn_scalar_type`
  Updating `E:\Programs\julia\.julia\dev\cutest\Manifest.toml`
  [052768ef] ~ CUDA v1.3.0 `https://github.com/JuliaGPU/CUDA.jl.git#master` ⇒ v1.3.0 `https://github.com/JuliaGPU/CUDA.jl.git#tb/cudnn_scalar_type`
There is, but we're working to fix that :-) And it's not related to the issue you were seeing here.
That's up to Flux, but I think that would make sense. On the CUDA.jl side, we first need to expose the necessary functionality, and we're almost there.
In the fp16 PR on Flux, we are introducing the `paramtype` function.
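For readers following along, a minimal sketch of what such a conversion helper could look like; this is illustrative only, not Flux's actual implementation (the name `my_paramtype` and the use of `fmap` here are assumptions):

```julia
using Flux  # Flux re-exports `fmap` from Functors.jl

# Hypothetical helper: recursively convert every floating-point array
# in a model to element type T, leaving all other fields untouched.
my_paramtype(::Type{T}, m) where {T<:AbstractFloat} =
    fmap(x -> x isa AbstractArray{<:AbstractFloat} ? T.(x) : x, m)

# Usage: a Conv layer with Float16 parameters, moved to the GPU.
m = my_paramtype(Float16, Conv((3, 3), 1 => 16, relu)) |> gpu
```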
Great! I already did a quick benchmark with `Flux.paramtype(Float16, Conv(...)) |> gpu` and saw a nice roughly 3× speedup on the forward pass (with correct outputs this time :) ). I'm thinking about things like what @maleadt mentioned in the linked issue, where the CUDA docs claim there is a soon-to-be-removed "wronger" way of doing it and a "righter" way of doing it.
I guess this has to do with things like fp16 multiplication with fp32 accumulation (is this what automatic mixed precision refers to)?
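As background on the PR title: the CUDNN developer guide specifies that the alpha/beta scaling parameters must be passed as single-precision floats when the tensor data type is half or single precision, and as doubles for double precision, rather than simply matching the tensor's element type. A sketch of that mapping in Julia (the helper names here are assumptions, not necessarily what this PR adds to CUDA.jl):

```julia
# Illustrative sketch of the CUDNN scaling-parameter rule: alpha/beta
# are Float32 for Float16 and Float32 tensors, Float64 for Float64 tensors.
scalingtype(::Type{Float16}) = Float32
scalingtype(::Type{Float32}) = Float32
scalingtype(::Type{Float64}) = Float64

# Wrap a scaling value in a Ref of the correct type, ready to pass
# by reference to a CUDNN call operating on eltype-T tensors.
scalingparam(::Type{T}, val) where {T} = Ref(convert(scalingtype(T), val))

alpha = scalingparam(Float16, 1)  # Ref{Float32} holding 1.0f0
beta  = scalingparam(Float16, 0)  # Ref{Float32} holding 0.0f0
```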
Interesting that removing the image from the GitLab CI runs does not get CUDNN by default; can I manually set a flag to ensure that it's downloaded?
If you remove the image flag, it should use the default image, which is always one with CUDNN (as configured on the runners). But I see the issue: some of those still use CUDNN 7. I'll fix that.
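For reference, a hypothetical `.gitlab-ci.yml` fragment showing the difference; the job names and image tag below are made up, and only the presence or absence of the `image:` key matters:

```yaml
# Pinning an explicit image (hypothetical tag); this may ship an older CUDNN:
test:explicit-image:
  image: juliagpu/cuda:10.1-cudnn7
  script:
    - julia --project -e 'using Pkg; Pkg.test()'

# Omitting `image:` falls back to the runner's default image,
# which (as configured on these runners) always includes CUDNN:
test:default-image:
  script:
    - julia --project -e 'using Pkg; Pkg.test()'
```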
Fixes #92
cc @DrChainsaw @hgt312
@DhairyaLGandhi Could you test this with Flux? There are very few CUDNN tests here, and we're close to release. FWIW, Flux's tests still pass, or at least throw the same errors as they did before this PR (UndefVarError: ALL_LOSSES not defined).