
Use the correct CUDNN scaling parameter type. #454

Merged: 1 commit into master from tb/cudnn_scalar_type on Oct 29, 2020

Conversation

maleadt (Member) commented Sep 28, 2020

Fixes #92

cc @DrChainsaw @hgt312

@DhairyaLGandhi Could you test this with Flux? There are very few CUDNN tests here, and we're close to release. FWIW, Flux's tests still pass, or at least throw the same errors as they did before this PR (UndefVarError: ALL_LOSSES not defined).
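
(For context: the cuDNN documentation specifies that the alpha/beta scaling factors must be passed as host pointers to float for half- and single-precision data, and to double for double-precision data; passing Float16 scalars instead is what made the Float16 convolutions in #92 return zeros. A minimal Julia sketch of that rule follows; the helper names are illustrative, not the actual CUDA.jl internals.)

# Minimal sketch of the rule this PR implements; the helper names are
# illustrative, not the actual CUDA.jl internals. cuDNN expects the
# alpha/beta scaling factors as host references to Cdouble for Float64
# data and to Cfloat for everything else, including Float16.
scaling_type(::Type{Float64}) = Cdouble
scaling_type(::Type{<:Union{Float16,Float32}}) = Cfloat

# Wrap a scalar in a Ref of the correct element type, ready to be passed
# as the alpha/beta argument of a cuDNN call.
scaling_param(::Type{T}, val) where {T} = Ref(convert(scaling_type(T), val))

scaling_param(Float16, 1)   # Base.RefValue{Float32}(1.0f0), not a Float16 ref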

maleadt added the labels bugfix (This gets something working again.) and cuda libraries (Stuff about CUDA library wrappers.) on Sep 28, 2020
DhairyaLGandhi (Member) commented:

ALL_LOSSES was removed some time back and that code was moved to a sub-package. Is it called directly when testing CUDA?

maleadt (Member, Author) commented Sep 28, 2020

Ah, I see what's up with that: I'm running the CUDA tests in isolation, but it appears the Flux tests are stateful. Anyway, it's unrelated to this PR.

DhairyaLGandhi (Member) commented:

Started the tests now.

codecov bot commented Sep 28, 2020

Codecov Report

Merging #454 into master will decrease coverage by 0.01%.
The diff coverage is 57.14%.

@@            Coverage Diff             @@
##           master     #454      +/-   ##
==========================================
- Coverage   80.77%   80.75%   -0.02%     
==========================================
  Files         166      166              
  Lines        9086     9090       +4     
==========================================
+ Hits         7339     7341       +2     
- Misses       1747     1749       +2     
Impacted Files Coverage Δ
lib/cudnn/activation.jl 100.00% <ø> (ø)
lib/cudnn/conv.jl 50.00% <ø> (ø)
lib/cudnn/pooling.jl 94.73% <ø> (ø)
lib/cudnn/softmax.jl 100.00% <ø> (ø)
lib/cudnn/tensor.jl 59.52% <ø> (ø)
lib/cudnn/util.jl 50.00% <50.00%> (ø)
lib/cudnn/batchnorm.jl 36.95% <66.66%> (ø)

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b529985...055c680.

DhairyaLGandhi (Member) commented:

I see failures in the curnn tests as well as in some movement tests:

https://gitlab.com/JuliaGPU/Flux.jl/-/pipelines/195453079

maleadt (Member, Author) commented Sep 28, 2020

https://gitlab.com/JuliaGPU/Flux.jl/-/pipelines/195453079

That's not going to show much: neither Julia 1.3 nor nightly is supported by CUDA.jl.

Strange that you see failures; everything passes locally. Could you post some details?

DhairyaLGandhi (Member) commented Sep 28, 2020

Ah, my bad, I forgot to push the Project.toml and the GitLab config.

It's on this branch https://github.com/FluxML/Flux.jl/tree/test_cudnn

https://gitlab.com/JuliaGPU/Flux.jl/-/pipelines/195461096

maleadt (Member, Author) commented Sep 28, 2020

Using Docker executor with image nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04 ...

You shouldn't be running with an explicit image tag. CUDNN 7 is unsupported now.

maleadt (Member, Author) commented Sep 28, 2020

Also, there are lots of Flux failures on CUDA.jl#master: #455 (comment). Didn't you recently validate the master branch with Flux, or did I misunderstand?

DrChainsaw commented:

Wow, I had kind of given up on this issue when I saw another issue about Float16 support that made it seem like there was some fundamental difference in how Julia handles Float16.

Sorry for the stupid question, but would this be enough to get the equivalent of model.use_half_type() in other frameworks, or is that something completely different?

FWIW, I get similar errors when I test Flux master with CUDA master and with this branch. The tests labeled "Conv GPU grad tests" pass.

"Flux master + CUDA master"
Test Summary:                    | Pass  Fail  Error  Broken  Total
CUDA                             |   77    21      1      34    133
  CUDA                           |    9                           9
  onecold gpu                    |    2                           2
  restructure gpu                |    1                           1
  GPU functors                   |    2                           2
  Losses                         |   29            1             30
    GPU grad tests               |   24            1             25
  Basic GPU Movement             |    2                           2
  Conv GPU grad tests            |    6                    1      7
  Pooling GPU grad tests         |    2                           2
  AdaptivePooling GPU grad tests |    2                           2
  Dropout GPU grad tests         |    1                    1      2
  Normalising GPU grad tests     |    3     1                     4
    LayerNorm GPU grad test      |    1     1                     2
    BatchNorm GPU grad test      |    2                           2
  InstanceNorm GPU grad tests    |                         1      1
  GroupNorm GPU grad tests       |                         1      1
  Stateless GPU grad tests       |    1                           1
  CUDNN BatchNorm                |    8                           8
  R = RNN                        |    1                    2      3
  R = GRU                        |    1                    2      3
  R = LSTM                       |    1                    2      3
  RNN                            |    6    20             24     50
    R = RNN, batch_size = 1      |    1     3              4      8
    R = RNN, batch_size = 5      |    1     3              4      8
    R = GRU, batch_size = 1      |    1     3              4      8
    R = GRU, batch_size = 5      |    1     3              4      8
    R = LSTM, batch_size = 1     |    1     4              4      9
    R = LSTM, batch_size = 5     |    1     4              4      9
"Flux master + CUDA master"
(cutest) pkg> add CUDA#tb/cudnn_scalar_type
   Updating git-repo `https://github.com/JuliaGPU/CUDA.jl.git`
  Resolving package versions...
Updating `E:\Programs\julia\.julia\dev\cutest\Project.toml`
  [052768ef] ~ CUDA v1.3.0 `https://github.com/JuliaGPU/CUDA.jl.git#master` ⇒ v1.3.0 `https://github.com/JuliaGPU/CUDA.jl.git#tb/cudnn_scalar_type`
Updating `E:\Programs\julia\.julia\dev\cutest\Manifest.toml`
  [052768ef] ~ CUDA v1.3.0 `https://github.com/JuliaGPU/CUDA.jl.git#master` ⇒ v1.3.0 `https://github.com/JuliaGPU/CUDA.jl.git#tb/cudnn_scalar_type`

Test Summary:                    | Pass  Fail  Error  Broken  Total
CUDA                             |   76    21      2      34    133
  CUDA                           |    9                           9
  onecold gpu                    |    2                           2
  restructure gpu                |    1                           1
  GPU functors                   |    2                           2
  Losses                         |   29            1             30
    GPU grad tests               |   24            1             25
  Basic GPU Movement             |    2                           2
  Conv GPU grad tests            |    6                    1      7
  Pooling GPU grad tests         |    2                           2
  AdaptivePooling GPU grad tests |    2                           2
  Dropout GPU grad tests         |    1                    1      2
  Normalising GPU grad tests     |    3     1                     4
    LayerNorm GPU grad test      |    1     1                     2
    BatchNorm GPU grad test      |    2                           2
  InstanceNorm GPU grad tests    |                         1      1
  GroupNorm GPU grad tests       |                         1      1
  Stateless GPU grad tests       |    1                           1
  CUDNN BatchNorm                |    8                           8
  R = RNN                        |    1                    2      3
  R = GRU                        |    1                    2      3
  R = LSTM                       |    1                    2      3
  RNN                            |    6    20             24     50
    R = RNN, batch_size = 1      |    1     3              4      8
    R = RNN, batch_size = 5      |    1     3              4      8
    R = GRU, batch_size = 1      |    1     3              4      8
    R = GRU, batch_size = 5      |    1     3              4      8
    R = LSTM, batch_size = 1     |    1     4              4      9
    R = LSTM, batch_size = 5     |    1     4              4      9
ERROR: LoadError: Some tests did not pass: 76 passed, 21 failed, 2 errored, 34 broken.

maleadt (Member, Author) commented Sep 30, 2020

Wow, I had kind of given up on this issue when I saw another issue about Float16 support that made it seem like there was some fundamental difference in how Julia handles Float16.

There is, but we're working to fix that :-) And it's not related to the issue you were seeing here.

Sorry for the stupid question, but would this be enough to get the equivalent of model.use_half_type() in other frameworks, or is that something completely different?

That's up to Flux, but I think that would make sense. On the CUDA.jl side, we first need to expose the necessary functionality, and we're almost there.

DhairyaLGandhi (Member) commented:

In the fp16 PR on Flux, we are introducing the f16 utility, which would achieve that.
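
(For context: the thread below uses Flux.paramtype directly; an f16 utility would presumably be a thin wrapper over it, in the spirit of Flux's existing f32/f64 helpers. A sketch under that assumption, not the actual code from the Flux fp16 PR:)

using Flux, CUDA

# Hypothetical f16 helper, assuming it mirrors Flux's f32/f64 by converting
# all floating-point parameters of a model to Float16 (this is not the
# actual code from the Flux fp16 PR).
f16(m) = Flux.paramtype(Float16, m)

# Usage, following the pattern from the benchmark below: convert a Conv
# layer to half precision and move it to the GPU.
m = f16(Conv((3, 3), 3 => 16, relu)) |> gpu
x = CuArray(rand(Float16, 32, 32, 3, 4))   # 32x32 input, 3 channels, batch of 4 (WHCN)
y = m(x)                                   # forward pass runs with Float16 storage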

DrChainsaw commented:

Great!

I already did a quick benchmark with Flux.paramtype(Float16, Conv(...)) |> gpu and saw a nice roughly 3x speedup on the forward pass (with correct outputs this time :) ).

I'm thinking about things like what @maleadt mentioned in the linked issue, where the cuDNN docs describe a soon-to-be-removed "wronger" way of doing it and a recommended "righter" way:

Note: CUDNN_DATA_HALF in cudnnSetConvolutionNdDescriptor() with HALF_CONVOLUTION_BWD_FILTER is not recommended as it is known to not be useful for any practical use case for training and will be considered to be blocked in a future cuDNN release. The use of CUDNN_DATA_HALF for input tensors in cudnnSetTensorNdDescriptor() and CUDNN_DATA_FLOAT in cudnnSetConvolutionNdDescriptor() with HALF_CONVOLUTION_BWD_FILTER is recommended and is used with the automatic mixed precision (AMP) training in many well known deep learning frameworks.

I guess this has to do with things like FP16 multiplication with FP32 accumulation (is that what automatic mixed precision refers to)?
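
(For context: "automatic mixed precision" generally means exactly that combination: tensors stored in Float16 while the convolution's internal compute and accumulation run in Float32, which is what the CUDNN_DATA_HALF tensor descriptors plus CUDNN_DATA_FLOAT convolution descriptor in the quote above describe. A small CPU-side Julia illustration of why the accumulator type matters; the values are made up for the example:)

# CPU-side illustration (not cuDNN code): accumulate a dot product of
# Float16 values in an accumulator of type A.
function dot_acc(::Type{A}, x, w) where {A}
    acc = zero(A)
    for (xi, wi) in zip(x, w)
        acc += A(xi) * A(wi)    # multiply and accumulate in precision A
    end
    return acc
end

x = fill(Float16(1f-3), 10_000)
w = fill(Float16(1f-3), 10_000)

dot_acc(Float16, x, w)            # Float16 accumulation stalls well below the true ~0.01
Float16(dot_acc(Float32, x, w))   # Float32 accumulation, stored back as Float16: ≈ 0.01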

DhairyaLGandhi (Member) commented:

Interesting that removing the image from the GitLab CI runs does not get CUDNN by default; can I manually set a flag to ensure that it's downloaded?

maleadt (Member, Author) commented Sep 30, 2020

Interesting that removing the image from the GitLab CI runs does not get CUDNN by default; can I manually set a flag to ensure that it's downloaded?

If you remove the image flag, it should use the default image, which always includes CUDNN (as configured on the runners). But I see the issue: some of those runners still use CUDNN 7. I'll fix that.

@maleadt maleadt merged commit 2f4d71a into master Oct 29, 2020
@maleadt maleadt deleted the tb/cudnn_scalar_type branch October 29, 2020 11:58

Successfully merging this pull request may close these issues.

CUDNN convolution with Float16 always returns zeros