softmax has problem with dim parameter #599
Comments
Hi, I already fixed this bug in the dy/cudnn branch (#523) -- please check cudnn-related issues on that branch as well. There is a way to use cudnn softmax to compute the correct result for dims=2. Just give me a few more days to finish writing the tests before we can merge dy/cudnn.
Not sure how to run NNlib's softmax -- in my version it calls CUDNN if the input is a CuArray.
We can use the cudnn softmax to simulate nnlib softmax for dims 1, (;), and (partially) 2. There is no need for permutedims. Please check out my implementation at https://github.com/denizyuret/CUDA.jl/blob/dy/cudnn/lib/cudnn/nnlib.jl
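For context, the reshape idea can be sketched like this (a minimal sketch only, assuming cuDNN's channel-mode softmax and CUDA.jl's column-major WHCN layout; the helper name is illustrative, not the branch's actual code):

```julia
# Sketch: emulate NNlib's softmax(x; dims) with cuDNN's channel-mode softmax by
# reshaping so the reduced axis lands on the channel axis (the 3rd axis in
# Julia's column-major WHCN order). No permutedims is needed.
# `cudnn_softmax_shape` is an illustrative name, not the dy/cudnn branch's API.
function cudnn_softmax_shape(x::AbstractArray, dims)
    if dims == 1
        (1, 1, size(x, 1), length(x) ÷ size(x, 1))                          # reduce over dim 1
    elseif dims == 2
        (size(x, 1), 1, size(x, 2), length(x) ÷ (size(x, 1) * size(x, 2)))  # reduce over dim 2
    elseif dims isa Colon
        (1, 1, length(x), 1)                                                # reduce over everything
    else
        nothing                                                             # fall back to a generic kernel
    end
end

# e.g. a 1000×256 classifier output with dims=1 maps to a 1×1×1000×256 tensor,
# so CUDNN_SOFTMAX_MODE_CHANNEL reduces over the 1000 classes per batch item.
```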
Could you have a look at my PR #600 (comment)? Thanks.
Thanks for your reply. I created a few test cases, similar to yours. It's based on my branch https://github.com/norci/CUDA.jl/tree/fix_599 (benchmark_softmax.jl, using CUDA, NNlib, BenchmarkTools, and Test). Results: CUDNN is much faster. Run on an RTX 2080 Super with a tensor of size 1000 x 1000 x 128.
BTW, I don't think you used the correct parameters for cudnnSoftmaxForward, so your benchmark result should be wrong too. Note: cudnnSoftmaxForward being slower when dims > 1 is expected, but an optimized kernel from CUDNN should still be faster than a few function calls composed of linear algebra and memory allocation/copy.
@norci, the main difference seems to be the size of the tests we use. By far the most common use of softmax is at the end of a network for classification. I used the typical imagenet size of 1000x256 (1000 for num classes, 256 for batch size). Can you try with this setting? What did you have in mind for a use case with 1000x1000x128?
On Thu, Dec 17, 2020 at 2:16 PM norci wrote:
@denizyuret, thanks for your reply. I created a few test cases, similar to yours. benchmark_softmax.jl:
```julia
using CUDA, NNlib, BenchmarkTools, Test

function _softmax!(y::T, x::T; dims) where {T <: DenseCuArray}
    y .= exp.(x .- maximum(x; dims))
    y ./= sum(y; dims)
end

x = CUDA.rand(1000, 1000, 128);
out = similar(x)

# validate
@testset begin
    for i in 1:ndims(x)
        ref = softmax(Array(x), dims=i) |> cu
        @test softmax!(out, x, dims=i) ≈ ref
        @test _softmax!(out, x, dims=i) ≈ ref
    end
end

macro ctime(ex)
    quote
        GC.gc(true)
        println($(sprint(Base.show_unquoted, ex)))
        @btime (CUDA.@sync $ex; nothing)
        CUDA.@time $ex
    end |> esc
end

# benchmark
@ctime softmax!(out, x, dims=1);
@ctime _softmax!(out, x, dims=1);
@ctime softmax!(out, x, dims=2);
@ctime _softmax!(out, x, dims=2);
@ctime softmax!(out, x, dims=3);
@ctime _softmax!(out, x, dims=3);
```
Results (run on an RTX 2080 Super, with a tensor of size 1000 x 1000 x 128):
- dims=1, cudnn is 10x faster than _softmax!
- dims=2, cudnn is 3.4x faster than _softmax!
- dims=3, cudnn is 4.5x faster than _softmax!
Test log:
Test Summary: | Pass Total
test set | 6 6
softmax!(out, x, dims = 1)
2.585 ms (47 allocations: 1.34 KiB)
0.002601 seconds (56 CPU allocations: 1.688 KiB)
_softmax!(out, x, dims = 1)
25.846 ms (109 allocations: 3.00 KiB)
0.026963 seconds (118 CPU allocations: 3.344 KiB) (2 GPU allocations: 1000.000 KiB, 0.01% gc time)
softmax!(out, x, dims = 2)
8.188 ms (51 allocations: 1.41 KiB)
0.008253 seconds (60 CPU allocations: 1.750 KiB)
_softmax!(out, x, dims = 2)
28.178 ms (109 allocations: 3.00 KiB)
0.028923 seconds (118 CPU allocations: 3.344 KiB) (2 GPU allocations: 1000.000 KiB, 0.01% gc time)
softmax!(out, x, dims = 3)
6.247 ms (49 allocations: 1.41 KiB)
0.006886 seconds (58 CPU allocations: 1.750 KiB)
_softmax!(out, x, dims = 3)
28.057 ms (107 allocations: 2.97 KiB)
0.029762 seconds (134 CPU allocations: 3.688 KiB) (2 GPU allocations: 7.629 MiB, 2.56% gc time of which 98.37% spent allocating)
BTW, I don't think you used the correct parameters for cudnnSoftmaxForward, so your benchmark result should be wrong too.
Note: cudnnSoftmaxForward being slower when dims > 1 is expected. The reason is: when dims=1, the kernel reads and writes contiguous device global memory, so the processor can use the L1/L2 cache to speed up memory operations; when the stride is greater than 1, there is no such speedup. But an optimized kernel from CUDNN should still be faster than a few function calls composed of linear algebra and memory allocation/copy.
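As a small aside on the layout argument: Julia arrays are column-major, so only the first dimension is unit-stride. A quick CPU-side check shows the strides involved (assuming the CuArray uses the same dense column-major layout):

```julia
# Column-major layout: only dims=1 walks memory contiguously.
x = zeros(Float32, 1000, 1000, 128)
strides(x)  # (1, 1000, 1000000): dims=2 jumps 1000 elements, dims=3 jumps 1_000_000
```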
Do you have something more specific? What function call, what parameter?
:) I have been working with CUDNN for a while now and I do not share your optimism. The library is full of bugs and inefficiencies and has been since the beginning. Currently there are many cases where simple Knet/CUDA code is faster than the "optimized" cudnn code; for example, try:
Reasons I can think of for the differences in our test results, other than array size:
This comes up at the end of segmentation models.
@denizyuret, code / results:
- 2D Tensor (dims = 1, dims = 2): the ratio depends on the size of the 2D tensor.
- 3D Tensor: CUDNN is always faster than Julia.
Conclusion: we can create a function for the 2D tensor, dims = 2 case, in order to select the best algorithm. Should we file a bug for CUDNN?
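Reading that conclusion as code, a selector could look roughly like this (a sketch only; the threshold and the function name are placeholders, not measured values or branch code):

```julia
using CUDA, NNlib

# Hypothetical dispatcher for the 2D, dims=2 case: use cuDNN only where it wins,
# otherwise use `_softmax!`, the broadcast fallback from the benchmark script above.
# The cutoff below is a placeholder, not a benchmarked number.
const CUDNN_WINS_ABOVE = 1_000_000   # placeholder threshold on total element count

function softmax_dims2!(y::T, x::T) where {T <: DenseCuArray}
    if length(x) >= CUDNN_WINS_ABOVE
        softmax!(y, x; dims = 2)      # NNlib softmax! on CuArrays, i.e. the cuDNN path
    else
        _softmax!(y, x; dims = 2)     # plain-Julia fallback defined earlier in the thread
    end
end
```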
BTW, I'm not able to use Knet:

```
julia> using Knet
ERROR: InitError:
Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x7ffa334cd2d0 -- strlen at C:\Windows\System32\msvcrt.dll (unknown line)
```

and TOT CUDA is not usable either:

```
julia> using CUDA
[ Info: Precompiling CUDA [052768ef-5323-5732-b1bb-66c8b64840ba]
ERROR: LoadError: ArgumentError: Package GPUCompiler [61eb1bfa-7361-4325-ad38-22787b887f55] is required but does not seem to be installed:
 - Run `Pkg.instantiate()` to install all recorded dependencies.

(CUDA) pkg> up
  Updating registry at `C:\Users\zhexu\.julia\registries\General`
ERROR: expected package `LazyArtifacts [4af54fe1]` to be registered
```
CUDA#master requires Julia 1.6.
And the compatible Knet is Knet#dy/cudnn (with Julia 1.6).
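For reference, picking up those branches would look roughly like this (assuming a Julia 1.6 environment and the branch names mentioned above):

```
(@v1.6) pkg> add CUDA#master
(@v1.6) pkg> add Knet#dy/cudnn
```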
Thanks for the experiments, I will repeat them on my platform and integrate your changes into CUDA#dy/cudnn.
@norci I generalized your dims trick to multiple (contiguous) dimensions and added an exception to use the backup implementation in the region where cudnn is slow.
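A rough sketch of what that generalization could look like (illustrative only; the function name, the contiguity check, and the slow-region cutoff below are stand-ins for the actual PR code):

```julia
# Sketch: collapse the array around a contiguous block of softmax dims so the
# block lands on cuDNN's channel axis; return `nothing` to request the fallback.
# The "slow region" cutoff here is a hypothetical placeholder.
function softmax_shape(x::AbstractArray, dims)
    dims isa Colon && return (1, 1, length(x), 1)
    d = sort(collect(dims isa Integer ? (dims,) : dims))
    d == collect(first(d):last(d)) || return nothing     # only contiguous dims map cleanly
    pre  = prod(size(x)[1:first(d)-1]; init = 1)          # dims before the block
    mid  = prod(size(x)[first(d):last(d)])                 # the block being reduced
    post = prod(size(x)[last(d)+1:end]; init = 1)          # dims after the block
    if pre > 1 && mid < 64                                 # placeholder for "cudnn is slow here"
        return nothing                                     # use the backup implementation
    end
    return (pre, 1, mid, post)                             # cuDNN sees this as (W, H, C, N)
end
```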
I see. The design of …
But I have a concern about the algorithm selector: maybe it's not optimal. So I benchmarked it with more sizes and used a decision tree to get the optimal choice. Shall we use this tree to select the best algo?
Details:
- code
- data sizes
- decision tree as algo selector (see the last part of the code)
- features (FIXME: are these features enough?)
- labels
- results
FIXME: do we need to prune the tree?
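If the selector were driven by data rather than a hand-written rule, one way to sketch it is with DecisionTree.jl (all feature rows, labels, and the query below are placeholders, not the benchmark data from this thread):

```julia
using DecisionTree

# Placeholder training set: one row per benchmarked configuration
# (size along dims 1-3, softmax dims); the label says which implementation won.
features = Float64[1000 1000 128 1;
                   1000  256   1 1;
                   1000 1000 128 2;
                    100  100   1 2]
labels = ["cudnn", "cudnn", "cudnn", "fallback"]

tree = build_tree(labels, features)            # fit a classification tree
print_tree(tree, 5)                            # inspect the learned splits
apply_tree(tree, Float64[1000, 512, 1, 2])     # predict the algo for a new shape
```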
@norci thanks for the experiments. They show the current hack is definitely not optimal. However, I am not sure it is worth optimizing at the expense of complicating the code (I wasn't even sure adding the current hack was a good idea), for a couple of reasons:
All in all I favor waiting for nvidia to do something, but not very strongly. If you disagree and send me a modification of softmaxdims, I'll put it in the PR.
@denizyuret, I agree with you. I have reported this performance issue to Nvidia; I hope they will fix it.
Describe the bug
softmax returns a wrong result; it seems like it does not use the dims parameter.
To reproduce
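A minimal check along these lines (not the reporter's original script) exercises the reported behaviour by comparing the GPU result with the CPU reference for each dims value:

```julia
# Not the original reproducer: a minimal consistency check of softmax on CuArrays.
using CUDA, NNlib, Test

x = rand(Float32, 4, 5)
for d in 1:2
    @test Array(softmax(cu(x); dims = d)) ≈ softmax(x; dims = d)
end
```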
Manifest.toml
Version info
Details on Julia:
Details on CUDA:
Additional context
See FluxML/Flux.jl#1425 (comment)