Hard error using dice loss #2383

cirobr · 2024-02-27T08:17:59Z

Cheers,

Regardless of the model, data, or any other condition, I’ve never been able to use the built-in Flux.dice_coeff_loss() function. A very long error dump shows up, apparently tied to CUDA and memory usage.

The issue has been confirmed and reproduced on the Discourse forum. For details, please check this link.

mcabbott · 2024-02-29T01:06:44Z

My MWE from the discourse thread is this:

julia> using Flux, CUDA

julia> let x = randn(3,5) |> cu
           y = Flux.onehotbatch("abcab", 'a':'c') |> cu
           Flux.dice_coeff_loss(x, y)  # works forward
       end
1.1841338f0

julia> let x = randn(3,5) |> cu
           y = Flux.onehotbatch("abcab", 'a':'c') |> cu
           gradient(Flux.mse, x, y)  # some gradients work
       end
(Float32[-0.16939788 -0.19461282 … -0.30000073 -0.017194644; 0.07464689 -0.15628384 … -0.17090265 -0.007114268; -0.22359066 -0.06903434 … 0.1566836 -0.022250716], nothing)

julia> let x = randn(3,5) |> cu
           y = Flux.onehotbatch("abcab", 'a':'c') |> cu
           gradient(Flux.dice_coeff_loss, x, y)
       end
ERROR: a exception was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.
...
ERROR: KernelException: exception thrown during kernel execution on device Tesla V100-PCIE-16GB
Stacktrace:
  [1] check_exceptions()
    @ CUDA ~/.julia/packages/CUDA/htRwP/src/compiler/exceptions.jl:34
  [2] device_synchronize(; blocking::Bool, spin::Bool)
    @ CUDA ~/.julia/packages/CUDA/htRwP/lib/cudadrv/synchronization.jl:180

(@v1.10) pkg> st Flux CUDA
Status `~/.julia/environments/v1.10/Project.toml`
  [052768ef] CUDA v5.2.0
  [587475ba] Flux v0.14.11

I don't know if this is the same error as yours, but it's surprising, and is a bug.

What "Run Julia on debug level 2 for device stack traces" means is that starting the REPL with julia -g2 will capture more information, which may help narrow this down. Can you try this, and paste here as much information as possible?
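A minimal sketch of how that could be collected in one go, assuming the MWE above is saved to a script. The file name `dice_debug.jl` is hypothetical, and the `try`/`catch` wrapper is only there so the full error text is printed in one place for pasting:

```julia
# Hypothetical script name: dice_debug.jl
# Launch with device debug info enabled:  julia -g2 dice_debug.jl
using Flux, CUDA

# Confirm the flag took effect (2 is expected when started with -g2)
@info "debug level" level = Base.JLOptions().debug_level

try
    # Environment details worth pasting alongside the stack trace
    CUDA.versioninfo()

    x = randn(3, 5) |> cu
    y = Flux.onehotbatch("abcab", 'a':'c') |> cu
    gradient(Flux.dice_coeff_loss, x, y)
    CUDA.synchronize()  # surface any asynchronous kernel exception here
    @info "gradient completed without error"
catch err
    # Print the full error and host-side backtrace for pasting into the issue
    showerror(stderr, err, catch_backtrace())
end
```

Any device-side stack trace should be printed by the GPU before the host-side error, so it is worth copying the whole terminal output rather than only the final `ERROR:` block.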

ToucheSir · 2024-02-29T20:23:45Z

Can you try pulling y .^ 2 and ŷ .^ 2 in Flux.jl/src/losses/functions.jl (line 519 in 20d516b):

1 - (2 * sum(y .* ŷ) + s) / (sum(y .^ 2) + sum(ŷ .^ 2) + s)

out on their own lines and seeing which one fails?
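One way to run that check without editing Flux itself is to differentiate each sum separately. This is only a sketch: the helper names `cross_term`, `target_sq` and `pred_sq` and the `to_device` fallback are hypothetical and not part of Flux, and the inputs mirror the MWE above.

```julia
using Flux, CUDA

# Fall back to the CPU when no GPU is available, so the sketch still runs
to_device = CUDA.functional() ? cu : identity

x = randn(Float32, 3, 5) |> to_device
y = Flux.onehotbatch("abcab", 'a':'c') |> to_device

# Hypothetical helpers, one per sum in the dice expression
cross_term(ŷ, y) = sum(y .* ŷ)
target_sq(ŷ, y)  = sum(y .^ 2)
pred_sq(ŷ, y)    = sum(ŷ .^ 2)

# Record which term's gradient succeeds and which one throws
results = Dict{Symbol,Bool}()
for f in (cross_term, target_sq, pred_sq)
    try
        gradient(f, x, y)
        CUDA.functional() && CUDA.synchronize()  # force pending kernel errors to surface
        results[nameof(f)] = true
    catch err
        results[nameof(f)] = false
        @warn "gradient failed" term = nameof(f) exception = (err, catch_backtrace())
    end
end
results
```

If the device is left in a bad state after the first failure, later terms may fail for unrelated reasons. In that case it is probably safer to test each term in a fresh Julia session.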

mcabbott added bug gradients cuda labels Feb 29, 2024
