
Unrealistically poor benchmark results #67

Open · dfdx opened this issue Jan 31, 2019 · 2 comments

dfdx commented Jan 31, 2019

While testing my own AD package, I tried to compare its performance with Zygote, but the results turned out to be unexpectedly poor for Zygote.

Setup

Variational autoencoder with all dense layers on the MNIST dataset. Nothing fancy, only built-in functions like matrix multiplication, elementwise operations and aggregation functions. I also defined derivative rules for both Yota (my package) and Zygote. The full source can be found here.
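For context, a minimal sketch of what such a custom rule might look like on the Zygote side (the `logistic` function here is a hypothetical stand-in, not taken from the linked script):

```julia
using Zygote

# Hypothetical scalar activation; a stand-in for the rules defined in the benchmark script.
logistic(x) = 1 / (1 + exp(-x))

# Custom adjoint: return the primal value and a pullback that maps Δ to the input gradient.
Zygote.@adjoint logistic(x) = logistic(x), Δ -> (Δ * logistic(x) * (1 - logistic(x)),)
```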

It's also worth mentioning that with the original cost function Zygote failed with a segfault, so for simplicity I commented out the guilty line for now.

Benchmark results

```
Compiling derivatives using Yota
  0.050908 seconds (819 allocations: 21.679 MiB, 5.64% gc time)
BenchmarkTools.Trial:
  memory estimate:  21.68 MiB
  allocs estimate:  819
  --------------
  minimum time:     34.858 ms (0.00% GC)
  median time:      39.443 ms (6.69% GC)
  mean time:        42.610 ms (6.53% GC)
  maximum time:     155.245 ms (1.73% GC)
  --------------
  samples:          118
  evals/sample:     1
```

```
Compiling derivatives using Zygote
  1.573936 seconds (7.76 M allocations: 381.904 MiB, 35.54% gc time)
BenchmarkTools.Trial:
  memory estimate:  381.90 MiB
  allocs estimate:  7762261
  --------------
  minimum time:     1.144 s (26.51% GC)
  median time:      1.337 s (33.87% GC)
  mean time:        1.301 s (32.50% GC)
  maximum time:     1.386 s (34.81% GC)
  --------------
  samples:          4
  evals/sample:     1
```
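For reference, a rough sketch of how such numbers are typically collected with BenchmarkTools (the actual harness is in the linked script; `vae_cost`, `Ws` and `x` are placeholders):

```julia
using BenchmarkTools
using Zygote

# `vae_cost`, `Ws` and `x` stand in for the real loss and MNIST batch from the linked script.
@time Zygote.gradient(vae_cost, Ws, x)          # first call, includes compilation
@benchmark Zygote.gradient($vae_cost, $Ws, $x)  # steady-state timings, as reported above
```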

I use some aggressive optimizations (e.g. full buffering), so I actually expected a 1.5-3x difference, but not 30x. Note that both libraries give the same results and the code doesn't have dead code branches.

What could go wrong?

  1. I could have made a mistake or misunderstood Zygote's README. Can you spot the problem with the code?
  2. We hit some operation without an efficient derivative, maybe `sum` with keywords or `mean`? (See the sketch after this list.)
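A rough way to isolate point 2 (not from the issue, just a sketch): benchmark the gradients of the reductions on their own and see whether they account for the gap.

```julia
using BenchmarkTools
using Statistics
using Zygote

x = randn(784, 100)  # MNIST-sized batch, just for illustration

# If either of these is unexpectedly slow, the inefficient derivative is likely here.
@btime Zygote.gradient(x -> sum(sum(x; dims=1)), $x)
@btime Zygote.gradient(x -> mean(x), $x)
```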
MikeInnes (Member) commented Feb 1, 2019

Thanks for the detailed issue. Zygote still does dumb things every so often (c.f. segfault); given that this code is pretty straightforward there's probably either some simple typing issue in broadcast or we're missing a derivative, like you suggest. It would be great to have a manifest for that benchmark as I might not look at it immediately, but I'll dig into it sometime soon.

One thing to note is that your Zygote adjoint definitions won't have any effect, since we use a fused forward mode for broadcasts (I assume Yota uses a simpler reverse mode?).

What do you mean by "full buffering"? It doesn't seem that you're compiling a tape in this script, otherwise I'd assume you meant something like allocating array storage up front -- but I could be missing something.

dfdx (Author) commented Feb 2, 2019

> One thing to note is that your Zygote adjoint definitions won't have any effect, since we use a fused forward mode for broadcasts (I assume Yota uses a simpler reverse mode?).

Yep, for broadcasted operations I simply record broadcasted derivatives of scalar functions, e.g. for `sin.(x)` I just record `cos.(x) .* Δ`.
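For illustration, a minimal sketch of that kind of rule (the helper name is made up):

```julia
# Hypothetical reverse-mode broadcast rule: given the incoming adjoint Δ for y = sin.(x),
# the gradient w.r.t. x is just the broadcasted scalar derivative times Δ.
pullback_sin(Δ, x) = cos.(x) .* Δ

x = randn(5)
Δ = ones(5)
pullback_sin(Δ, x)
```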

> What do you mean by "full buffering"? It doesn't seem that you're compiling a tape in this script,

`grad(f, args...)` is a caching layer for `_grad(f, args...)`. Perhaps it's easier just to show the code:

```julia
function grad(f::Function, args...)
    # check if we have already calculated derivatives for this function and arguments
    # of such types and sizes
    cache_key = (f, ([isstruct(arg) ? typeof(arg) : size(arg) for arg in args]...,))
    if haskey(GRAD_CACHE, cache_key)
        # if so, just use cached tape
        tape = GRAD_CACHE[cache_key]
        val = play!(tape, args...)
        return val, GradResult(tape)
    else
        # otherwise, calculate the tape
        val, g = _grad(f, args...)
        # compile it
        compile!(g.tape)
        # and put to the cache
        GRAD_CACHE[cache_key] = g.tape
        return val, g
    end
end
```
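So a repeated call with the same argument types and sizes replays the compiled tape instead of re-tracing. A hypothetical call pattern (the names are placeholders):

```julia
# `loss`, `W` and `x` are placeholders.
val1, g1 = grad(loss, W, x)  # first call: traces, compiles and caches the tape
val2, g2 = grad(loss, W, x)  # same types/sizes: cache hit, replays the compiled tape
```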

> [...] but I'll dig into it sometime soon

No hurry! I think I'll do more benchmarks for Yota on CPU and GPU soon and cross-check them with Zygote to detect any unexpected performance issues.

MikeInnes self-assigned this Mar 26, 2019
bors bot added a commit that referenced this issue Aug 3, 2020
751: Fix FFT type promotions r=CarloLucibello a=wkearn

I ran into an issue with type promotions while using `Conv` layers in Flux when I tried to run model output through FFTs. I tracked it down to the definitions of the adjoints for the various `fft` variants. The 1/N factors needed in the `irfft` adjoints automatically promoted `Float32` arrays to `Float64` arrays, which then caused the error when propagated back through the chain. A minimal example is included below.

This pull request just changes all those multiplications by 1/N into divisions by N, which prevents the promotion. I also added a few tests that assert that the gradients are the right types. The tests with `irfft(x,dims)` throw an error that is not related to the type promotion, so I commented those out.
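The promotion itself is easy to see in isolation (a small illustration, not taken from the PR):

```julia
x = rand(Float32, 16)

eltype((1 / length(x)) .* x)  # Float64: the literal 1/N is a Float64 scalar and promotes the array
eltype(x ./ length(x))        # Float32: dividing by the integer N preserves the element type
```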

```julia
x = randn(Float32,16,1,1)
m = Conv((5,),1=>1,relu,pad=SamePad())
loss(x) = sum(abs2,irfft(rfft(m(x)),16))
gradient(()->loss(x),Flux.params(m))

┌ Warning: Slow fallback implementation invoked for ∇conv_data!  You probably don't want this; check your datatypes.
│   yT = AbstractFloat
│   T1 = AbstractFloat
│   T2 = Float32
└ @ NNlib C:\Users\wkearney\.julia\packages\NNlib\sSn9M\src\conv.jl:206
ERROR: UndefRefError: access to undefined reference
Stacktrace:
 [1] getindex at .\array.jl:745 [inlined]
 [2] #conv_direct!#149(::Float64, ::Bool, ::typeof(NNlib.conv_direct!), ::Array{AbstractFloat,5}, ::Array{AbstractFloat,5}, ::Array{Float32,5}, ::DenseConvDims{3,(5, 1, 1),1,1,(1, 1, 1),(2, 2, 0, 0, 0, 0),(1, 1, 1),false}) at C:\Users\wkearney\.julia\packages\NNlib\sSn9M\src\impl\conv_direct.jl:98
 [3] (::NNlib.var"#kw##conv_direct!")(::NamedTuple{(:alpha, :beta),Tuple{Float64,Bool}}, ::typeof(NNlib.conv_direct!), ::Array{AbstractFloat,5}, ::Array{AbstractFloat,5}, ::Array{Float32,5}, ::DenseConvDims{3,(5, 1, 1),1,1,(1, 1, 1),(2, 2, 0, 0, 0, 0),(1, 1, 1),false}) at .\none:0
 [4] #∇conv_data_direct!#152(::Float64, ::Bool, ::typeof(NNlib.∇conv_data_direct!), ::Array{AbstractFloat,5}, ::Array{AbstractFloat,5}, ::Array{Float32,5}, ::DenseConvDims{3,(5, 1, 1),1,1,(1, 1, 1),(2, 2, 0, 0, 0, 0),(1, 1, 1),false}) at C:\Users\wkearney\.julia\packages\NNlib\sSn9M\src\impl\conv_direct.jl:163
 [5] ∇conv_data_direct! at C:\Users\wkearney\.julia\packages\NNlib\sSn9M\src\impl\conv_direct.jl:158 [inlined]
 [6] #∇conv_data!#106(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(∇conv_data!), ::Array{AbstractFloat,5}, ::Array{AbstractFloat,5}, ::Array{Float32,5}, ::DenseConvDims{3,(5, 1, 1),1,1,(1, 1, 1),(2, 2, 0, 0, 0, 0),(1, 1, 1),false}) at C:\Users\wkearney\.julia\packages\NNlib\sSn9M\src\conv.jl:208
 [7] ∇conv_data!(::Array{AbstractFloat,5}, ::Array{AbstractFloat,5}, ::Array{Float32,5}, ::DenseConvDims{3,(5, 1, 1),1,1,(1, 1, 1),(2, 2, 0, 0, 0, 0),(1, 1, 1),false}) at C:\Users\wkearney\.julia\packages\NNlib\sSn9M\src\conv.jl:206
 [8] #∇conv_data!#67(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(∇conv_data!), ::Array{AbstractFloat,3}, ::Array{AbstractFloat,3}, ::Array{Float32,3}, ::DenseConvDims{1,(5,),1,1,(1,),(2, 2),(1,),false}) at C:\Users\wkearney\.julia\packages\NNlib\sSn9M\src\conv.jl:148
 [9] ∇conv_data! at C:\Users\wkearney\.julia\packages\NNlib\sSn9M\src\conv.jl:148 [inlined]
 [10] #∇conv_data#39 at C:\Users\wkearney\.julia\packages\NNlib\sSn9M\src\conv.jl:103 [inlined]
 [11] ∇conv_data at C:\Users\wkearney\.julia\packages\NNlib\sSn9M\src\conv.jl:101 [inlined]
 [12] #1241 at C:\Users\wkearney\.julia\dev\Zygote\src\lib\nnlib.jl:42 [inlined]
 [13] (::Zygote.var"#4101#back#1243"{Zygote.var"#1241#1242"{Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},Array{Float32,3},Array{Float32,3},DenseConvDims{1,(5,),1,1,(1,),(2, 2),(1,),false}}})(::Array{AbstractFloat,3}) at C:\Users\wkearney\.julia\packages\ZygoteRules\6nssF\src\adjoint.jl:49
 [14] Conv at C:\Users\wkearney\.julia\packages\Flux\IjMZL\src\layers\conv.jl:147 [inlined]
 [15] (::typeof(∂(λ)))(::Array{Float64,3}) at C:\Users\wkearney\.julia\dev\Zygote\src\compiler\interface2.jl:0
 [16] loss at .\REPL[6]:1 [inlined]
 [17] (::typeof(∂(loss)))(::Float32) at C:\Users\wkearney\.julia\dev\Zygote\src\compiler\interface2.jl:0
 [18] #7 at .\REPL[8]:1 [inlined]
 [19] (::typeof(∂(#7)))(::Float32) at C:\Users\wkearney\.julia\dev\Zygote\src\compiler\interface2.jl:0
 [20] (::Zygote.var"#56#57"{Params,Zygote.Context,typeof(∂(#7))})(::Float32) at C:\Users\wkearney\.julia\dev\Zygote\src\compiler\interface.jl:177
 [21] gradient(::Function, ::Params) at C:\Users\wkearney\.julia\dev\Zygote\src\compiler\interface.jl:54
```

Co-authored-by: William Kearney <William.Kearney.ctr@nrlssc.navy.mil>