
Unrealistically poor benchmark results #67

Open · dfdx opened this issue Jan 31, 2019 · 2 comments

dfdx commented Jan 31, 2019

While testing my own AD package, I tried to compare its performance with Zygote, but the results turned out to be unexpectedly poor for Zygote.

Setup

Variational autoencoder with all dense layers on the MNIST dataset. Nothing fancy, only built-in functions like matrix multiplication, elementwise operations and aggregation functions. I also defined derivative rules for both Yota (my package) and Zygote. The full source can be found here.
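For context, a minimal sketch of what such a custom rule might look like on the Zygote side (the `logistic` function here is a hypothetical stand-in, not taken from the linked script):

```julia
using Zygote

# Hypothetical scalar activation; a stand-in for the rules defined in the benchmark script.
logistic(x) = 1 / (1 + exp(-x))

# Custom adjoint: return the primal value and a pullback that maps Δ to the input gradient.
Zygote.@adjoint logistic(x) = logistic(x), Δ -> (Δ * logistic(x) * (1 - logistic(x)),)
```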

It's also worth mentioning that with the original cost function Zygote failed with a segfault, so for simplicity I commented out the guilty line for now.

Benchmark results

```
Compiling derivatives using Yota
  0.050908 seconds (819 allocations: 21.679 MiB, 5.64% gc time)
BenchmarkTools.Trial:
  memory estimate:  21.68 MiB
  allocs estimate:  819
  --------------
  minimum time:     34.858 ms (0.00% GC)
  median time:      39.443 ms (6.69% GC)
  mean time:        42.610 ms (6.53% GC)
  maximum time:     155.245 ms (1.73% GC)
  --------------
  samples:          118
  evals/sample:     1
```

```
Compiling derivatives using Zygote
  1.573936 seconds (7.76 M allocations: 381.904 MiB, 35.54% gc time)
BenchmarkTools.Trial:
  memory estimate:  381.90 MiB
  allocs estimate:  7762261
  --------------
  minimum time:     1.144 s (26.51% GC)
  median time:      1.337 s (33.87% GC)
  mean time:        1.301 s (32.50% GC)
  maximum time:     1.386 s (34.81% GC)
  --------------
  samples:          4
  evals/sample:     1
```
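For reference, a rough sketch of how such numbers are typically collected with BenchmarkTools (the actual harness is in the linked script; `vae_cost`, `Ws` and `x` are placeholders):

```julia
using BenchmarkTools
using Zygote

# `vae_cost`, `Ws` and `x` stand in for the real loss and MNIST batch from the linked script.
@time Zygote.gradient(vae_cost, Ws, x)          # first call, includes compilation
@benchmark Zygote.gradient($vae_cost, $Ws, $x)  # steady-state timings, as reported above
```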

I use some aggressive optimizations (e.g. full buffering), so I actually expected a 1.5-3x difference, but not 30x. Note that both libraries give the same results and the code doesn't have dead code branches.

What could go wrong?

  1. I could have made a mistake or misunderstood Zygote's README. Can you spot the problem with the code?
  2. We hit some operation without an efficient derivative, maybe `sum` with keywords or `mean`? (See the sketch after this list.)
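A rough way to isolate point 2 (not from the issue, just a sketch): benchmark the gradients of the reductions on their own and see whether they account for the gap.

```julia
using BenchmarkTools
using Statistics
using Zygote

x = randn(784, 100)  # MNIST-sized batch, just for illustration

# If either of these is unexpectedly slow, the inefficient derivative is likely here.
@btime Zygote.gradient(x -> sum(sum(x; dims=1)), $x)
@btime Zygote.gradient(x -> mean(x), $x)
```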
MikeInnes (Member) commented Feb 1, 2019

Thanks for the detailed issue. Zygote still does dumb things every so often (c.f. segfault); given that this code is pretty straightforward there's probably either some simple typing issue in broadcast or we're missing a derivative, like you suggest. It would be great to have a manifest for that benchmark as I might not look at it immediately, but I'll dig into it sometime soon.

One thing to note is that your Zygote adjoint definitions won't have any effect, since we use a fused forward mode for broadcasts (I assume Yota uses a simpler reverse mode?).

What do you mean by "full buffering"? It doesn't seem that you're compiling a tape in this script, otherwise I'd assume you meant something like allocating array storage up front -- but I could be missing something.

dfdx (Author) commented Feb 2, 2019

> One thing to note is that your Zygote adjoint definitions won't have any effect, since we use a fused forward mode for broadcasts (I assume Yota uses a simpler reverse mode?).

Yep, for broadcasted operations I simply record broadcasted derivatives of scalar functions, e.g. for `sin.(x)` I just record `cos.(x) .* Δ`.
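For illustration, a minimal sketch of that kind of rule (the helper name is made up):

```julia
# Hypothetical reverse-mode broadcast rule: given the incoming adjoint Δ for y = sin.(x),
# the gradient w.r.t. x is just the broadcasted scalar derivative times Δ.
pullback_sin(Δ, x) = cos.(x) .* Δ

x = randn(5)
Δ = ones(5)
pullback_sin(Δ, x)
```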

> What do you mean by "full buffering"? It doesn't seem that you're compiling a tape in this script,

`grad(f, args...)` is a caching layer for `_grad(f, args...)`. Perhaps it's easier just to show the code:

```julia
function grad(f::Function, args...)
    # check if we have already calculated derivatives for this function and arguments
    # of such types and sizes
    cache_key = (f, ([isstruct(arg) ? typeof(arg) : size(arg) for arg in args]...,))
    if haskey(GRAD_CACHE, cache_key)
        # if so, just use cached tape
        tape = GRAD_CACHE[cache_key]
        val = play!(tape, args...)
        return val, GradResult(tape)
    else
        # otherwise, calculate the tape
        val, g = _grad(f, args...)
        # compile it
        compile!(g.tape)
        # and put to the cache
        GRAD_CACHE[cache_key] = g.tape
        return val, g
    end
end
```
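So a repeated call with the same argument types and sizes replays the compiled tape instead of re-tracing. A hypothetical call pattern (the names are placeholders):

```julia
# `loss`, `W` and `x` are placeholders.
val1, g1 = grad(loss, W, x)  # first call: traces, compiles and caches the tape
val2, g2 = grad(loss, W, x)  # same types/sizes: cache hit, replays the compiled tape
```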

> [...] but I'll dig into it sometime soon

No hurry! I think I'll do more benchmarks for Yota on CPU and GPU soon and cross-check them with Zygote to detect any unexpected performance issues.

MikeInnes self-assigned this Mar 26, 2019
bors bot added a commit that referenced this issue Aug 3, 2020
751: Fix FFT type promotions r=CarloLucibello a=wkearn

I ran into an issue with type promotions while using `Conv` layers in Flux when I tried to run model output through FFTs. I tracked it down to the definitions of the adjoints for the various `fft` variants. The 1/N factors needed in the `irfft` adjoints automatically promoted `Float32` arrays to `Float64` arrays, which then caused the error when propagated back through the chain. A minimal example is included below.

This pull request just changes all those multiplications by 1/N into divisions by N, which prevents the promotion. I also added a few tests that assert that the gradients are the right types. The tests with `irfft(x,dims)` throw an error that is not related to the type promotion, so I commented those out.
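The promotion itself is easy to see in isolation (a small illustration, not taken from the PR):

```julia
x = rand(Float32, 16)

eltype((1 / length(x)) .* x)  # Float64: the literal 1/N is a Float64 scalar and promotes the array
eltype(x ./ length(x))        # Float32: dividing by the integer N preserves the element type
```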

```julia
x = randn(Float32,16,1,1)
m = Conv((5,),1=>1,relu,pad=SamePad())
loss(x) = sum(abs2,irfft(rfft(m(x)),16))
gradient(()->loss(x),Flux.params(m))

┌ Warning: Slow fallback implementation invoked for ∇conv_data!  You probably don't want this; check your datatypes.
│   yT = AbstractFloat
│   T1 = AbstractFloat
│   T2 = Float32
└ @ NNlib C:\Users\wkearney\.julia\packages\NNlib\sSn9M\src\conv.jl:206
ERROR: UndefRefError: access to undefined reference
Stacktrace:
 [1] getindex at .\array.jl:745 [inlined]
 [2] #conv_direct!#149(::Float64, ::Bool, ::typeof(NNlib.conv_direct!), ::Array{AbstractFloat,5}, ::Array{AbstractFloat,5}, ::Array{Float32,5}, ::DenseConvDims{3,(5, 1, 1),1,1,(1, 1, 1),(2, 2, 0, 0, 0, 0),(1, 1, 1),false}) at C:\Users\wkearney\.julia\packages\NNlib\sSn9M\src\impl\conv_direct.jl:98
 [3] (::NNlib.var"#kw##conv_direct!")(::NamedTuple{(:alpha, :beta),Tuple{Float64,Bool}}, ::typeof(NNlib.conv_direct!), ::Array{AbstractFloat,5}, ::Array{AbstractFloat,5}, ::Array{Float32,5}, ::DenseConvDims{3,(5, 1, 1),1,1,(1, 1, 1),(2, 2, 0, 0, 0, 0),(1, 1, 1),false}) at .\none:0
 [4] #∇conv_data_direct!#152(::Float64, ::Bool, ::typeof(NNlib.∇conv_data_direct!), ::Array{AbstractFloat,5}, ::Array{AbstractFloat,5}, ::Array{Float32,5}, ::DenseConvDims{3,(5, 1, 1),1,1,(1, 1, 1),(2, 2, 0, 0, 0, 0),(1, 1, 1),false}) at C:\Users\wkearney\.julia\packages\NNlib\sSn9M\src\impl\conv_direct.jl:163
 [5] ∇conv_data_direct! at C:\Users\wkearney\.julia\packages\NNlib\sSn9M\src\impl\conv_direct.jl:158 [inlined]
 [6] #∇conv_data!#106(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(∇conv_data!), ::Array{AbstractFloat,5}, ::Array{AbstractFloat,5}, ::Array{Float32,5}, ::DenseConvDims{3,(5, 1, 1),1,1,(1, 1, 1),(2, 2, 0, 0, 0, 0),(1, 1, 1),false}) at C:\Users\wkearney\.julia\packages\NNlib\sSn9M\src\conv.jl:208
 [7] ∇conv_data!(::Array{AbstractFloat,5}, ::Array{AbstractFloat,5}, ::Array{Float32,5}, ::DenseConvDims{3,(5, 1, 1),1,1,(1, 1, 1),(2, 2, 0, 0, 0, 0),(1, 1, 1),false}) at C:\Users\wkearney\.julia\packages\NNlib\sSn9M\src\conv.jl:206
 [8] #∇conv_data!#67(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(∇conv_data!), ::Array{AbstractFloat,3}, ::Array{AbstractFloat,3}, ::Array{Float32,3}, ::DenseConvDims{1,(5,),1,1,(1,),(2, 2),(1,),false}) at C:\Users\wkearney\.julia\packages\NNlib\sSn9M\src\conv.jl:148
 [9] ∇conv_data! at C:\Users\wkearney\.julia\packages\NNlib\sSn9M\src\conv.jl:148 [inlined]
 [10] #∇conv_data#39 at C:\Users\wkearney\.julia\packages\NNlib\sSn9M\src\conv.jl:103 [inlined]
 [11] ∇conv_data at C:\Users\wkearney\.julia\packages\NNlib\sSn9M\src\conv.jl:101 [inlined]
 [12] #1241 at C:\Users\wkearney\.julia\dev\Zygote\src\lib\nnlib.jl:42 [inlined]
 [13] (::Zygote.var"#4101#back#1243"{Zygote.var"#1241#1242"{Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},Array{Float32,3},Array{Float32,3},DenseConvDims{1,(5,),1,1,(1,),(2, 2),(1,),false}}})(::Array{AbstractFloat,3}) at C:\Users\wkearney\.julia\packages\ZygoteRules\6nssF\src\adjoint.jl:49
 [14] Conv at C:\Users\wkearney\.julia\packages\Flux\IjMZL\src\layers\conv.jl:147 [inlined]
 [15] (::typeof(∂(λ)))(::Array{Float64,3}) at C:\Users\wkearney\.julia\dev\Zygote\src\compiler\interface2.jl:0
 [16] loss at .\REPL[6]:1 [inlined]
 [17] (::typeof(∂(loss)))(::Float32) at C:\Users\wkearney\.julia\dev\Zygote\src\compiler\interface2.jl:0
 [18] #7 at .\REPL[8]:1 [inlined]
 [19] (::typeof(∂(#7)))(::Float32) at C:\Users\wkearney\.julia\dev\Zygote\src\compiler\interface2.jl:0
 [20] (::Zygote.var"#56#57"{Params,Zygote.Context,typeof(∂(#7))})(::Float32) at C:\Users\wkearney\.julia\dev\Zygote\src\compiler\interface.jl:177
 [21] gradient(::Function, ::Params) at C:\Users\wkearney\.julia\dev\Zygote\src\compiler\interface.jl:54
```

Co-authored-by: William Kearney <William.Kearney.ctr@nrlssc.navy.mil>