Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backward performance of tr(x1 * x2) #28

Closed
Roger-luo opened this issue Oct 20, 2018 · 4 comments
Closed

Backward performance of tr(x1 * x2) #28

Roger-luo opened this issue Oct 20, 2018 · 4 comments

Comments

@Roger-luo
Copy link
Contributor

@Roger-luo Roger-luo commented Oct 20, 2018

Hi Mike,

I tried to implement a very straight forward AD (YAAD.jl, only took me a few hours to write the core) for fun (writing a blog post about how to implement a AD in a day with Julia), but it somehow have a much better performance (than Zygote) on this expression:

import Zygote: @grad
import Zygote
using BenchmarkTools

Zygote.@grad LinearAlgebra.tr(x) = LinearAlgebra.tr(x), Δ->* Matrix(I, size(x)), )

function bench_tr_mul_zygote(x1, x2)
    g = Zygote.gradient(()->tr(x1 * x2), Zygote.Params([x1, x2]))
    g[x1], g[x2]
end

xv, yv = rand(30, 30), rand(30, 30)
@benchmark bench_tr_mul_zygote(xv, yv)

BTW, it is just similar to AutoGrad.jl

julia> @benchmark bench_tr_mul_autograd(autograd_x, autograd_y)
BenchmarkTools.Trial:
  memory estimate:  33.20 KiB
  allocs estimate:  82
  --------------
  minimum time:     50.218 μs (0.00% GC)
  median time:      62.364 μs (0.00% GC)
  mean time:        90.422 μs (9.86% GC)
  maximum time:     55.386 ms (99.86% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark bench_tr_mul_yaad(yaad_x, yaad_y)
BenchmarkTools.Trial:
  memory estimate:  51.50 KiB
  allocs estimate:  16
  --------------
  minimum time:     10.387 μs (0.00% GC)
  median time:      13.429 μs (0.00% GC)
  mean time:        24.273 μs (45.13% GC)
  maximum time:     55.963 ms (99.96% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark bench_tr_mul_zygote(xv, yv)
BenchmarkTools.Trial:
  memory estimate:  32.45 KiB
  allocs estimate:  60
  --------------
  minimum time:     56.645 μs (0.00% GC)
  median time:      61.555 μs (0.00% GC)
  mean time:        73.006 μs (11.52% GC)
  maximum time:     51.025 ms (99.72% GC)
  --------------
  samples:          10000
  evals/sample:     1

The benchmark codes are in https://github.com/Roger-luo/YAAD.jl/blob/master/benchmark/tr_mul.jl

I thought I won't be faster than Zygote, cuz what I did was just dispatching functions based on types (it is not source to source). Is there anything I did wrong with benchmarking Zygote? Or there is an issue for the compiler? Or there is some problem with my machine (someone can help reproduce it? maybe?)

I did compare the gradient, and I implemented a gradcheck which will compare the Jacobian of each function, and my implementation has the same result with AutoGrad.jl and Zygote.jl, thus I guess there is some issue here, but I tried to read Zygote and had no clue. I'm not sure this is related to control flow? but I didn't use control flow..

version info (I build it from source):

julia> versioninfo()
Julia Version 1.0.1
Commit 0d713926f8 (2018-09-29 19:05 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.0.0)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libimf
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

Bests,
Roger

@Roger-luo
Copy link
Contributor Author

@Roger-luo Roger-luo commented Oct 22, 2018

Suggested by @Keno , I benchmarked the functional interface:

bench_tr_mul_zygote_functional(x1, x2) = Zygote.gradient((x1, x2)->tr(x1 * x2), x1, x2)
BenchmarkTools.Trial:
  memory estimate:  29.98 KiB
  allocs estimate:  10
  --------------
  minimum time:     42.527 μs (0.00% GC)
  median time:      46.640 μs (0.00% GC)
  mean time:        56.996 μs (15.31% GC)
  maximum time:     51.718 ms (99.90% GC)
  --------------
  samples:          10000
  evals/sample:     1

Also for reference, I tested manual differentiation:

function bench_tr_mul_base(x1, x2)
    z1 = x1 * x2
    z2 = tr(z1)

    grad_z1 = Matrix{eltype(z1)}(I, size(z1))
    grad_z1 * transpose(x2), transpose(x1) * grad_z1
end

I guess this could be the baseline

julia> @benchmark bench_tr_mul_base(xv, yv)
BenchmarkTools.Trial:
  memory estimate:  28.78 KiB
  allocs estimate:  5
  --------------
  minimum time:     6.413 μs (0.00% GC)
  median time:      8.201 μs (0.00% GC)
  mean time:        12.215 μs (31.57% GC)
  maximum time:     11.012 ms (99.87% GC)
  --------------
  samples:          10000
  evals/sample:     5
@Roger-luo
Copy link
Contributor Author

@Roger-luo Roger-luo commented Dec 5, 2018

Update of the benchmark!!! It is the same with manual gradients now!!!

import Zygote: @adjoint
import Zygote
using BenchmarkTools, LinearAlgebra

Zygote.@adjoint LinearAlgebra.tr(x) = LinearAlgebra.tr(x), Δ->* Matrix(I, size(x)), )

function bench_tr_mul_zygote(x1, x2)
    g = Zygote.gradient(()->tr(x1 * x2), Zygote.Params([x1, x2]))
    g[x1], g[x2]
end

xv, yv = rand(30, 30), rand(30, 30)
@benchmark bench_tr_mul_zygote(xv, yv)
julia> @benchmark bench_tr_mul_zygote(xv, yv)
BenchmarkTools.Trial:
  memory estimate:  11.50 KiB
  allocs estimate:  45
  --------------
  minimum time:     6.263 μs (0.00% GC)
  median time:      7.181 μs (0.00% GC)
  mean time:        9.379 μs (19.17% GC)
  maximum time:     8.945 ms (99.83% GC)
  --------------
  samples:          10000
  evals/sample:     5
@Roger-luo Roger-luo closed this Dec 5, 2018
@Roger-luo
Copy link
Contributor Author

@Roger-luo Roger-luo commented Dec 9, 2018

I tried this on julia master branch to check the mutate branch, and this is way slower than before...

julia> versioninfo()
Julia Version 1.1.0-DEV.844
Commit d7c3926b85 (2018-12-09 00:03 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.2.0)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libimf
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
julia> @benchmark bench_tr_mul_zygote(xv, yv)
BenchmarkTools.Trial:
  memory estimate:  25.95 KiB
  allocs estimate:  55
  --------------
  minimum time:     76.555 μs (0.00% GC)
  median time:      79.688 μs (0.00% GC)
  mean time:        91.859 μs (7.29% GC)
  maximum time:     44.782 ms (99.70% GC)
  --------------
  samples:          10000
  evals/sample:     1
@Roger-luo Roger-luo reopened this Dec 9, 2018
@MikeInnes
Copy link
Member

@MikeInnes MikeInnes commented Jan 24, 2019

The mutate branch is going to regress a lot before it becomes on par with the master branch again (which is the main reason it's on a branch). I'm mostly trying to figure out semantics first, before I improve performance again.

@MikeInnes MikeInnes closed this Feb 15, 2019
bors bot added a commit that referenced this issue Jan 15, 2020
Merge #409
409: matrix exp(A) returns real-valued adjoint for real-valued A r=MikeInnes a=sdewaele

The current adjoint for the matrix `exp(A)` where `A` is real-valued can return a complex-valued matrix. The complex components are only small numerical errors, close to machine precision. However, the fact that it is complex-valued leads to issues in the adjoint computation, see below for an example. This PR converts the return value to a real-valued array for real-valued arguments.

BTW - This issue has some similarities to #402, in that
1) Having indexing in the test code revealed the issue. For that reason, it may be an idea to add a `gradcheck_getindex`, to complement the current `gradcheck`? Or a test of compatibility of the return value of the adjoint with the input type?

2) An alternative fix would be to allow the adjoint of the indexed array to be complex-valued. This is similar to allowing unconstrainted matrices as adjoints in #402. This would also require changes to the `getindex` adjoint.

```julia
# Note: A randomly generated 8 × 8 also fails (almost?) always. Using this confirmed example to be sure.
A = [ 0.0    1.0    0.0
      0.0    0.0    1.0
     -4.34 -18.31  -0.43]
x = [1.0]

f(A,x) = exp(A*x[1])
Y,Bf = Zygote.pullback(f,A,x)
Ȳ = ones(3,3)
Ā,x̄ = Bf(Ȳ) # fails
```
Error:
```julia
ERROR: InexactError: Float64(14.999150171071268 + 9.678155693021492e-16im)
Stacktrace:
 [1] Type at .\complex.jl:37 [inlined]
 [2] convert at .\number.jl:7 [inlined]
 [3] setindex!(::Array{Float64,1}, ::Complex{Float64}, ::Int64) at .\array.jl:767
 [4] #980 at C:\Users\user\.julia\packages\Zygote\ycnjm\src\lib\array.jl:32 [inlined]
 [5] #2619#back at C:\Users\user\.julia\packages\ZygoteRules\6nssF\src\adjoint.jl:49 [inlined]
 [6] literal_getindex at C:\Users\user\.julia\packages\Zygote\ycnjm\src\lib\lib.jl:77 [inlined]
 [7] (::typeof(∂(Zygote.literal_getindex)))(::Complex{Float64}) at C:\Users\user\.julia\packages\Zygote\ycnjm\src\compiler\interface2.jl:0
 [8] f at .\REPL[7]:1 [inlined]
 [9] (::typeof(∂(f)))(::Array{Float64,2}) at C:\Users\user\.julia\packages\Zygote\ycnjm\src\compiler\interface2.jl:0
 [10] (::getfield(Zygote, Symbol("##28#29")){typeof(∂(f))})(::Array{Float64,2}) at C:\Users\user\.julia\packages\Zygote\ycnjm\src\compiler\interface.jl:38
 [11] top-level scope at none:0
```

Co-authored-by: Stijn de Waele <stijn.de.waele123@gmail.com>
Co-authored-by: sdewaele <14310676+sdewaele@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants