
Backward performance of `tr(x1 * x2)` #28

Closed
Roger-luo opened this Issue Oct 20, 2018 · 4 comments

Roger-luo commented Oct 20, 2018

Hi Mike,

I tried to implement a very straightforward AD package for fun (YAAD.jl; writing the core only took me a few hours, and I'm writing a blog post about how to implement an AD in a day with Julia), but it somehow has much better performance than Zygote on this expression:

import Zygote: @grad
import Zygote
using BenchmarkTools, LinearAlgebra

Zygote.@grad LinearAlgebra.tr(x) = LinearAlgebra.tr(x), Δ -> (Δ * Matrix(I, size(x)),)

function bench_tr_mul_zygote(x1, x2)
    g = Zygote.gradient(()->tr(x1 * x2), Zygote.Params([x1, x2]))
    g[x1], g[x2]
end

xv, yv = rand(30, 30), rand(30, 30)
@benchmark bench_tr_mul_zygote(xv, yv)

BTW, my approach is quite similar to AutoGrad.jl's.

julia> @benchmark bench_tr_mul_autograd(autograd_x, autograd_y)
BenchmarkTools.Trial:
  memory estimate:  33.20 KiB
  allocs estimate:  82
  --------------
  minimum time:     50.218 μs (0.00% GC)
  median time:      62.364 μs (0.00% GC)
  mean time:        90.422 μs (9.86% GC)
  maximum time:     55.386 ms (99.86% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark bench_tr_mul_yaad(yaad_x, yaad_y)
BenchmarkTools.Trial:
  memory estimate:  51.50 KiB
  allocs estimate:  16
  --------------
  minimum time:     10.387 μs (0.00% GC)
  median time:      13.429 μs (0.00% GC)
  mean time:        24.273 μs (45.13% GC)
  maximum time:     55.963 ms (99.96% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark bench_tr_mul_zygote(xv, yv)
BenchmarkTools.Trial:
  memory estimate:  32.45 KiB
  allocs estimate:  60
  --------------
  minimum time:     56.645 μs (0.00% GC)
  median time:      61.555 μs (0.00% GC)
  mean time:        73.006 μs (11.52% GC)
  maximum time:     51.025 ms (99.72% GC)
  --------------
  samples:          10000
  evals/sample:     1

The benchmark code is at https://github.com/Roger-luo/YAAD.jl/blob/master/benchmark/tr_mul.jl

I didn't expect to be faster than Zygote, since all I did was dispatch functions based on types (it is not source-to-source). Is there anything I did wrong when benchmarking Zygote? Or is there an issue in the compiler? Or could it be a problem with my machine (could someone help reproduce it)?

I did compare the gradients: I implemented a gradcheck that compares the Jacobian of each function, and my implementation produces the same results as AutoGrad.jl and Zygote.jl, so I guess there is some issue here. I tried to read the Zygote source but had no clue. I'm not sure whether this is related to control flow, but I didn't use any control flow.
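A finite-difference gradcheck like the one mentioned above can be sketched in a few lines (this is a hypothetical `fdgrad` helper for illustration, not the actual gradcheck from YAAD.jl):

```julia
using LinearAlgebra

# Central finite differences: perturb each entry of x in turn and
# approximate the gradient of the scalar function f at x.
function fdgrad(f, x; eps=1e-6)
    g = similar(x)
    for i in eachindex(x)
        xp = copy(x); xp[i] += eps
        xm = copy(x); xm[i] -= eps
        g[i] = (f(xp) - f(xm)) / (2eps)
    end
    return g
end

x1, x2 = rand(5, 5), rand(5, 5)
# For f(x1) = tr(x1 * x2), the analytic gradient is transpose(x2),
# so the numerical gradient should match it closely.
g1 = fdgrad(x -> tr(x * x2), x1)
```

Comparing `g1` against the AD result (and against `transpose(x2)`) is enough to confirm all three implementations agree on this example.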

Version info (I built it from source):

julia> versioninfo()
Julia Version 1.0.1
Commit 0d713926f8 (2018-09-29 19:05 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.0.0)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libimf
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

Best,
Roger


Roger-luo commented Oct 22, 2018

As suggested by @Keno, I benchmarked the functional interface:

bench_tr_mul_zygote_functional(x1, x2) = Zygote.gradient((x1, x2)->tr(x1 * x2), x1, x2)
BenchmarkTools.Trial:
  memory estimate:  29.98 KiB
  allocs estimate:  10
  --------------
  minimum time:     42.527 μs (0.00% GC)
  median time:      46.640 μs (0.00% GC)
  mean time:        56.996 μs (15.31% GC)
  maximum time:     51.718 ms (99.90% GC)
  --------------
  samples:          10000
  evals/sample:     1

Also for reference, I tested manual differentiation:

function bench_tr_mul_base(x1, x2)
    z1 = x1 * x2
    z2 = tr(z1)

    # d tr(z1)/d z1 is the identity matrix
    grad_z1 = Matrix{eltype(z1)}(I, size(z1))
    grad_z1 * transpose(x2), transpose(x1) * grad_z1
end

I guess this could serve as the baseline:

julia> @benchmark bench_tr_mul_base(xv, yv)
BenchmarkTools.Trial:
  memory estimate:  28.78 KiB
  allocs estimate:  5
  --------------
  minimum time:     6.413 μs (0.00% GC)
  median time:      8.201 μs (0.00% GC)
  mean time:        12.215 μs (31.57% GC)
  maximum time:     11.012 ms (99.87% GC)
  --------------
  samples:          10000
  evals/sample:     5
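The baseline follows from the matrix-calculus identities ∂tr(x1*x2)/∂x1 = transpose(x2) and ∂tr(x1*x2)/∂x2 = transpose(x1); since `grad_z1` is the identity, the two products in `bench_tr_mul_base` reduce to plain transposes. A quick sanity check of that simplification:

```julia
using LinearAlgebra

x1, x2 = rand(4, 4), rand(4, 4)
# grad_z1 is the identity, so multiplying by it is a no-op:
grad_z1 = Matrix{Float64}(I, 4, 4)
g1 = grad_z1 * transpose(x2)   # same as transpose(x2)
g2 = transpose(x1) * grad_z1   # same as transpose(x1)
```

So in principle an AD tool could skip both multiplications entirely for this expression.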

Roger-luo commented Dec 5, 2018

Update on the benchmark!!! It now matches the manual gradient!!!

import Zygote: @adjoint
import Zygote
using BenchmarkTools, LinearAlgebra

Zygote.@adjoint LinearAlgebra.tr(x) = LinearAlgebra.tr(x), Δ -> (Δ * Matrix(I, size(x)),)

function bench_tr_mul_zygote(x1, x2)
    g = Zygote.gradient(()->tr(x1 * x2), Zygote.Params([x1, x2]))
    g[x1], g[x2]
end

xv, yv = rand(30, 30), rand(30, 30)
julia> @benchmark bench_tr_mul_zygote(xv, yv)
BenchmarkTools.Trial:
  memory estimate:  11.50 KiB
  allocs estimate:  45
  --------------
  minimum time:     6.263 μs (0.00% GC)
  median time:      7.181 μs (0.00% GC)
  mean time:        9.379 μs (19.17% GC)
  maximum time:     8.945 ms (99.83% GC)
  --------------
  samples:          10000
  evals/sample:     5

@Roger-luo Roger-luo closed this Dec 5, 2018


Roger-luo commented Dec 9, 2018

I tried this on the Julia master branch to check the mutate branch, and it is way slower than before...

julia> versioninfo()
Julia Version 1.1.0-DEV.844
Commit d7c3926b85 (2018-12-09 00:03 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.2.0)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libimf
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
julia> @benchmark bench_tr_mul_zygote(xv, yv)
BenchmarkTools.Trial:
  memory estimate:  25.95 KiB
  allocs estimate:  55
  --------------
  minimum time:     76.555 μs (0.00% GC)
  median time:      79.688 μs (0.00% GC)
  mean time:        91.859 μs (7.29% GC)
  maximum time:     44.782 ms (99.70% GC)
  --------------
  samples:          10000
  evals/sample:     1

@Roger-luo Roger-luo reopened this Dec 9, 2018


MikeInnes commented Jan 24, 2019

The mutate branch is going to regress a lot before it's back on par with the master branch (which is the main reason it's on a branch). I'm mostly trying to figure out the semantics first, before improving performance again.

@MikeInnes MikeInnes closed this Feb 15, 2019
