
Flux#zygote slower than Tracker #815

Closed
carstenbauer opened this issue Jul 25, 2019 · 17 comments

@carstenbauer

MWE:

using Flux, BenchmarkTools
using Flux: crossentropy, onecold, onehotbatch, throttle, @epochs
using Printf, Statistics, Random
using Base.Iterators: repeated

confs_left = rand(64,4000)
confs_right = rand(64,4000)

# set up as training data
neach = size(confs_left, 2)
X = hcat(confs_left, confs_right)
labels = vcat(fill(1, neach), fill(0, neach))
Y = onehotbatch(labels, 0:1)
dataset = repeated((X, Y), 10)

# create neural network with 10 hidden units and 2 output neurons
Random.seed!(123)
m = Chain(
  Dense(64, 10, relu),
  Dense(10, 2),
  softmax)

# define cost-function
loss(x, y) = crossentropy(m(x), y)
accuracy(x, y) = mean(onecold(m(x)) .== onecold(y))

opt = ADAM()

println("-------- Training")
@btime Flux.train!($loss, params($m), $dataset, $opt)

I get 955.648 ms (28965221 allocations: 587.11 MiB) with Flux + Zygote and 47.298 ms (7471 allocations: 63.52 MiB) with Flux + Tracker.

Is this expected because I'm running Julia 1.1?

@jumerckx

I can reproduce this issue.
To make sure the performance regression happens in the backwards pass, I ran gradient on both versions. (Tracker: left; Zygote: right)
[screenshot: @btime results for gradient, Tracker (left) vs Zygote (right)]
I also tested on Julia 1.3 and the issue remains.

@mkschleg
Contributor

mkschleg commented Jan 31, 2020

So I've been playing around with Zygote a bit, and something I've noticed is that it suffers badly when there are type conversions in the loss you are optimizing over. Tracker also takes a hit, but isn't affected as much as Zygote. If you run the above code, except with

confs_left = rand(Float32, 64,4000)
confs_right = rand(Float32, 64,4000)

you will get much better performance. Basically, you want to make sure your inputs match the element type of your model (at least I think); a quick way to check for that is sketched after the timings. For me the speed-up is significant:

W/ Float64s

julia> @btime Flux.train!($loss, Flux.params($m), $dataset, $opt)
  1.028 s (26084765 allocations: 522.98 MiB)

W/ Float32s

julia> @btime Flux.train!($loss, Flux.params($m), $dataset, $opt)
  55.755 ms (484235 allocations: 54.22 MiB)
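
A minimal way to check for this kind of mismatch up front, assuming the model and data from the MWE above (the check itself is just a sketch; only params and eltype are standard Flux/Julia):

data_T  = eltype(X)                       # Float64 for rand(64, 4000)
param_T = eltype(first(Flux.params(m)))   # Float32 for the default Dense init

if data_T != param_T
    @warn "eltype mismatch between data and model" data_T param_T
end

# Converting the data once up front avoids the repeated promotion inside the loss:
X32 = param_T.(X)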

@oxinabox
Member

oxinabox commented Feb 1, 2020

Not the core of the problem,
but the demo code uses non-const globals in the loss and accuracy functions.
That makes them type-unstable,
which on my computer makes them about 10% slower than they otherwise would be.
It should be:

const m = Chain(
  Dense(64, 10, relu),
  Dense(10, 2),
  softmax)

# define cost-function
loss(x, y) = crossentropy(m(x), y)
accuracy(x, y) = mean(onecold(m(x)) .== onecold(y))
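
An alternative that avoids the global entirely is to close over the model explicitly; make_loss and lossfn below are just illustrative names, but the effect on type stability is the same as the const version:

# Sketch: build the loss as a closure over a local variable instead of a global.
make_loss(model) = (x, y) -> crossentropy(model(x), y)

lossfn = make_loss(m)   # the closure captures `m` with a concrete type
Flux.train!(lossfn, Flux.params(m), dataset, opt)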

@mkschleg
Contributor

mkschleg commented Feb 2, 2020

From what @oxinabox mentioned in the issue I opened in Zygote (FluxML/Zygote.jl#491 (comment)), this does not seem to be an issue with Flux or Zygote, and it is even mentioned in the Flux performance tips.

It might be beneficial for newcomers if this gotcha were highlighted more directly in parts of the tutorials, potentially even in the basics, as this seems to be much less of a problem with the likes of PyTorch and TensorFlow (at least as far as I can tell).

@mcabbott
Member

mcabbott commented Feb 2, 2020

The problem in FluxML/Zygote.jl#491 was that A * B with differing eltypes goes to generic_matmatmul, which is slow.

Flux has some logic to avoid that, at least for Dense layers, by simply converting one of the matrices before multiplying:

Flux.jl/src/layers/basic.jl

Lines 116 to 117 in 5839e16

(a::Dense{<:Any,W})(x::AbstractArray{<:AbstractFloat}) where {T <: Union{Float32,Float64}, W <: AbstractArray{T}} =
a(T.(x))

Making this an error is one way to find problems. (And perhaps that would be a better default for Flux than silent conversions?) But the slowdown with Zygote appears to be caused by precisely how this conversion happens. Here's one way to fix it:

function (a::Flux.Dense{<:Any,W})(x::AbstractArray{<:AbstractFloat}) where {T <: Union{Float32,Float64}, W <: AbstractArray{T}}
  # error("tried to convert types")
  a(arrayconvert(T, x))
end
arrayconvert(T::Type, x::AbstractArray) = T.(x)

using Zygote: @adjoint
@adjoint arrayconvert(T, x) = T.(x), dy -> (nothing, dy)

Probably arrayconvert here should just be convert(AbstractArray{T}, x), although Zygote doesn't like that right now. And perhaps the slowness of T.(x) is a sign that something is wrong in Zygote's broadcasting machinery?
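
The fallback itself is easy to see in isolation; a quick sketch of the forward-pass cost (sizes here are arbitrary):

using BenchmarkTools

A32 = rand(Float32, 100, 100)
B32 = rand(Float32, 100, 100)
B64 = rand(Float64, 100, 100)

@btime $A32 * $B32;   # matched eltypes: hits BLAS
@btime $A32 * $B64;   # mixed eltypes: falls back to the slower generic_matmatmul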

@zhulunwu

I also have this problem.
Here is my test code:

using Flux

const Nf=128
const FRAMES=512
const DIM_H=128

const m=Chain(GRU(Nf,DIM_H),GRU(DIM_H,DIM_H),GRU(DIM_H,DIM_H),Dense(DIM_H,Nf,relu))

function loss(data::Array{Float32,2}, label::Array{Float32,2})
    out = [m(data[:, i]) for i = 1:FRAMES]
    pre = hcat(out...)
    l = Flux.mse(pre, label)
    Flux.reset!(m)
    return l
end

input=rand(Float32,Nf,FRAMES)
label=rand(Float32,Nf,FRAMES)

@time gs=gradient(() -> loss(input, label), params(m))

with Flux@v0.10.3 (Zygote):
2.367152 seconds (2.14 M allocations: 2.130 GiB, 31.22% gc time)

with Flux@v0.9.0 (Tracker):
1.257959 seconds (1.12 M allocations: 698.902 MiB, 8.05% gc time)

Tracker is faster than Zygote, and the difference becomes larger as the input matrix grows; see the allocations and GC time.

@mkschleg
Contributor

Can you check with BenchmarkTools and @btime? It seems like you may be measuring the creation and compile time of your model's gradient operations. I would expect that to be slower than Tracker, as Zygote is doing more work there.
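
Something like this would exclude the one-time compilation (reusing the loss, input, and label defined above; the $ interpolation keeps the globals out of the measurement):

using BenchmarkTools

@btime gradient(() -> $loss($input, $label), $(Flux.params(m)));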

@bhvieira
Contributor

bhvieira commented Apr 16, 2020

using Flux, BenchmarkTools
using Flux: crossentropy, onecold, onehotbatch, throttle, @epochs
using Printf, Statistics, Random
using Base.Iterators: repeated

confs_left = rand(Float32,64,4000)
confs_right = rand(Float32,64,4000)

# set up as training data
neach = size(confs_left, 2)
X = hcat(confs_left, confs_right)
labels = vcat(fill(1, neach), fill(0, neach))
Y = onehotbatch(labels, 0:1)
dataset = repeated((X, Y), 10)

# create neural network with 10 hidden units and 2 output neurons
Random.seed!(123)
const m = Chain(
  Dense(64, 10, relu),
  Dense(10, 2),
  softmax)

# define cost-function
loss(x, y) = crossentropy(m(x), y)
accuracy(x, y) = mean(onecold(m(x)) .== onecold(y))

opt = ADAM()

θ = Flux.params(m);

println("-------- Training")
@btime Flux.train!($loss, $θ, $dataset, $opt)

Running the code above I got this:

#Flux@0.10.3+Zygote@0.4.8
#julia> @btime Flux.train!($loss, $θ, $dataset, $opt)
#  73.998 ms (484454 allocations: 54.23 MiB)

#Flux@0.9.0+Tracker@0.2.6
#julia> @btime Flux.train!($loss, $θ, $dataset, $opt)
#77.788 ms (7774 allocations: 46.44 MiB) 

@bhvieira
Contributor

Zygote allocated a lot more, but in the end it amounts to roughly the same time.

@bhvieira
Contributor

Keeping the code the same, just increasing the size of the hidden layer in m:

const m = Chain(
  Dense(64, 1000, relu),
  Dense(1000, 2),
  softmax)

#Flux@0.10.3+Zygote@0.4.8
#julia> @btime Flux.train!($loss, $θ, $dataset, $opt)
# 4.079 s (484493 allocations: 1.54 GiB)

#Flux@0.9.0+Tracker@0.2.6
#julia> @btime Flux.train!($loss, $θ, $dataset, $opt)
#  4.015 s (7824 allocations: 1.53 GiB)

@DhairyaLGandhi
Member

It'd be good to understand what is happening here, but it seems like the MWE in the OP is now roughly equivalent?

@bhvieira
Contributor

I just asserted the data is Float32, and set the model to be const

@bhvieira
Contributor

The rest is pretty much the same. I removed the call to params inside train!, though.

@bhvieira
Contributor

Why does logitcrossentropy allocate less though?

using Flux: logitcrossentropy
Random.seed!(123)
const m2 = Chain(
  Dense(64, 10, relu),
  Dense(10, 2))
loss2(x, y) = logitcrossentropy(m2(x), y)
opt2 = ADAM()
θ2 = Flux.params(m2);
@btime Flux.train!($loss2, $θ2, $dataset, $opt2)
#julia> @btime Flux.train!($loss2, $θ2, $dataset, $opt2)
#  63.521 ms (2543 allocations: 49.85 MiB)

Compare with

#julia> @btime Flux.train!($loss, $θ, $dataset, $opt)
#  67.696 ms (484463 allocations: 54.23 MiB)
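
For what it's worth, the two losses compute the same value up to floating-point error; logitcrossentropy folds the softmax into the loss, so the model needs no separate softmax layer (and no pullback for it), which may be part of the difference. A quick sketch with arbitrary sizes:

using Flux: crossentropy, logitcrossentropy, softmax, onehotbatch

ŷ = randn(Float32, 2, 8)            # raw logits
y = onehotbatch(rand(0:1, 8), 0:1)

crossentropy(softmax(ŷ), y) ≈ logitcrossentropy(ŷ, y)   # true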

@canthonissen

Don't know why, but the relu activation function seems problematic on my setup.
I replaced it with myrelu(x) = (abs.(x) .+ x) ./ 2 and saw a significant improvement in speed.
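
For anyone who wants to reproduce that locally, a quick sketch (model shape as in the MWE; m_relu and m_myrelu are illustrative names):

using Flux, BenchmarkTools

myrelu(x) = (abs.(x) .+ x) ./ 2

x = rand(Float32, 64, 8000)
m_relu   = Chain(Dense(64, 10, relu),   Dense(10, 2), softmax)
m_myrelu = Chain(Dense(64, 10, myrelu), Dense(10, 2), softmax)

@btime gradient(() -> sum($m_relu($x)),   $(Flux.params(m_relu)));
@btime gradient(() -> sum($m_myrelu($x)), $(Flux.params(m_myrelu)));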

@DhairyaLGandhi
Member

Is it that the adjoint for relu can be made more efficient? It's defined in Zygote.
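
One way to test that would be to benchmark an array-level relu with a hand-written adjoint against the broadcast-derived one; myrelu2 below is just an illustrative name, not anything defined in Flux or Zygote:

using Zygote: @adjoint

# Array-level relu with an explicit adjoint, bypassing the broadcast machinery.
myrelu2(x::AbstractArray) = max.(x, 0)
@adjoint myrelu2(x::AbstractArray) = myrelu2(x), dy -> (dy .* (x .> 0),)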

@CarloLucibello
Member

The regression with respect to Tracker seems to be solved; if other problems persist, please file separate issues.
