
Flux#zygote slower than Tracker #815

Closed
carstenbauer opened this issue Jul 25, 2019 · 17 comments

@carstenbauer

MWE:

using Flux, BenchmarkTools
using Flux: crossentropy, onecold, onehotbatch, throttle, @epochs
using Printf, Statistics, Random
using Base.Iterators: repeated

confs_left = rand(64,4000)
confs_right = rand(64,4000)

# set up as training data
neach = size(confs_left, 2)
X = hcat(confs_left, confs_right)
labels = vcat(fill(1, neach), fill(0, neach))
Y = onehotbatch(labels, 0:1)
dataset = repeated((X, Y), 10)

# create neural network with 10 hidden units and 2 output neurons
Random.seed!(123)
m = Chain(
  Dense(64, 10, relu),
  Dense(10, 2),
  softmax)

# define cost-function
loss(x, y) = crossentropy(m(x), y)
accuracy(x, y) = mean(onecold(m(x)) .== onecold(y))

opt = ADAM()

println("-------- Training")
@btime Flux.train!($loss, params($m), $dataset, $opt)

I get 955.648 ms (28965221 allocations: 587.11 MiB) with Flux + Zygote and 47.298 ms (7471 allocations: 63.52 MiB) with Flux + Tracker.

Is this expected because I'm running Julia 1.1?

@jumerckx

I can reproduce this issue.
To make sure the performance regression happens in the backwards pass, I ran gradient on both versions. (Tracker: left; Zygote: right)
[screenshot: @btime results for gradient, Tracker (left) vs Zygote (right)]
I also tested on Julia 1.3 and the issue remains.

@mkschleg
Contributor

mkschleg commented Jan 31, 2020

So I've been playing around with Zygote a bit, and something I've noticed is that it suffers badly when there are type conversions in the loss you are optimizing over. Tracker also takes a hit, but isn't affected as much as Zygote. If you run the above code, except with

confs_left = rand(Float32, 64,4000)
confs_right = rand(Float32, 64,4000)

you will get much better performance. Basically, you want to make sure your inputs match the element type of your model (at least I think); a quick way to check for that is sketched after the timings. For me the speed-up is significant:

W/ Float64s

julia> @btime Flux.train!($loss, Flux.params($m), $dataset, $opt)
  1.028 s (26084765 allocations: 522.98 MiB)

W/ Float32s

julia> @btime Flux.train!($loss, Flux.params($m), $dataset, $opt)
  55.755 ms (484235 allocations: 54.22 MiB)
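
A minimal way to check for this kind of mismatch up front, assuming the model and data from the MWE above (the check itself is just a sketch; only params and eltype are standard Flux/Julia):

data_T  = eltype(X)                       # Float64 for rand(64, 4000)
param_T = eltype(first(Flux.params(m)))   # Float32 for the default Dense init

if data_T != param_T
    @warn "eltype mismatch between data and model" data_T param_T
end

# Converting the data once up front avoids the repeated promotion inside the loss:
X32 = param_T.(X)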

@oxinabox
Member

oxinabox commented Feb 1, 2020

Not the core of the problem,
but the demo code uses non-const globals in the loss and accuracy functions.
That makes them type-unstable,
which on my computer makes them about 10% slower than they otherwise would be.
It should be:

const m = Chain(
  Dense(64, 10, relu),
  Dense(10, 2),
  softmax)

# define cost-function
loss(x, y) = crossentropy(m(x), y)
accuracy(x, y) = mean(onecold(m(x)) .== onecold(y))
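
An alternative that avoids the global entirely is to close over the model explicitly; make_loss and lossfn below are just illustrative names, but the effect on type stability is the same as the const version:

# Sketch: build the loss as a closure over a local variable instead of a global.
make_loss(model) = (x, y) -> crossentropy(model(x), y)

lossfn = make_loss(m)   # the closure captures `m` with a concrete type
Flux.train!(lossfn, Flux.params(m), dataset, opt)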

@mkschleg
Contributor

mkschleg commented Feb 2, 2020

From what @oxinabox mentioned in the issue I opened in Zygote (FluxML/Zygote.jl#491 (comment)), this does not seem to be an issue with Flux or Zygote, and it is even mentioned in the Flux performance tips.

It might be beneficial for newcomers if this gotcha were highlighted more directly in parts of the tutorials, potentially even in the basics, as this seems to be much less of a problem with the likes of PyTorch and TensorFlow (at least as far as I can tell).

@mcabbott
Member

mcabbott commented Feb 2, 2020

The problem in FluxML/Zygote.jl#491 was that A * B with differing eltypes goes to generic_matmatmul, which is slow.

Flux has some logic to avoid that, at least for Dense layers, by simply converting one of the matrices before multiplying:

Flux.jl/src/layers/basic.jl

Lines 116 to 117 in 5839e16

(a::Dense{<:Any,W})(x::AbstractArray{<:AbstractFloat}) where {T <: Union{Float32,Float64}, W <: AbstractArray{T}} =
a(T.(x))

Making this an error is one way to find problems. (And perhaps that would be a better default for Flux than silent conversions?) But the slowdown with Zygote appears to be caused by precisely how this conversion happens. Here's one way to fix it:

function (a::Flux.Dense{<:Any,W})(x::AbstractArray{<:AbstractFloat}) where {T <: Union{Float32,Float64}, W <: AbstractArray{T}}
  # error("tried to convert types")
  a(arrayconvert(T, x))
end
arrayconvert(T::Type, x::AbstractArray) = T.(x)

using Zygote: @adjoint
@adjoint arrayconvert(T, x) = T.(x), dy -> (nothing, dy)

Probably arrayconvert here should just be convert(AbstractArray{T}, x), although Zygote doesn't like that right now. And perhaps the slowness of T.(x) is a sign that something is wrong in Zygote's broadcasting machinery?
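
The fallback itself is easy to see in isolation; a quick sketch of the forward-pass cost (sizes here are arbitrary):

using BenchmarkTools

A32 = rand(Float32, 100, 100)
B32 = rand(Float32, 100, 100)
B64 = rand(Float64, 100, 100)

@btime $A32 * $B32;   # matched eltypes: hits BLAS
@btime $A32 * $B64;   # mixed eltypes: falls back to the slower generic_matmatmul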

@zhulunwu

I also have this problem.
Here is my test code:

using Flux

const Nf=128
const FRAMES=512
const DIM_H=128

const m=Chain(GRU(Nf,DIM_H),GRU(DIM_H,DIM_H),GRU(DIM_H,DIM_H),Dense(DIM_H,Nf,relu))

function loss(data::Array{Float32,2}, label::Array{Float32,2})
    out = [m(data[:, i]) for i = 1:FRAMES]
    pre = hcat(out...)
    l = Flux.mse(pre, label)
    Flux.reset!(m)
    return l
end

input=rand(Float32,Nf,FRAMES)
label=rand(Float32,Nf,FRAMES)

@time gs=gradient(() -> loss(input, label), params(m))

with Flux@v0.10.3 (Zygote):
2.367152 seconds (2.14 M allocations: 2.130 GiB, 31.22% gc time)

with Flux@v0.9.0 (Tracker):
1.257959 seconds (1.12 M allocations: 698.902 MiB, 8.05% gc time)

Tracker is faster than Zygote, and the difference becomes larger as the input matrix grows; see the allocations and GC time.

@mkschleg
Contributor

Can you check with BenchmarkTools and @btime? It seems like you may be measuring the creation and compile time of your model's gradient operations. I would expect that to be slower than Tracker, as Zygote is doing more work there.
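
Something like this would exclude the one-time compilation (reusing the loss, input, and label defined above; the $ interpolation keeps the globals out of the measurement):

using BenchmarkTools

@btime gradient(() -> $loss($input, $label), $(Flux.params(m)));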

@bhvieira
Contributor

bhvieira commented Apr 16, 2020

using Flux, BenchmarkTools
using Flux: crossentropy, onecold, onehotbatch, throttle, @epochs
using Printf, Statistics, Random
using Base.Iterators: repeated

confs_left = rand(Float32,64,4000)
confs_right = rand(Float32,64,4000)

# set up as training data
neach = size(confs_left, 2)
X = hcat(confs_left, confs_right)
labels = vcat(fill(1, neach), fill(0, neach))
Y = onehotbatch(labels, 0:1)
dataset = repeated((X, Y), 10)

# create neural network with 10 hidden units and 2 output neurons
Random.seed!(123)
const m = Chain(
  Dense(64, 10, relu),
  Dense(10, 2),
  softmax)

# define cost-function
loss(x, y) = crossentropy(m(x), y)
accuracy(x, y) = mean(onecold(m(x)) .== onecold(y))

opt = ADAM()

θ = Flux.params(m);

println("-------- Training")
@btime Flux.train!($loss, $θ, $dataset, $opt)

Running the code above I got this:

#Flux@0.10.3+Zygote@0.4.8
#julia> @btime Flux.train!($loss, $θ, $dataset, $opt)
#  73.998 ms (484454 allocations: 54.23 MiB)

#Flux@0.9.0+Tracker@0.2.6
#julia> @btime Flux.train!($loss, $θ, $dataset, $opt)
#77.788 ms (7774 allocations: 46.44 MiB) 

@bhvieira
Contributor

Zygote allocated a lot more, but in the end it amounts to roughly the same time.

@bhvieira
Contributor

Keeping the code the same, just increasing the size of the hidden layer in m:

const m = Chain(
  Dense(64, 1000, relu),
  Dense(1000, 2),
  softmax)

#Flux@0.10.3+Zygote@0.4.8
#julia> @btime Flux.train!($loss, $θ, $dataset, $opt)
# 4.079 s (484493 allocations: 1.54 GiB)

#Flux@0.9.0+Tracker@0.2.6
#julia> @btime Flux.train!($loss, $θ, $dataset, $opt)
#  4.015 s (7824 allocations: 1.53 GiB)

@DhairyaLGandhi
Member

It'd be good to understand what is happening here, but it seems like the MWE in the OP is now roughly equivalent?

@bhvieira
Contributor

I just asserted the data is Float32, and set the model to be const

@bhvieira
Contributor

The rest is pretty much the same. I removed the call to params inside train!, though.

@bhvieira
Contributor

Why does logitcrossentropy allocate less though?

using Flux: logitcrossentropy
Random.seed!(123)
const m2 = Chain(
  Dense(64, 10, relu),
  Dense(10, 2))
loss2(x, y) = logitcrossentropy(m2(x), y)
opt2 = ADAM()
θ2 = Flux.params(m2);
@btime Flux.train!($loss2, $θ2, $dataset, $opt2)
#julia> @btime Flux.train!($loss2, $θ2, $dataset, $opt2)
#  63.521 ms (2543 allocations: 49.85 MiB)

Compare with

#julia> @btime Flux.train!($loss, $θ, $dataset, $opt)
#  67.696 ms (484463 allocations: 54.23 MiB)
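
For what it's worth, the two losses compute the same value up to floating-point error; logitcrossentropy folds the softmax into the loss, so the model needs no separate softmax layer (and no pullback for it), which may be part of the difference. A quick sketch with arbitrary sizes:

using Flux: crossentropy, logitcrossentropy, softmax, onehotbatch

ŷ = randn(Float32, 2, 8)            # raw logits
y = onehotbatch(rand(0:1, 8), 0:1)

crossentropy(softmax(ŷ), y) ≈ logitcrossentropy(ŷ, y)   # true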

@canthonissen

Don't know why, but the relu activation function seems problematic on my setup.
I replaced it with myrelu(x) = (abs.(x) .+ x) ./ 2 and saw a significant improvement in speed.
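
For anyone who wants to reproduce that locally, a quick sketch (model shape as in the MWE; m_relu and m_myrelu are illustrative names):

using Flux, BenchmarkTools

myrelu(x) = (abs.(x) .+ x) ./ 2

x = rand(Float32, 64, 8000)
m_relu   = Chain(Dense(64, 10, relu),   Dense(10, 2), softmax)
m_myrelu = Chain(Dense(64, 10, myrelu), Dense(10, 2), softmax)

@btime gradient(() -> sum($m_relu($x)),   $(Flux.params(m_relu)));
@btime gradient(() -> sum($m_myrelu($x)), $(Flux.params(m_myrelu)));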

@DhairyaLGandhi
Member

Is it that the adjoint for relu can be made more efficient? It's defined in Zygote.
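
One way to test that would be to benchmark an array-level relu with a hand-written adjoint against the broadcast-derived one; myrelu2 below is just an illustrative name, not anything defined in Flux or Zygote:

using Zygote: @adjoint

# Array-level relu with an explicit adjoint, bypassing the broadcast machinery.
myrelu2(x::AbstractArray) = max.(x, 0)
@adjoint myrelu2(x::AbstractArray) = myrelu2(x), dy -> (dy .* (x .> 0),)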

@CarloLucibello
Member

The regression with respect to Tracker seems to be solved; if other problems persist, please file separate issues.
