Model optimization fails (NaNs) with Zygote.pullback but works with Tracker.forward #876
Are you able to run any finite differencing tests to show that the gradients are incorrect? The easiest way to debug would be to start with that failing test and gradually simplify the forward pass.
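A minimal central-difference check might look like the sketch below. This is my illustration, not code from the thread: `fd_gradient` is a hypothetical helper, and it assumes a `model`, a `criterion`, and a batch `(xb, yb)` shaped like those in the scripts later in this thread.

```julia
# Central-difference estimate of ∂loss/∂W[i] for a single weight entry.
# Mutates W in place and restores it afterwards. `model`, `criterion`,
# `xb`, `yb` are assumed to be defined as in the scripts below.
function fd_gradient(model, criterion, xb, yb; i = 1, ϵ = 1e-4)
    W = model[1].W
    loss() = criterion(model(xb), yb)
    orig = W[i]
    W[i] = orig + ϵ; lp = loss()
    W[i] = orig - ϵ; lm = loss()
    W[i] = orig                     # restore the original weight
    return (lp - lm) / (2ϵ)         # compare against grads[W][i] from the AD backend
end
```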
```julia
using Random
using Flux
using Tracker
using Logging
using IterTools: take, repeatedly
using Statistics
using MLDataUtils
using NPZ

# Download dataset
exampledir = joinpath("test", "example")
#; wget https://github.com/ecreager/csc2541-f19/raw/master/assignment1/adult/adult_train.npz -P $exampledir

args = Dict(
    :seed       => 1,
    :batch_size => 10,
    :num_epochs => 5,
    :lr         => 1e-5,
)

"""Make Python data Julia-y: move the first dimension last."""
function first_dim_last(A)
    last_dim = ndims(A)
    return permutedims(A, vcat(last_dim, [1:(last_dim - 1);]))
end

train_data = npzread(joinpath(exampledir, "adult_train.npz"))
x = first_dim_last(train_data["x"])
y = first_dim_last(train_data["y"])
FS = size(x)[1]   # feature dimension

function make_epochs(dataset, args)
    Random.seed!(args[:seed])
    function make_batches()
        return batchview(shuffleobs(dataset), args[:batch_size])
    end
    return repeatedly(make_batches, args[:num_epochs])
end

function run_zygote_loop()
    Random.seed!(args[:seed])
    model = Chain(
        Dense(FS, 1000),
        Dense(1000, 1),
    )
    θ = Flux.params(model)
    criterion = (output, target) -> mean(Flux.logitbinarycrossentropy.(output, target))
    optimizer = Flux.ADAM(args[:lr], (0.9, 0.99))
    epochs = make_epochs((x, y), args)
    for (i, batches) in enumerate(epochs)
        for batch in batches
            x, y = batch
            loss, pb = Flux.Zygote.pullback(() -> criterion(model(x), y), θ)
            ∇θ = pb(1.)
            Flux.Optimise.update!(optimizer, θ, ∇θ)
            @show loss
        end
        println("finished epoch $i")
    end
end

function run_tracker_loop()
    Random.seed!(args[:seed])
    model = Chain(
        Dense(FS, 1000),
        Dense(1000, 1),
    )
    θ = Flux.params(model)
    criterion = (output, target) -> mean(Flux.logitbinarycrossentropy.(output, target))
    optimizer = Flux.ADAM(args[:lr], (0.9, 0.99))
    epochs = make_epochs((x, y), args)
    for (i, batches) in enumerate(epochs)
        for batch in batches
            x, y = batch
            loss, pb = Flux.Tracker.forward(() -> criterion(model(x), y), θ)
            ∇θ = pb(1.)
            Flux.Optimise.update!(optimizer, θ, ∇θ)
            @show loss
        end
        println("finished epoch $i")
    end
end
```

@MikeInnes by the way, here is the above script with the ability to download and load the dataset. However, I do not know how to test both the `Zygote` and `Tracker` loops under the same set of package versions.
Note, again, the only difference between the two loops is `Zygote.pullback` vs. `Tracker.forward`.
@MikeInnes I did not do any finite differencing yet because implicit -> explicit model utilities aren't in Flux yet, and DiffEqFlux expects Tracker. Instead I sketchily made my model Tracker-visible; it would be nice to have utilities for this. Below you can see that taking the gradients with Zygote results in `NaN`s:

```julia
# Debugging Zygote gradient
batches = first(make_epochs((x, y), args))
xi, yi = first(batches)
model = Chain(
    Dense(FS, 1000),
    Dense(1000, 1),
)
θ = Flux.params(model)
criterion = (output, target) -> mean(Flux.logitbinarycrossentropy.(output, target))
optimizer = Flux.ADAM(args[:lr], (0.9, 0.99))

# Zygote gradients NaN
loss, pb = Flux.Zygote.pullback(() -> criterion(model(xi), yi), Zygote.Params([model[1].W]))
any(isnan.(pb(1.)[model[1].W]))   # true

# Make the model usable by Tracker
using Flux: mapleaves
function trackerify(leaf)
    if leaf isa AbstractArray
        return Tracker.param(leaf)
    else
        return leaf
    end
end
tracked_model = mapleaves(trackerify, model)

# Tracker gradients do not NaN
tloss, tpb = Tracker.forward(() -> criterion(tracked_model(xi), yi), Tracker.Params(tracked_model[1].W))
loss == Flux.data(tloss)
any(isnan.(tpb(1.)[tracked_model[1].W]))   # false
```
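A quick extension of this check (my sketch, not part of the original comment) scans every parameter's Zygote gradient for `NaN`s rather than just the first layer's weights:

```julia
# Assumes `model`, `criterion`, `xi`, `yi`, and `θ = Flux.params(model)`
# from the snippet above; prints, per parameter, whether its Zygote
# gradient contains any NaNs.
loss, pb = Flux.Zygote.pullback(() -> criterion(model(xi), yi), θ)
∇θ = pb(1.)
for p in θ
    println(size(p), " => any NaN: ", any(isnan, ∇θ[p]))
end
```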
@jessebett I'm happy to dig into this code, but I can't currently run your last script due to missing definitions. Can you provide a project/manifest file and a complete script for that last snippet?
Hi @MikeInnes, below are the full versions of the script and the project file. Note that I am not able to run `run_tracker_loop` on Flux#master (see the comment in that function). However, the Zygote loop and the debugging section run:

```julia
using Random
using Flux
using Tracker
using Logging
using IterTools: take, repeatedly
using Statistics
using MLDataUtils
using NPZ
using Zygote

# Download dataset
exampledir = joinpath("test", "example")
#; wget https://github.com/ecreager/csc2541-f19/raw/master/assignment1/adult/adult_train.npz -P $exampledir

args = Dict(
    :seed       => 1,
    :batch_size => 10,
    :num_epochs => 5,
    :lr         => 1e-5,
)

"""Make Python data Julia-y: move the first dimension last."""
function first_dim_last(A)
    last_dim = ndims(A)
    return permutedims(A, vcat(last_dim, [1:(last_dim - 1);]))
end

train_data = npzread(joinpath(exampledir, "adult_train.npz"))
x = first_dim_last(train_data["x"])
y = first_dim_last(train_data["y"])
FS = size(x)[1]

function make_epochs(dataset, args)
    Random.seed!(args[:seed])
    function make_batches()
        return batchview(shuffleobs(dataset), args[:batch_size])
    end
    return repeatedly(make_batches, args[:num_epochs])
end

function run_zygote_loop()
    Random.seed!(args[:seed])
    model = Chain(
        Dense(FS, 1000),
        Dense(1000, 1),
    )
    θ = Flux.params(model)
    criterion = (output, target) -> mean(Flux.logitbinarycrossentropy.(output, target))
    optimizer = Flux.ADAM(args[:lr], (0.9, 0.99))
    epochs = make_epochs((x, y), args)
    for (i, batches) in enumerate(epochs)
        for batch in batches
            x, y = batch
            loss, pb = Flux.Zygote.pullback(() -> criterion(model(x), y), θ)
            ∇θ = pb(1.)
            Flux.Optimise.update!(optimizer, θ, ∇θ)
            @show loss
        end
        println("finished epoch $i")
    end
end

function run_tracker_loop()
    # Cannot figure out how to run this with Flux#master
    Random.seed!(args[:seed])
    model = Chain(
        Dense(FS, 1000),
        Dense(1000, 1),
    ) |> track
    θ = Flux.params(model)
    criterion = (output, target) -> mean(Flux.logitbinarycrossentropy.(output, target))
    optimizer = Flux.ADAM(args[:lr], (0.9, 0.99))
    epochs = make_epochs((x, y), args)
    for (i, batches) in enumerate(epochs)
        for batch in batches
            x, y = batch
            loss, pb = Tracker.forward(() -> criterion(model(x), y), θ)
            ∇θ = pb(1.)
            Flux.Optimise.update!(optimizer, θ, ∇θ)
            @show loss
        end
        println("finished epoch $i")
    end
end
# Debugging Zygote gradient directly
batches = first(make_epochs((x, y), args))
xi, yi = first(batches)
model = Chain(
    Dense(FS, 1000),
    Dense(1000, 1),
)
θ = Flux.params(model)
criterion = (output, target) -> mean(Flux.logitbinarycrossentropy.(output, target))
optimizer = Flux.ADAM(args[:lr], (0.9, 0.99))

# Zygote gradients NaN
loss, pb = Flux.Zygote.pullback(() -> criterion(model(xi), yi), Zygote.Params([model[1].W]))
any(isnan.(pb(1.)[model[1].W]))   # true

# Make the model usable by Tracker
using Flux: mapleaves
function trackerify(leaf)
    if leaf isa AbstractArray
        return Tracker.param(leaf)
    else
        return leaf
    end
end
tracked_model = mapleaves(trackerify, model)

# Tracker gradients do not NaN
tloss, tpb = Tracker.forward(() -> criterion(tracked_model(xi), yi), Tracker.Params(tracked_model[1].W))
loss == Flux.data(tloss)
any(isnan.(tpb(1.)[tracked_model[1].W]))   # false
```

And the Project.toml:

```toml
name = "A1"
[deps]
Distributions = "31c24e10-a181-5473-b8eb-7969acd0382f"
DrWatson = "634d3b9d-ee7a-5ddf-bec9-22491ea816e1"
Flux = "587475ba-b771-5e3f-ad9e-33799f191a9c"
IRTools = "7869d1d1-7146-5819-86e3-90919afe41df"
IterTools = "c8e1da08-722c-5040-9ed9-7db0dc04731e"
MLDataUtils = "cc2ba9b6-d476-5e6d-8eaf-a92d5412d41d"
NPZ = "15e1cf62-19b3-5cfa-8e77-841668bca605"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
Tracker = "9f7883ad-71c0-57eb-9f7f-b5c9e6d3789c"
Zygote = "e88e6eb3-aa80-5325-afca-941959d7151f"
ZygoteRules = "700de1a5-db45-46bc-99cf-38207098b444"
```
Is the training loop / use of the optimiser necessary to reproduce this, or can you see the problem just from taking gradients? It would be great to also have the Manifest you're using, so I get the exact same versions of all packages.
@MikeInnes Okay, here is a minimal version that just computes the gradients with `Zygote` and `Tracker`:

```julia
using Random
using Flux
using Tracker
using MLDataUtils
using Zygote
using Statistics: mean

# Dummy data
x = Float32.(rand(113, 100000))
y = sum(x .^ 2, dims = 1)
dataset = batchview((x, y), size = 256)
x1, y1 = first(dataset)

# Flux model
model = Chain(
    Dense(113, 1000),
    Dense(1000, 1),
)
θ = params(model)

# Compatibility with Tracker
track(m) = fmap(x -> x isa AbstractArray ? Tracker.param(x) : x, m)
t_model = track(model)
t_θ = params(t_model)

# Objective
criterion(logits, y) = mean(Flux.logitbinarycrossentropy.(logits, y))

# Zygote gradient
loss, back = Flux.Zygote.pullback(() -> criterion(model(x1), y1), θ)
grads = back(1.)
gW1 = grads[model[1].W][1]

# Tracker gradient
t_loss, t_back = Tracker.forward(() -> criterion(t_model(x1), y1), t_θ)
t_grads = t_back(1.)
t_gW1 = t_grads[t_model[1].W][1]

@show gW1     # 0.7067871023580545
@show t_gW1   # 0.706787102655657 (tracked)
isapprox(gW1, t_gW1)   # true
```

You can see from the results of the print statements that the gradients agree only up to a certain amount of precision, so while `isapprox` passes, a strict `==` test would fail. Not sure if this causes the divergent behaviour, however. And I'm unable to test with the same package versions loaded, because the new Flux no longer carries Tracker. Let me know if you still need the Manifest, and whether it's best to put it in an issue comment?
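To compare more than a single weight entry, one could scan the worst-case disagreement across all parameters. This is my sketch, not part of the original comment, and it assumes the names from the minimal script above:

```julia
# Assumes `θ`, `t_θ`, `grads`, and `t_grads` from the minimal script.
# Iterates matched parameter pairs and reports the largest elementwise gap.
for (p, tp) in zip(θ, t_θ)
    g  = grads[p]
    tg = Tracker.data(t_grads[tp])
    println(size(p), " => max abs diff: ", maximum(abs.(g .- tg)))
end
```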
Those kinds of differences shouldn't be enough to cause divergence/NaNs in any remotely well-conditioned optimization problem. Something else must be off...
Right, we previously had an example where the gradients coming back from Zygote had `NaN`s in them.
Here are the Manifest, Project, and test files. https://gist.github.com/jessebett/884cfde5b33aed3dc48802f10610f8d7
Ok, can you reproduce the `NaN`s with that setup?
This is the definition of `softplus`:

```julia
softplus(x::Real) = ifelse(x > 0, x + log1p(exp(-x)), log1p(exp(x)))
```

```julia
using Flux
softplus'(100f0)     # NaN

log1pexp(x) = log1p(exp(x))
log1p'(exp(100f0))   # 0.0f0
exp'(100f0)          # Inf32
log1pexp'(100f0)     # 0.0f0 * Inf32 == NaN32
```

Because `ifelse` is an ordinary function, Zygote differentiates through both branches; the `log1p(exp(x))` branch overflows for large `x`, and its `0.0f0 * Inf32` pullback contributes a `NaN` even though `x > 0` selects the stable branch.
One can define an adjoint for it (note the derivative of softplus is `σ(x)`):

```julia
Zygote.@adjoint function softplus(x::Real)
    y = softplus(x)
    return y, Δ -> (Δ * σ(x),)
end
```

Then:

```julia
softplus'(100f0) == 1f0

Zygote.gradient([100f0]) do x
    sum(softplus.(x))
end[1] == [1f0]
```
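As an alternative to a hand-written adjoint, a branch-free formulation avoids the overflowing subexpression entirely, so plain AD stays finite. This is my sketch, not something proposed in the thread, and `stable_softplus` is a hypothetical name:

```julia
# Numerically stable softplus: max(x, 0) + log1p(exp(-|x|)) never evaluates
# exp of a large positive argument, so neither the value nor the Zygote
# gradient overflows.
stable_softplus(x::Real) = max(x, zero(x)) + log1p(exp(-abs(x)))

stable_softplus'(100f0)   # ≈ 1f0, no NaN
```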
@MikeInnes I have a very simple model that does not train on Flux#master due to `NaN`s from exploding gradients. However, the exact same code works and trains as expected with `Zygote.pullback` swapped for `Tracker.forward`. The training loops are the `run_zygote_loop` / `run_tracker_loop` shown above.

The training is a bit unstable in the first few steps of gradient descent. You can see this in the `tracker_loop` loss output (not reproduced here), and it eventually converges. However, running this with the `zygote_loop`, it `NaN`s during the initial iterations and does not recover. The dataset and batch handling code is in a private repo; if it isn't clear what is happening here, I can add you so you can run the scripts that produce these two tests, if you'd like.

However, one thing that's apparent right away is that the loss after the first optimizer step is unequal between `Zygote` and `Tracker`. This should not be the case: I've set the random seeds and the code is identical, so unless I'm missing something, it's possible that this indicates where the numerical instability is coming from?