Model optimization fails (NaNs) with Zygote.pullback but works with Tracker.forward #876

Open · jessebett opened this issue Sep 27, 2019 · 14 comments

@jessebett (Contributor) commented Sep 27, 2019

@MikeInnes I have a very simple model that does not train on Flux#master due to NaNs from exploding gradients. However, the exact same code works and trains as expected when Zygote.pullback is replaced with Tracker.forward.

Here are the training loops:

args_dict = Dict(
                 :log_dir => datadir("logs"),
                 :seed => 1,
                 :batch_size => 10,
                 :num_epochs => 5,
                 :lr =>1e-5
                )
args_list = dict_list(args_dict)
args=args_list[1]


function run_tracker_loop()
  Random.seed!(args[:seed])
  train_batches, test_batches = utils.gen_epoch(args[:batch_size], seed=args[:seed])
  model = Chain(
                Dense(utils.NUM_FEATURES,1000),
                Dense(1000,1)
               )
  θ=Flux.params(model)

  criterion = (output,target) -> mean(Flux.logitbinarycrossentropy.(output,target))
  optimizer = Flux.ADAM(args[:lr],(0.9,0.99))

  epochs = take(train_batches,args[:num_epochs])
  batches = first(epochs)
  for batch in batches
    x,a,y = batch;
    loss, pullback = Flux.Tracker.forward(()->criterion(model(x),y),θ);
    ∇θ = pullback(1.);
    @show loss
    Flux.Optimise.update!(optimizer,θ,∇θ);
  end
end

function run_zygote_loop()
  Random.seed!(args[:seed])
  train_batches, test_batches = utils.gen_epoch(args[:batch_size], seed=args[:seed])
  model = Chain(
                Dense(utils.NUM_FEATURES,1000),
                Dense(1000,1)
               )
  θ=Flux.params(model)

  criterion = (output,target) -> mean(Flux.logitbinarycrossentropy.(output,target))
  optimizer = Flux.ADAM(args[:lr],(0.9,0.99))

  epochs = take(train_batches,args[:num_epochs])
  batches = first(epochs)
  for batch in batches
    x,a,y = batch;
    loss, pullback = Flux.Zygote.pullback(()->criterion(model(x),y),θ);
    ∇θ = pullback(1.);
    @show loss
    Flux.Optimise.update!(optimizer,θ,∇θ);
  end
end

The training is a bit unstable in the first few steps of gradient descent. You can see this in run_tracker_loop, whose output looks like:

julia> run_tracker_loop()
loss = 4.153761120699346 (tracked)
loss = 8.558516177907586 (tracked)
loss = 77.52821938525885 (tracked)
loss = 22.02223796546459 (tracked)
loss = 2.345904890820384 (tracked)
loss = 62.89235497526825 (tracked)
loss = 2.1650534560903907 (tracked)
loss = 2.3495737865567206 (tracked)
loss = 1.98077632188797 (tracked)

And this eventually converges.

However, running this with run_zygote_loop, the loss becomes NaN during the initial iterations and does not recover:

julia> run_zygote_loop()
loss = 4.153761120699346
loss = 8.747800693288445
loss = 80.12227465640754
loss = NaN
loss = NaN
loss = NaN
loss = NaN
loss = NaN
loss = NaN

The dataset and batch-handling code is in a private repo. If it isn't clear what is happening here, I can add you to it so you can run the scripts that produce these two tests, if you'd like.

However, one thing that's apparent right away is that the loss after the first optimizer step is unequal between Zygote and Tracker. This should not be the case: I've set the random seeds and the code is identical, so unless I'm missing something, this may indicate where the numerical instability is coming from.

@MikeInnes (Member)

Are you able to run any finite-differencing tests to show that the gradients are incorrect? The easiest way to debug would be to start with that failing test and gradually simplify the forward pass.
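
For concreteness, a minimal sketch of such a check (purely illustrative; the fd_check helper, batch names, and step size are not from this thread):

using Flux, Zygote

# Compare Zygote's gradient of a single weight entry against a central
# finite difference of the loss. Assumes `model` and `criterion` built as in
# the snippets in this issue, plus a single batch `(xi, yi)`.
function fd_check(model, criterion, xi, yi; ε = 1f-3)
  W = model[1].W
  loss() = criterion(model(xi), yi)
  _, back = Zygote.pullback(loss, Flux.params(model))
  g_ad = back(1.0)[W][1]            # AD gradient of W[1]
  w0 = W[1]
  W[1] = w0 + ε; lp = loss()
  W[1] = w0 - ε; lm = loss()
  W[1] = w0                         # restore the original weight
  g_fd = (lp - lm) / (2ε)           # central difference
  return g_ad, g_fd
end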

@jessebett (Contributor, Author)

using Random
using Flux
using Tracker
using Logging
using IterTools: take, repeatedly
using Statistics
using MLDataUtils
using NPZ

# Download Dataset
exampledir = joinpath("test","example")
#; wget https://github.com/ecreager/csc2541-f19/raw/master/assignment1/adult/adult_train.npz -P $exampledir

args = Dict(
                 :seed => 1,
                 :batch_size => 10,
                 :num_epochs => 5,
                 :lr =>1e-5
                )

# make Python-ordered data Julia-friendly by moving the last dimension first
function first_dim_last(A)
  last_dim = ndims(A)
  return permutedims(A,vcat(last_dim,[1:(last_dim-1);]))
end

train_data = npzread(joinpath(exampledir,"adult_train.npz"))
x = first_dim_last(train_data["x"])
y = first_dim_last(train_data["y"])
FS = size(x)[1]


function make_epochs(dataset,args)
  Random.seed!(args[:seed])
  function make_batches()
    return batchview(shuffleobs(dataset),args[:batch_size])
  end
  return repeatedly(make_batches,args[:num_epochs])
end



function run_zygote_loop()
  Random.seed!(args[:seed])
  model = Chain(
                Dense(FS,1000),
                Dense(1000,1)
               )
  θ=Flux.params(model)

  criterion = (output,target) -> mean(Flux.logitbinarycrossentropy.(output,target))
  optimizer = Flux.ADAM(args[:lr],(0.9,0.99))

  epochs = make_epochs((x,y),args)
  for (i,batches) in enumerate(epochs)
    for batch in batches
      x,y = batch;
      loss, pb = Flux.Zygote.pullback(()->criterion(model(x),y),θ);
      ∇θ = pb(1.);
      Flux.Optimise.update!(optimizer,θ,∇θ);
      @show loss
    end
    println("finished epoch $i")
  end
end

function run_tracker_loop()
  Random.seed!(args[:seed])
  model = Chain(
                Dense(FS,1000),
                Dense(1000,1)
               )
  θ=Flux.params(model)

  criterion = (output,target) -> mean(Flux.logitbinarycrossentropy.(output,target))
  optimizer = Flux.ADAM(args[:lr],(0.9,0.99))

  epochs = make_epochs((x,y),args)
  for (i,batches) in enumerate(epochs)
    for batch in batches
      x,y = batch;
      loss, pb = Flux.Tracker.forward(()->criterion(model(x),y),θ);
      ∇θ = pb(1.);
      Flux.Optimise.update!(optimizer,θ,∇θ);
      @show loss
    end
    println("finished epoch $i")
  end
end

@MikeInnes, by the way, here is the above script with the ability to download and load the dataset.

However, I do not know how to test both the Zygote and Tracker loops in the same branch because Zygote.Params are incompatible with Tracker. What can I do to make the Zygote Flux model Tracker-friendly so I can run both of these without switching branches?
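
One possible approach (a sketch of the same idea that appears later in this thread): wrap every array leaf of the model in Tracker.param so Tracker.forward can see it.

using Flux, Tracker
using Flux: mapleaves

# Hypothetical helper: wrap array leaves in Tracker.param, leave everything else alone.
track(m) = mapleaves(x -> x isa AbstractArray ? Tracker.param(x) : x, m)

model = Chain(Dense(FS, 1000), Dense(1000, 1))   # FS as in the script above
tracked_model = track(model)
θ  = Flux.params(model)            # params for Zygote.pullback
tθ = Flux.params(tracked_model)    # tracked params for Tracker.forward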

@jessebett (Contributor, Author)

Note, again, that the only difference is that Zygote.pullback becomes Tracker.forward, and this results in the above NaNs for Zygote but not for Tracker.

@jessebett (Contributor, Author)

@MikeInnes I have not done any finite differencing yet because implicit -> explicit model utilities aren't in Flux yet, and DiffEqFlux expects Tracker.

Instead, I sketchily made my model Tracker-visible; it would be nice to have utilities for this.

Below you can see that taking the gradients with Zygote results in all NaNs, but Tracker is fine.

# Debugging Zygote Gradient
batches = first(make_epochs((x,y),args))
xi,yi = first(batches)

model = Chain(
              Dense(FS,1000),
              Dense(1000,1)
             )
θ=Flux.params(model)
criterion = (output,target) -> mean(Flux.logitbinarycrossentropy.(output,target))
optimizer = Flux.ADAM(args[:lr],(0.9,0.99))


# Zygote gradients NaN
loss, pb = Flux.Zygote.pullback(()->criterion(model(xi),yi),Zygote.Params([model[1].W]));
any(isnan.(pb(1.)[model[1].W])) #True


# Make the model usable by Tracker
using Flux: mapleaves
function trackerify(leaf)
  if leaf isa AbstractArray
    return Tracker.param(leaf)
  else
    return leaf
  end
end
tracked_model = mapleaves(trackerify,model)

# Tracker gradients do not NaN
tloss, tpb = Tracker.forward(()->criterion(tracked_model(xi),yi),Tracker.Params([tracked_model[1].W]));
loss == Flux.data(tloss)
any(isnan.(tpb(1.)[tracked_model[1].W])) #False

@MikeInnes (Member)

@jessebett I'm happy to dig into this code, but I can't currently run your last script due to missing definitions. Can you provide a project/manifest file + complete script for that last snippet?

@jessebett (Contributor, Author)

Hi @MikeInnes, below are the full versions of the script and the project file.

Note that I am not able to run run_tracker_loop() with Flux#master, even with #883, because update! fails when it tries to update a tracked array of parameters with an untracked value. Not going to miss these errors with Zygote...

However, run_tracker_loop() does run with ]add Flux (without Zygote). Figuring out why |> track on the model still breaks the training loop might also help #833 and make this issue, and others like it, easier to reproduce.

using Random
using Flux
using Tracker
using Logging
using IterTools: take, repeatedly
using Statistics
using MLDataUtils
using NPZ
using Zygote


# Download Dataset
exampledir = joinpath("test","example")
#; wget https://github.com/ecreager/csc2541-f19/raw/master/assignment1/adult/adult_train.npz -P $exampledir

args = Dict(
                 :seed => 1,
                 :batch_size => 10,
                 :num_epochs => 5,
                 :lr =>1e-5
                )

# make Python-ordered data Julia-friendly by moving the last dimension first
function first_dim_last(A)
  last_dim = ndims(A)
  return permutedims(A,vcat(last_dim,[1:(last_dim-1);]))
end

train_data = npzread(joinpath(exampledir,"adult_train.npz"))
x = first_dim_last(train_data["x"])
y = first_dim_last(train_data["y"])
FS = size(x)[1]


function make_epochs(dataset,args)
  Random.seed!(args[:seed])
  function make_batches()
    return batchview(shuffleobs(dataset),args[:batch_size])
  end
  return repeatedly(make_batches,args[:num_epochs])
end



function run_zygote_loop()
  Random.seed!(args[:seed])
  model = Chain(
                Dense(FS,1000),
                Dense(1000,1)
               )
  θ=Flux.params(model)

  criterion = (output,target) -> mean(Flux.logitbinarycrossentropy.(output,target))
  optimizer = Flux.ADAM(args[:lr],(0.9,0.99))

  epochs = make_epochs((x,y),args)
  for (i,batches) in enumerate(epochs)
    for batch in batches
      x,y = batch;
      loss, pb = Flux.Zygote.pullback(()->criterion(model(x),y),θ);
      ∇θ = pb(1.);
      Flux.Optimise.update!(optimizer,θ,∇θ);
      @show loss
    end
    println("finished epoch $i")
  end
end

# Note: `track` (used below) is not defined in this script; presumably it is
# the same idea as the `trackerify` helper further down, e.g.:
track(m) = Flux.mapleaves(x -> x isa AbstractArray ? Tracker.param(x) : x, m)

function run_tracker_loop()
  # Cannot figure out how to run this with Flux#master
  Random.seed!(args[:seed])
  model = Chain(
                Dense(FS,1000),
                Dense(1000,1)
               )|>track
  θ=Flux.params(model)

  criterion = (output,target) -> mean(Flux.logitbinarycrossentropy.(output,target))
  optimizer = Flux.ADAM(args[:lr],(0.9,0.99))

  epochs = make_epochs((x,y),args)
  for (i,batches) in enumerate(epochs)
    for batch in batches
      x,y = batch;
      loss, pb = Tracker.forward(()->criterion(model(x),y),θ);
      ∇θ = pb(1.);
      Flux.Optimise.update!(optimizer,θ,∇θ);
      @show loss
    end
    println("finished epoch $i")
  end
end


# Debugging Zygote Gradient Directly
batches = first(make_epochs((x,y),args))
xi,yi = first(batches)

model = Chain(
              Dense(FS,1000),
              Dense(1000,1)
             )
θ=Flux.params(model)
criterion = (output,target) -> mean(Flux.logitbinarycrossentropy.(output,target))
optimizer = Flux.ADAM(args[:lr],(0.9,0.99))


# Zygote gradients NaN
loss, pb = Flux.Zygote.pullback(()->criterion(model(xi),yi),Zygote.Params([model[1].W]));
any(isnan.(pb(1.)[model[1].W])) #True


# Make the model usable by Tracker
using Flux: mapleaves
function trackerify(leaf)
  if leaf isa AbstractArray
    return Tracker.param(leaf)
  else
    return leaf
  end
end
tracked_model = mapleaves(trackerify,model)

# Tracker gradients do not NaN
tloss, tpb = Tracker.forward(()->criterion(tracked_model(xi),yi),Tracker.Params([tracked_model[1].W]));
loss == Flux.data(tloss)
any(isnan.(tpb(1.)[tracked_model[1].W])) #False
name = "A1"

[deps]
Distributions = "31c24e10-a181-5473-b8eb-7969acd0382f"
DrWatson = "634d3b9d-ee7a-5ddf-bec9-22491ea816e1"
Flux = "587475ba-b771-5e3f-ad9e-33799f191a9c"
IRTools = "7869d1d1-7146-5819-86e3-90919afe41df"
IterTools = "c8e1da08-722c-5040-9ed9-7db0dc04731e"
MLDataUtils = "cc2ba9b6-d476-5e6d-8eaf-a92d5412d41d"
NPZ = "15e1cf62-19b3-5cfa-8e77-841668bca605"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
Tracker = "9f7883ad-71c0-57eb-9f7f-b5c9e6d3789c"
Zygote = "e88e6eb3-aa80-5325-afca-941959d7151f"
ZygoteRules = "700de1a5-db45-46bc-99cf-38207098b444"

@MikeInnes (Member) commented Oct 15, 2019

Is the training loop / use of update! necessary at all there? My understanding was that we could replicate the difference in gradients between Zygote and Tracker directly, without any optimisation involved. It doesn't seem like you're calling run_tracker_loop or run_zygote_loop either. It'll be much easier to debug this if it's as minimal as possible.

Would be great to also have the manifest you're using, so I get the exact same versions of all packages.

@jessebett (Contributor, Author) commented Oct 18, 2019

@MikeInnes Okay here is a minimal version that just computes the gradients with Zygote and Tracker and compares the results:

using Random
using Flux
using Tracker
using MLDataUtils
using Zygote
using Statistics: mean

# dummy data
x = Float32.(rand(113,100000))
y = sum(x.^2,dims=1)
dataset = batchview((x,y),size=256)
x1,y1 = first(dataset)

# Flux Model
model = Chain(
              Dense(113,1000),
              Dense(1000,1)
             )
θ = params(model)

# Compatibility with Tracker
track(m) = fmap(x -> x isa AbstractArray ? Tracker.param(x) : x, m)
t_model= track(model)
t_θ = params(t_model)

# Objective
criterion(logits,y) = mean(Flux.logitbinarycrossentropy.(logits,y))

#Zygote Gradient
loss,back = Flux.Zygote.pullback(()->criterion(model(x1),y1),θ)
grads=back(1.)
gW1 = grads[model[1].W][1]

#Tracker Gradient
t_loss, t_back = Tracker.forward(() -> criterion(t_model(x1),y1),t_θ)
t_grads = t_back(1.)
t_gW1 = t_grads[t_model[1].W][1]

@show gW1    #0.7067871023580545
@show t_gW1  #0.706787102655657 (tracked)

isapprox(gW1,t_gW1) #true

You can see from the results of the print statements that the gradients agree only up to a certain precision. However, if this were tested with isapprox, the discrepancy would not have been caught, as that returns true.
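
For what it's worth, a sketch of a stricter comparison over the full gradient arrays (assuming the grads and t_grads objects from the snippet above):

gW  = grads[model[1].W]
tgW = Tracker.data(t_grads[t_model[1].W])
maximum(abs.(gW .- tgW))            # worst-case elementwise difference
isapprox(gW, tgW; rtol = 1e-6)      # make the tolerance explicit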

I'm not sure whether this causes the divergent behaviour, however. And I'm unable to test with the same package versions loaded, because the new Optimise doesn't seem to be compatible with Tracker now, which I will discuss in #883.

Let me know if you still need the Manifest, and whether it's best to put it in an issue comment.

@jekbradbury
Copy link
Contributor

Those kinds of differences shouldn't be enough to cause divergence/NaNs in any remotely well-conditioned optimization problem. Something else must be off...

@MikeInnes (Member) commented Oct 24, 2019

Right, we previously had an example where the gradients coming back from Zygote had NaNs in them whereas the Tracker gradients did not. That's a clear bug, so if I can reproduce it, that gives us an easy path forward. Yes, a manifest would still be ideal; it can just be your global manifest if you don't want to create a new project. It can go in a GitHub gist or pastebin.

@jessebett (Contributor, Author)

Here are the Manifest, Project, and test files.

https://gist.github.com/jessebett/884cfde5b33aed3dc48802f10610f8d7

@MikeInnes (Member)

Ok, can you reproduce the NaN issue within that project though? The current test file just shows expected behaviour; minor numerical divergence is very unlikely to be causing the bugs you had earlier.

@AStupidBear (Contributor)

logitbinarycrossentropy calls logσ, which calls softplus. Even though the x > 0 branch is the one selected for large inputs, ifelse evaluates both branches, so the pullback of the untaken log1p(exp(x)) branch is still computed, and there 0.0f0 * Inf32 gives NaN32.

This is the definition of softplus:

softplus(x::Real) = ifelse(x > 0, x + log1p(exp(-x)), log1p(exp(x)))
using Flux
softplus'(100f0)           # NaN
log1pexp(x) = log1p(exp(x))
log1p'(exp(100f0))         # 0.0f0
exp'(100f0)                # Inf32
log1pexp'(100f0)           # 0.0f0 * Inf32 == NaN32
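
For reference, a small sketch (not from the thread): the analytic derivative of softplus is simply σ(x), which stays finite at large arguments, so the NaN above is purely an artifact of differentiating through the overflowing branch.

using Flux: σ

σ(100f0)    # 1.0f0, the true derivative of softplus at x = 100
σ(-100f0)   # ≈ 0.0f0, also finite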

@AStupidBear (Contributor) commented Jun 5, 2020

One can define

Zygote.@adjoint function softplus(x::Real)
  y = softplus(x)
  return y, Δ -> (Δ * σ(x),)
end

Then

softplus'(100f0) == 1f0
Zygote.gradient([100f0]) do x
    sum(softplus.(x))
end[1] == [1f0]
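
A possible follow-up check, tying this back to the original report (a sketch only, not verified here): assuming logσ, and hence logitbinarycrossentropy, dispatches to the same softplus method the adjoint above is attached to, the original criterion should no longer produce NaN gradients for large logits.

using Flux, Zygote
using Statistics: mean

criterion(logits, y) = mean(Flux.logitbinarycrossentropy.(logits, y))
g = Zygote.gradient(l -> criterion(l, Float32[1 0]), Float32[100 -100])[1]
any(isnan, g)   # expected: false once the adjoint above is defined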
