Model optimization fails (NaNs) with Zygote.pullback but works with Tracker.forward #876

Open · jessebett opened this issue Sep 27, 2019 · 14 comments

@jessebett (Contributor) commented Sep 27, 2019

@MikeInnes I have a very simple model that does not train on Flux#master due to NaNs from exploding gradients. However, the exact same code works and trains as expected when Zygote.pullback is replaced with Tracker.forward.

Here are the training loops:

args_dict = Dict(
                 :log_dir => datadir("logs"),
                 :seed => 1,
                 :batch_size => 10,
                 :num_epochs => 5,
                 :lr =>1e-5
                )
args_list = dict_list(args_dict)
args=args_list[1]


function run_tracker_loop()
  Random.seed!(args[:seed])
  train_batches, test_batches = utils.gen_epoch(args[:batch_size], seed=args[:seed])
  model = Chain(
                Dense(utils.NUM_FEATURES,1000),
                Dense(1000,1)
               )
  θ=Flux.params(model)

  criterion = (output,target) -> mean(Flux.logitbinarycrossentropy.(output,target))
  optimizer = Flux.ADAM(args[:lr],(0.9,0.99))

  epochs = take(train_batches,args[:num_epochs])
  batches = first(epochs)
  for batch in batches
    x,a,y = batch;
    loss, pullback = Flux.Tracker.forward(()->criterion(model(x),y),θ);
    ∇θ = pullback(1.);
    @show loss
    Flux.Optimise.update!(optimizer,θ,∇θ);
  end
end

function run_zygote_loop()
  Random.seed!(args[:seed])
  train_batches, test_batches = utils.gen_epoch(args[:batch_size], seed=args[:seed])
  model = Chain(
                Dense(utils.NUM_FEATURES,1000),
                Dense(1000,1)
               )
  θ=Flux.params(model)

  criterion = (output,target) -> mean(Flux.logitbinarycrossentropy.(output,target))
  optimizer = Flux.ADAM(args[:lr],(0.9,0.99))

  epochs = take(train_batches,args[:num_epochs])
  batches = first(epochs)
  for batch in batches
    x,a,y = batch;
    loss, pullback = Flux.Zygote.pullback(()->criterion(model(x),y),θ);
    ∇θ = pullback(1.);
    @show loss
    Flux.Optimise.update!(optimizer,θ,∇θ);
  end
end

The training is a bit unstable in the first few steps of gradient descent. You can see this in run_tracker_loop, whose output looks like:

julia> run_tracker_loop()
loss = 4.153761120699346 (tracked)
loss = 8.558516177907586 (tracked)
loss = 77.52821938525885 (tracked)
loss = 22.02223796546459 (tracked)
loss = 2.345904890820384 (tracked)
loss = 62.89235497526825 (tracked)
loss = 2.1650534560903907 (tracked)
loss = 2.3495737865567206 (tracked)
loss = 1.98077632188797 (tracked)

And this eventually converges.

However, running this with run_zygote_loop, the loss becomes NaN during the initial iterations and does not recover:

julia> run_zygote_loop()
loss = 4.153761120699346
loss = 8.747800693288445
loss = 80.12227465640754
loss = NaN
loss = NaN
loss = NaN
loss = NaN
loss = NaN
loss = NaN

The dataset and batch-handling code is in a private repo. If it isn't clear what is happening here, I can add you to it so you can run the scripts that produce these two tests, if you'd like.

However, one thing that's apparent right away is that the loss after the first optimizer step is unequal between Zygote and Tracker. This should not be the case: I've set the random seeds and the code is identical, so unless I'm missing something, this may indicate where the numerical instability is coming from.

@MikeInnes (Member)

Are you able to run any finite-differencing tests to show that the gradients are incorrect? The easiest way to debug would be to start with that failing test and gradually simplify the forward pass.
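
For concreteness, a minimal sketch of such a check (purely illustrative; the fd_check helper, batch names, and step size are not from this thread):

using Flux, Zygote

# Compare Zygote's gradient of a single weight entry against a central
# finite difference of the loss. Assumes `model` and `criterion` built as in
# the snippets in this issue, plus a single batch `(xi, yi)`.
function fd_check(model, criterion, xi, yi; ε = 1f-3)
  W = model[1].W
  loss() = criterion(model(xi), yi)
  _, back = Zygote.pullback(loss, Flux.params(model))
  g_ad = back(1.0)[W][1]            # AD gradient of W[1]
  w0 = W[1]
  W[1] = w0 + ε; lp = loss()
  W[1] = w0 - ε; lm = loss()
  W[1] = w0                         # restore the original weight
  g_fd = (lp - lm) / (2ε)           # central difference
  return g_ad, g_fd
end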

@jessebett (Contributor, Author)

using Random
using Flux
using Tracker
using Logging
using IterTools: take, repeatedly
using Statistics
using MLDataUtils
using NPZ

# Download Dataset
exampledir = joinpath("test","example")
#; wget https://github.com/ecreager/csc2541-f19/raw/master/assignment1/adult/adult_train.npz -P $exampledir

args = Dict(
                 :seed => 1,
                 :batch_size => 10,
                 :num_epochs => 5,
                 :lr =>1e-5
                )

# make Python-ordered data Julia-friendly by moving the last dimension first
function first_dim_last(A)
  last_dim = ndims(A)
  return permutedims(A,vcat(last_dim,[1:(last_dim-1);]))
end

train_data = npzread(joinpath(exampledir,"adult_train.npz"))
x = first_dim_last(train_data["x"])
y = first_dim_last(train_data["y"])
FS = size(x)[1]


function make_epochs(dataset,args)
  Random.seed!(args[:seed])
  function make_batches()
    return batchview(shuffleobs(dataset),args[:batch_size])
  end
  return repeatedly(make_batches,args[:num_epochs])
end



function run_zygote_loop()
  Random.seed!(args[:seed])
  model = Chain(
                Dense(FS,1000),
                Dense(1000,1)
               )
  θ=Flux.params(model)

  criterion = (output,target) -> mean(Flux.logitbinarycrossentropy.(output,target))
  optimizer = Flux.ADAM(args[:lr],(0.9,0.99))

  epochs = make_epochs((x,y),args)
  for (i,batches) in enumerate(epochs)
    for batch in batches
      x,y = batch;
      loss, pb = Flux.Zygote.pullback(()->criterion(model(x),y),θ);
      ∇θ = pb(1.);
      Flux.Optimise.update!(optimizer,θ,∇θ);
      @show loss
    end
    println("finished epoch $i")
  end
end

function run_tracker_loop()
  Random.seed!(args[:seed])
  model = Chain(
                Dense(FS,1000),
                Dense(1000,1)
               )
  θ=Flux.params(model)

  criterion = (output,target) -> mean(Flux.logitbinarycrossentropy.(output,target))
  optimizer = Flux.ADAM(args[:lr],(0.9,0.99))

  epochs = make_epochs((x,y),args)
  for (i,batches) in enumerate(epochs)
    for batch in batches
      x,y = batch;
      loss, pb = Flux.Tracker.forward(()->criterion(model(x),y),θ);
      ∇θ = pb(1.);
      Flux.Optimise.update!(optimizer,θ,∇θ);
      @show loss
    end
    println("finished epoch $i")
  end
end

@MikeInnes, by the way, here is the above script with the ability to download and load the dataset.

However, I do not know how to test both the Zygote and Tracker loops in the same branch because Zygote.Params are incompatible with Tracker. What can I do to make the Zygote Flux model Tracker-friendly so I can run both of these without switching branches?
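
One possible approach (a sketch of the same idea that appears later in this thread): wrap every array leaf of the model in Tracker.param so Tracker.forward can see it.

using Flux, Tracker
using Flux: mapleaves

# Hypothetical helper: wrap array leaves in Tracker.param, leave everything else alone.
track(m) = mapleaves(x -> x isa AbstractArray ? Tracker.param(x) : x, m)

model = Chain(Dense(FS, 1000), Dense(1000, 1))   # FS as in the script above
tracked_model = track(model)
θ  = Flux.params(model)            # params for Zygote.pullback
tθ = Flux.params(tracked_model)    # tracked params for Tracker.forward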

@jessebett (Contributor, Author)

Note, again, that the only difference is that Zygote.pullback becomes Tracker.forward, and this results in the above NaNs for Zygote but not for Tracker.

@jessebett (Contributor, Author)

@MikeInnes I have not done any finite differencing yet because implicit -> explicit model utilities aren't in Flux yet, and DiffEqFlux expects Tracker.

Instead, I sketchily made my model Tracker-visible; it would be nice to have utilities for this.

Below you can see that taking the gradients with Zygote results in all NaNs, but Tracker is fine.

# Debugging Zygote Gradient
batches = first(make_epochs((x,y),args))
xi,yi = first(batches)

model = Chain(
              Dense(FS,1000),
              Dense(1000,1)
             )
θ=Flux.params(model)
criterion = (output,target) -> mean(Flux.logitbinarycrossentropy.(output,target))
optimizer = Flux.ADAM(args[:lr],(0.9,0.99))


# Zygote gradients NaN
loss, pb = Flux.Zygote.pullback(()->criterion(model(xi),yi),Zygote.Params([model[1].W]));
any(isnan.(pb(1.)[model[1].W])) #True


# Make the model usable by Tracker
using Flux: mapleaves
function trackerify(leaf)
  if leaf isa AbstractArray
    return Tracker.param(leaf)
  else
    return leaf
  end
end
tracked_model = mapleaves(trackerify,model)

# Tracker gradients do not NaN
tloss, tpb = Tracker.forward(()->criterion(tracked_model(xi),yi),Tracker.Params([tracked_model[1].W]));
loss == Flux.data(tloss)
any(isnan.(tpb(1.)[tracked_model[1].W])) #False

@MikeInnes (Member)

@jessebett I'm happy to dig into this code, but I can't currently run your last script due to missing definitions. Can you provide a project/manifest file + complete script for that last snippet?

@jessebett (Contributor, Author)

Hi @MikeInnes, below are the full versions of the script and the project file.

Note that I am not able to run run_tracker_loop() with Flux#master, even with #883, because update! fails when it tries to update a tracked array of parameters with an untracked value. Not going to miss these errors with Zygote...

However, run_tracker_loop() does run with ]add Flux (without Zygote). Figuring out why |> track on the model still breaks the training loop might also help #833 and make this issue, and others like it, easier to reproduce.

using Random
using Flux
using Tracker
using Logging
using IterTools: take, repeatedly
using Statistics
using MLDataUtils
using NPZ
using Zygote


# Download Dataset
exampledir = joinpath("test","example")
#; wget https://github.com/ecreager/csc2541-f19/raw/master/assignment1/adult/adult_train.npz -P $exampledir

args = Dict(
                 :seed => 1,
                 :batch_size => 10,
                 :num_epochs => 5,
                 :lr =>1e-5
                )

# make Python-ordered data Julia-friendly by moving the last dimension first
function first_dim_last(A)
  last_dim = ndims(A)
  return permutedims(A,vcat(last_dim,[1:(last_dim-1);]))
end

train_data = npzread(joinpath(exampledir,"adult_train.npz"))
x = first_dim_last(train_data["x"])
y = first_dim_last(train_data["y"])
FS = size(x)[1]


function make_epochs(dataset,args)
  Random.seed!(args[:seed])
  function make_batches()
    return batchview(shuffleobs(dataset),args[:batch_size])
  end
  return repeatedly(make_batches,args[:num_epochs])
end



function run_zygote_loop()
  Random.seed!(args[:seed])
  model = Chain(
                Dense(FS,1000),
                Dense(1000,1)
               )
  θ=Flux.params(model)

  criterion = (output,target) -> mean(Flux.logitbinarycrossentropy.(output,target))
  optimizer = Flux.ADAM(args[:lr],(0.9,0.99))

  epochs = make_epochs((x,y),args)
  for (i,batches) in enumerate(epochs)
    for batch in batches
      x,y = batch;
      loss, pb = Flux.Zygote.pullback(()->criterion(model(x),y),θ);
      ∇θ = pb(1.);
      Flux.Optimise.update!(optimizer,θ,∇θ);
      @show loss
    end
    println("finished epoch $i")
  end
end

# Note: `track` (used below) is not defined in this script; presumably it is
# the same idea as the `trackerify` helper further down, e.g.:
track(m) = Flux.mapleaves(x -> x isa AbstractArray ? Tracker.param(x) : x, m)

function run_tracker_loop()
  # Cannot figure out how to run this with Flux#master
  Random.seed!(args[:seed])
  model = Chain(
                Dense(FS,1000),
                Dense(1000,1)
               )|>track
  θ=Flux.params(model)

  criterion = (output,target) -> mean(Flux.logitbinarycrossentropy.(output,target))
  optimizer = Flux.ADAM(args[:lr],(0.9,0.99))

  epochs = make_epochs((x,y),args)
  for (i,batches) in enumerate(epochs)
    for batch in batches
      x,y = batch;
      loss, pb = Tracker.forward(()->criterion(model(x),y),θ);
      ∇θ = pb(1.);
      Flux.Optimise.update!(optimizer,θ,∇θ);
      @show loss
    end
    println("finished epoch $i")
  end
end


# Debugging Zygote Gradient Directly
batches = first(make_epochs((x,y),args))
xi,yi = first(batches)

model = Chain(
              Dense(FS,1000),
              Dense(1000,1)
             )
θ=Flux.params(model)
criterion = (output,target) -> mean(Flux.logitbinarycrossentropy.(output,target))
optimizer = Flux.ADAM(args[:lr],(0.9,0.99))


# Zygote gradients NaN
loss, pb = Flux.Zygote.pullback(()->criterion(model(xi),yi),Zygote.Params([model[1].W]));
any(isnan.(pb(1.)[model[1].W])) #True


# Make the model usable by Tracker
using Flux: mapleaves
function trackerify(leaf)
  if leaf isa AbstractArray
    return Tracker.param(leaf)
  else
    return leaf
  end
end
tracked_model = mapleaves(trackerify,model)

# Tracker gradients do not NaN
tloss, tpb = Tracker.forward(()->criterion(tracked_model(xi),yi),Tracker.Params([tracked_model[1].W]));
loss == Flux.data(tloss)
any(isnan.(tpb(1.)[tracked_model[1].W])) #False
name = "A1"

[deps]
Distributions = "31c24e10-a181-5473-b8eb-7969acd0382f"
DrWatson = "634d3b9d-ee7a-5ddf-bec9-22491ea816e1"
Flux = "587475ba-b771-5e3f-ad9e-33799f191a9c"
IRTools = "7869d1d1-7146-5819-86e3-90919afe41df"
IterTools = "c8e1da08-722c-5040-9ed9-7db0dc04731e"
MLDataUtils = "cc2ba9b6-d476-5e6d-8eaf-a92d5412d41d"
NPZ = "15e1cf62-19b3-5cfa-8e77-841668bca605"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
Tracker = "9f7883ad-71c0-57eb-9f7f-b5c9e6d3789c"
Zygote = "e88e6eb3-aa80-5325-afca-941959d7151f"
ZygoteRules = "700de1a5-db45-46bc-99cf-38207098b444"

@MikeInnes (Member) commented Oct 15, 2019

Is the training loop / use of update! necessary at all there? My understanding was that we could replicate the difference in gradients between Zygote and Tracker directly, without any optimisation involved. It doesn't seem like you're calling run_tracker_loop or run_zygote_loop either. It'll be much easier to debug this if it's as minimal as possible.

Would be great to also have the manifest you're using, so I get the exact same versions of all packages.

@jessebett (Contributor, Author) commented Oct 18, 2019

@MikeInnes Okay here is a minimal version that just computes the gradients with Zygote and Tracker and compares the results:

using Random
using Flux
using Tracker
using MLDataUtils
using Zygote
using Statistics: mean

# dummy data
x = Float32.(rand(113,100000))
y = sum(x.^2,dims=1)
dataset = batchview((x,y),size=256)
x1,y1 = first(dataset)

# Flux Model
model = Chain(
              Dense(113,1000),
              Dense(1000,1)
             )
θ = params(model)

# Compatibility with Tracker
track(m) = fmap(x -> x isa AbstractArray ? Tracker.param(x) : x, m)
t_model= track(model)
t_θ = params(t_model)

# Objective
criterion(logits,y) = mean(Flux.logitbinarycrossentropy.(logits,y))

#Zygote Gradient
loss,back = Flux.Zygote.pullback(()->criterion(model(x1),y1),θ)
grads=back(1.)
gW1 = grads[model[1].W][1]

#Tracker Gradient
t_loss, t_back = Tracker.forward(() -> criterion(t_model(x1),y1),t_θ)
t_grads = t_back(1.)
t_gW1 = t_grads[t_model[1].W][1]

@show gW1    #0.7067871023580545
@show t_gW1  #0.706787102655657 (tracked)

isapprox(gW1,t_gW1) #true

You can see from the results of the print statements that the gradients agree only up to a certain precision. However, if this were tested with isapprox, the discrepancy would not have been caught, as that returns true.
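
For what it's worth, a sketch of a stricter comparison over the full gradient arrays (assuming the grads and t_grads objects from the snippet above):

gW  = grads[model[1].W]
tgW = Tracker.data(t_grads[t_model[1].W])
maximum(abs.(gW .- tgW))            # worst-case elementwise difference
isapprox(gW, tgW; rtol = 1e-6)      # make the tolerance explicit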

I'm not sure whether this causes the divergent behaviour, however. And I'm unable to test with the same package versions loaded, because the new Optimise doesn't seem to be compatible with Tracker now, which I will discuss in #883.

Let me know if you still need the Manifest, and whether it's best to put it in an issue comment.

@jekbradbury
Copy link
Contributor

Those kinds of differences shouldn't be enough to cause divergence/NaNs in any remotely well-conditioned optimization problem. Something else must be off...

@MikeInnes (Member) commented Oct 24, 2019

Right, we previously had an example where the gradients coming back from Zygote had NaNs in them whereas the Tracker gradients did not. That's a clear bug, so if I can reproduce it, that gives us an easy path forward. Yes, a manifest would still be ideal; it can just be your global manifest if you don't want to create a new project. It can go in a GitHub gist or pastebin.

@jessebett (Contributor, Author)

Here are the Manifest, Project, and test files.

https://gist.github.com/jessebett/884cfde5b33aed3dc48802f10610f8d7

@MikeInnes (Member)

Ok, can you reproduce the NaN issue within that project though? The current test file just shows expected behaviour; minor numerical divergence is very unlikely to be causing the bugs you had earlier.

@AStupidBear (Contributor)

logitbinarycrossentropy calls logσ, which calls softplus. Even though the x > 0 branch is the one selected for large inputs, ifelse evaluates both branches, so the pullback of the untaken log1p(exp(x)) branch is still computed, and there 0.0f0 * Inf32 gives NaN32.

This is the definition of softplus:

softplus(x::Real) = ifelse(x > 0, x + log1p(exp(-x)), log1p(exp(x)))
using Flux
softplus'(100f0)           # NaN
log1pexp(x) = log1p(exp(x))
log1p'(exp(100f0))         # 0.0f0
exp'(100f0)                # Inf32
log1pexp'(100f0)           # 0.0f0 * Inf32 == NaN32
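
For reference, a small sketch (not from the thread): the analytic derivative of softplus is simply σ(x), which stays finite at large arguments, so the NaN above is purely an artifact of differentiating through the overflowing branch.

using Flux: σ

σ(100f0)    # 1.0f0, the true derivative of softplus at x = 100
σ(-100f0)   # ≈ 0.0f0, also finite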

@AStupidBear (Contributor) commented Jun 5, 2020

One can define

Zygote.@adjoint function softplus(x::Real)
  y = softplus(x)
  return y, Δ -> (Δ * σ(x),)
end

Then

softplus'(100f0) == 1f0
Zygote.gradient([100f0]) do x
    sum(softplus.(x))
end[1] == [1f0]
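
A possible follow-up check, tying this back to the original report (a sketch only, not verified here): assuming logσ, and hence logitbinarycrossentropy, dispatches to the same softplus method the adjoint above is attached to, the original criterion should no longer produce NaN gradients for large logits.

using Flux, Zygote
using Statistics: mean

criterion(logits, y) = mean(Flux.logitbinarycrossentropy.(logits, y))
g = Zygote.gradient(l -> criterion(l, Float32[1 0]), Float32[100 -100])[1]
any(isnan, g)   # expected: false once the adjoint above is defined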
