
Preallocation Option for vjp calculation #671

Merged · 10 commits · Feb 5, 2022

Conversation

ba2tro (Contributor) commented Jan 6, 2022:

Added a precaching feature to avoid heap allocations in the vjp calculation during the reverse pass.

@@ -40,11 +40,25 @@ struct FastDense{F,F2} <: FastLayer
σ::F
initial_params::F2
bias::Bool
precache::Bool
Member:
there's no need to store this. If precache=false, then just store nothing and check for that in the function.

@@ -40,11 +40,25 @@ struct FastDense{F,F2} <: FastLayer
σ::F
initial_params::F2
bias::Bool
precache::Bool
cs :: NamedTuple
Member:
Use informative variable names. cache instead of cs.

Also, this is not type stable; instead, parameterize it as ::C.
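
A minimal sketch of what those two suggestions amount to (the FastLayer stub, field names, and layout are illustrative, not the final implementation):

```julia
abstract type FastLayer end  # stands in for DiffEqFlux's FastLayer in this sketch

struct FastDense{F,F2,C} <: FastLayer
    out::Int
    in::Int
    σ::F
    initial_params::F2
    bias::Bool
    cache::C  # NamedTuple of preallocated buffers, or `nothing` when precaching is off
end

# No separate `precache::Bool` field: the call/adjoint code just branches on the cache,
# e.g. f.cache === nothing ? allocating_path(f, x, p) : cached_path(f, x, p)
```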

@@ -78,28 +92,53 @@ ZygoteRules.@adjoint function (f::FastDense)(x,p)
y = f.σ.(r)
Member:
you missed this allocation and some before.

Comment on lines 96 to 106
if typeof(f.σ) <: typeof(tanh)
f.cs.zbar = ȳ .* (1 .- y.^2)
elseif typeof(f.σ) <: typeof(identity)
f.cs.zbar = ȳ
else
f.cs.zbar = ȳ .* ForwardDiff.derivative.(f.σ,r)
end
f.cs.Wbar = f.cs.zbar * x'
f.cs.bbar = f.cs.zbar
f.cs.xbar = W' * f.cs.zbar
f.cs.pbar = if f.bias == true
Member:
These lines all still allocate. They need .=, mul!, etc.
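
For reference, a hedged sketch of in-place versions of those statements, assuming cache buffers zbar, Wbar, bbar, and xbar shaped to match the inputs (an illustration of the .=/mul! pattern, not the PR's exact code):

```julia
using LinearAlgebra: mul!
import ForwardDiff

# In-place vjp for a dense layer: every result is written into a preallocated buffer.
function dense_vjp!(cache, σ, W, x, r, y, ȳ)
    if σ === tanh
        cache.zbar .= ȳ .* (1 .- y .^ 2)
    elseif σ === identity
        cache.zbar .= ȳ
    else
        cache.zbar .= ȳ .* ForwardDiff.derivative.(σ, r)
    end
    mul!(cache.Wbar, cache.zbar, x')  # Wbar = zbar * x'
    cache.bbar .= cache.zbar          # bbar = zbar
    mul!(cache.xbar, W', cache.zbar)  # xbar = W' * zbar
    return cache
end
```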

tmp = typeof(f.cs.bbar) <: AbstractVector ? #how to find if bbar is AbstractVector and allocate its shape and size
vec(vcat(vec(f.cs.Wbar),f.cs.bbar)) :
vec(vcat(vec(f.cs.Wbar),sum(f.cs.bbar,dims=2)))
ifgpufree(f.cs.bbar)
Member:
If it's in the cache, then don't free it.

f.cs.xbar = W' * f.cs.zbar
f.cs.pbar = if f.bias == true
tmp = typeof(f.cs.bbar) <: AbstractVector ? #how to find if bbar is AbstractVector and allocate its shape and size
vec(vcat(vec(f.cs.Wbar),f.cs.bbar)) :
Member:
These are allocating statements.

ChrisRackauckas (Member):
It's a start but still has a long way to go. Need to test gradient accuracy in https://github.com/SciML/DiffEqFlux.jl/blob/master/test/fast_layers.jl (and inaccuracy in second calls), and should add a test for its usage in neural ODEs in https://github.com/SciML/DiffEqFlux.jl/blob/master/test/neural_de.jl and https://github.com/SciML/DiffEqFlux.jl/blob/master/test/neural_de_gpu.jl

ba2tro (Contributor Author) commented Jan 6, 2022:

Thanks for the feedback : ), I'll fix these issues

ChrisRackauckas (Member):
What's the status here?

The parameter numcols can be specified together with precache=true to give the maximum number of columns in the input(s); it otherwise defaults to 1.
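
A usage sketch of the option described in that commit (assuming the FastDense keyword interface added in this PR, as exercised by the tests below):

```julia
using DiffEqFlux, Zygote

fdc = FastDense(2, 25, tanh, precache = true, numcols = 4)  # buffers sized for ≤ 4 columns
fd  = FastDense(2, 25, tanh)                                # uncached layer for comparison

p = initial_params(fd)
x = rand(Float32, 2, 4)

# Gradients should agree; the cached layer avoids heap allocations in the vjp.
gc = Zygote.gradient((x, p) -> sum(fdc(x, p)), x, p)
g  = Zygote.gradient((x, p) -> sum(fd(x, p)),  x, p)
```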
ba2tro (Contributor Author) commented Jan 24, 2022:

Just pushed the required updates for allowing matrix inputs. It takes views when the number of columns in the input is less than the pre-specified number; otherwise everything is done with the full-size preallocated buffers (1 column by default).
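
Roughly, the buffer selection described above could look like this illustrative helper (not the PR's exact code):

```julia
# Use the full preallocated buffer when the input has `numcols` columns,
# otherwise a view of its leading columns.
select_buffer(buf::AbstractMatrix, x::AbstractMatrix, numcols::Int) =
    size(x, 2) < numcols ? view(buf, :, 1:size(x, 2)) : buf

ybuf = zeros(Float32, 25, 4)   # preallocated for numcols = 4
xbig = rand(Float32, 2, 4)
xsml = rand(Float32, 2, 2)
select_buffer(ybuf, xbig, 4)   # the full 25×4 buffer
select_buffer(ybuf, xsml, 4)   # a 25×2 view into it
```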

Comment on lines 96 to 98
@test ! iszero(grads[x])
@test ! iszero(grads[node.p])

Member:
We should test that these gradients match the non-caching one.

@@ -38,6 +44,12 @@ fsgrad = Flux.Zygote.gradient((x,p)->sum(fs(x,p)),x,pd)
@test fdgrad[1] ≈ fsgrad[1]
@test fdgrad[2] ≈ fsgrad[2] rtol=1e-5

fdcgrad = Flux.Zygote.gradient((x,p)->sum(fdc(x,p)),x,pd)
@test fdgrad[1] ≈ fdcgrad[1]
@test fdgrad[2] ≈ fdcgrad[2] rtol=1e-5
Member:
why so high of a tolerance? Seems like an issue?

Contributor Author:
Would 1e-9 be OK? Any specific value?

Member:
1e-9 is fine. For something that's just changing to no allocs, I would expect it to pass at like 1e-12 at least.

@test fdgrad[1] ≈ fdcgrad[1]
@test fdgrad[2] ≈ fdcgrad[2] rtol=1e-5
@allocated fdc(x, pd);
@test @allocated fdc(x, pd) == 1024
Member:
What are these allocations from?

zbar = ȳ .* (1 .- y.^2)
elseif typeof(f.σ) <: typeof(identity)
zbar = ȳ
cols = length(size(x)) == 1 ? 1 : size(x)[2]
Member:
Suggested change
cols = length(size(x)) == 1 ? 1 : size(x)[2]
cols = size(x,2)

ba2tro (Contributor Author) Jan 24, 2022:
This would cause an error when x::AbstractVector but numcols > 1, although with the separate dispatches you suggested below it won't occur.

zbar = ȳ
cols = length(size(x)) == 1 ? 1 : size(x)[2]
if !isgpu(p)
f.cache.W .= @view p[reshape(1:(f.out*f.in),f.out,f.in)]
Member:
why is this cached?

Contributor Author:
Did this just to avoid the pointer allocation from taking @view. Would you prefer it uncached?

Contributor Author:
If uncached, it causes further allocations when its transpose is taken here.

ChrisRackauckas (Member) Jan 24, 2022:
Interesting, transpose of a view allocates?

cache = nothing
end
new{typeof(σ), typeof(initial_params), typeof(cache)}(out,in,σ,initial_params,cache,bias,numcols)
# new{typeof(σ),typeof(initial_params)}(out,in,σ,initial_params,bias)
end
end

# (f::FastDense)(x,p) = f.σ.(reshape(uview(p,1:(f.out*f.in)),f.out,f.in)*x .+ uview(p,(f.out*f.in+1):lastindex(p)))
(f::FastDense)(x,p) = ((f.bias == true) ? (f.σ.(reshape(p[1:(f.out*f.in)],f.out,f.in)*x .+ p[(f.out*f.in+1):end])) : (f.σ.(reshape(p[1:(f.out*f.in)],f.out,f.in)*x)))

ZygoteRules.@adjoint function (f::FastDense)(x,p)
Member:
Make this two separate dispatches, one for x::AbstractVector and one for x::AbstractMatrix. That will make it a lot simpler.
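
One way the split could be structured (a sketch with a hypothetical dense_pullback helper, one method per input shape; the cached, in-place versions from this PR would follow the same dispatch pattern):

```julia
# Vector input: a single sample. zbar and xbar are vectors, Wbar is an outer product.
function dense_pullback(W::AbstractMatrix, x::AbstractVector, zbar::AbstractVector)
    Wbar = zbar * x'        # out × in
    bbar = zbar             # out
    xbar = W' * zbar        # in
    return Wbar, bbar, xbar
end

# Matrix input: columns are the batch. The bias gradient reduces over the batch dimension.
function dense_pullback(W::AbstractMatrix, x::AbstractMatrix, zbar::AbstractMatrix)
    Wbar = zbar * x'             # out × in
    bbar = sum(zbar, dims = 2)   # out × 1
    xbar = W' * zbar             # in × batch
    return Wbar, bbar, xbar
end
```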

ba2tro and others added 2 commits January 31, 2022 18:34
Added separate dispatches for vector and matrix inputs; fixed tests and some allocations.
ba2tro (Contributor Author) commented Jan 31, 2022:

This return is causing the allocations here. Is there any workaround for this?

ChrisRackauckas (Member):
You can have a preallocated vector for the return that you write into. Indeed it seems to instantiate the view.

Fixed some statements causing runtime allocations in the adjoint calculation.
ba2tro (Contributor Author) commented Feb 1, 2022:

Making y a preallocated vector reduced some allocations, but the FastDense_adjoint that is also being returned here seems to be the main problem.

Just for experimenting, I returned nothing, y — i.e., nothing in place of y and y in place of FastDense_adjoint — and with the following @allocated call it returned just 176, which suggests FastDense_adjoint is what's allocating.

y is just a placeholder here because if nothing is returned in place of FastDense_adjoint then Zygote.pullback will error out.

[screenshot of the @allocated output]

if typeof(f.cache) <: Nothing
y,FastDense_adjoint
else
@view(f.cache.y[:,1:f.cache.cols[1]]),FastDense_adjoint
Member:
have an out array that you write into and then return, instead of returning a view which will allocate the mutable struct of the view itself. That's probably the 176.
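
A small sketch of that suggestion: copy the active columns into a preallocated output array and return that, rather than returning the SubArray wrapper (buffer names are assumptions):

```julia
# Copy the first `cols` columns of the cached buffer into a preallocated output.
# Returning `out` avoids materializing a new view object on every call.
function copy_output!(out::AbstractMatrix, ybuf::AbstractMatrix, cols::Int)
    out .= @view ybuf[:, 1:cols]   # assumes size(out) == (size(ybuf, 1), cols)
    return out
end
```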

Contributor Author:
Can we do anything about this view without knowing that numcols will equal cols?

Contributor Author:
We can return the whole y if numcols equals cols, and the view if cols is smaller.

else
r = W*x
f.cache.yvec,FastDense_adjoint
ba2tro (Contributor Author) Feb 1, 2022:
I am talking about this one. After replacing the view with a preallocated array yvec, the allocations decrease by ~400 bytes, but it still allocates ~1400, which disappears when FastDense_adjoint is removed from above — although we can't compute anything if it's removed.

Member:
oh yes that closure will need to allocate. Don't worry about that for now. If that's all that's left, then we're good. Let's try to get this finished, merged, and then talk about what to do here.

Contributor Author:
This seems good to run tests on

@@ -38,6 +44,12 @@ fsgrad = Flux.Zygote.gradient((x,p)->sum(fs(x,p)),x,pd)
@test fdgrad[1] ≈ fsgrad[1]
@test fdgrad[2] ≈ fsgrad[2] rtol=1e-5

fdcgrad = Flux.Zygote.gradient((x,p)->sum(fdc(x,p)),x,pd)
@test fdgrad[1] ≈ fdcgrad[1]
Member:
lower tolerance

Comment on lines +63 to +65
gradsnc = Zygote.gradient(()->sum(node(x)),Flux.params(x,node))
@test ! iszero(gradsnc[x])
@test ! iszero(gradsnc[node.p])
Member:
check that this matches the one without caching to a low tolerance. Set the ODE solver tolerance low for this test.
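
A sketch of that check (names follow the surrounding test file; fastdudt is assumed to be the uncached counterpart of fastcdudt, and the p keyword is used so both networks share parameters):

```julia
# Tight solver tolerances so the cached and uncached gradients are directly comparable.
nodenc = NeuralODE(fastdudt, tspan, Tsit5(), abstol=1e-12, reltol=1e-12,
                   save_everystep=false, save_start=false)
nodec  = NeuralODE(fastcdudt, tspan, Tsit5(), abstol=1e-12, reltol=1e-12,
                   save_everystep=false, save_start=false, p=nodenc.p)

gradsnc = Zygote.gradient(() -> sum(nodenc(x)), Flux.params(x, nodenc))
gradsc  = Zygote.gradient(() -> sum(nodec(x)),  Flux.params(x, nodec))

@test gradsnc[x] ≈ gradsc[x] rtol=1e-9
@test gradsnc[nodenc.p] ≈ gradsc[nodec.p] rtol=1e-9
```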

gradsc = Zygote.gradient(()->sum(node(x)),Flux.params(x,node))
@test ! iszero(gradsc[x])
@test ! iszero(gradsc[node.p])
@test gradsnc[x] ≈ gradsc[x]
Member:
tolerance on here?

Comment on lines 101 to 126
grads = Zygote.gradient(()->sum(node(xs)),Flux.params(xs,node))
@test ! iszero(grads[xs])
@test ! iszero(grads[node.p])

node = NeuralODE(fastcdudt,tspan,Tsit5(),save_everystep=false,save_start=false,sensealg=TrackerAdjoint())
grads = Zygote.gradient(()->sum(node(x)),Flux.params(x,node))
@test ! iszero(grads[x])
@test ! iszero(grads[node.p])

grads = Zygote.gradient(()->sum(node(xs)),Flux.params(xs,node))
@test ! iszero(grads[xs])
@test ! iszero(grads[node.p])

goodgrad = grads[node.p]
p = node.p

node = NeuralODE(fastcdudt,tspan,Tsit5(),save_everystep=false,save_start=false, sensealg=BacksolveAdjoint(),p=p)
grads = Zygote.gradient(()->sum(node(x)),Flux.params(x,node))
@test ! iszero(grads[x])
@test ! iszero(grads[node.p])

grads = Zygote.gradient(()->sum(node(xs)),Flux.params(xs,node))
@test !iszero(grads[xs])
@test ! iszero(grads[node.p])
goodgrad2 = grads[node.p]
@test goodgrad ≈ goodgrad2
Member:
etc

Comment on lines 193 to 218
node = NeuralODE(fastcdudt,tspan,Tsit5(),save_everystep=false,save_start=false)
grads = Zygote.gradient(()->sum(node(x)),Flux.params(x,node))
@test ! iszero(grads[x])
@test ! iszero(grads[node.p])

@test_broken grads = Zygote.gradient(()->sum(node(xs)),Flux.params(xs,node)) isa Tuple
@test_broken ! iszero(grads[xs])
@test_broken ! iszero(grads[node.p])

node = NeuralODE(fastcdudt,tspan,Tsit5(),saveat=0.0:0.1:1.0)
grads = Zygote.gradient(()->sum(node(x)),Flux.params(x,node))
@test ! iszero(grads[x])
@test ! iszero(grads[node.p])

@test_broken grads = Zygote.gradient(()->sum(node(xs)),Flux.params(xs,node)) isa Tuple
@test_broken ! iszero(grads[xs])
@test_broken ! iszero(grads[node.p])

node = NeuralODE(fastcdudt,tspan,Tsit5(),saveat=0.1)
grads = Zygote.gradient(()->sum(node(x)),Flux.params(x,node))
@test ! iszero(grads[x])
@test ! iszero(grads[node.p])

@test_broken grads = Zygote.gradient(()->sum(node(xs)),Flux.params(xs,node)) isa Tuple
@test_broken ! iszero(grads[xs])
@test_broken ! iszero(grads[node.p])
Member:
It would be easier to read if the cached test is paired right next to the uncached test

@@ -244,8 +334,21 @@ grads = Zygote.gradient(()->sum(sode(x)),Flux.params(x,sode))
@test ! iszero(grads[sode.p])
@test ! iszero(grads[sode.p][end])

sode = NeuralDSDE(fastcdudt,fastcdudt2,(0.0f0,.1f0),SOSRI(),saveat=0.0:0.01:0.1)
Member:
if you set RNG seeds, do these gradients match the uncached versions? That's a good test.
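
A rough sketch of that check, assuming fastdudt/fastdudt2 are the uncached counterparts of fastcdudt/fastcdudt2 and that NeuralDSDE accepts a p keyword like NeuralODE does (reseeding before each solve so both see the same noise realization):

```julia
using Random

sodenc = NeuralDSDE(fastdudt, fastdudt2, (0.0f0, 0.1f0), SOSRI(), saveat=0.0:0.01:0.1)
sodec  = NeuralDSDE(fastcdudt, fastcdudt2, (0.0f0, 0.1f0), SOSRI(), saveat=0.0:0.01:0.1, p=sodenc.p)

Random.seed!(100)
gradsnc = Zygote.gradient(() -> sum(sodenc(x)), Flux.params(x, sodenc))
Random.seed!(100)
gradsc  = Zygote.gradient(() -> sum(sodec(x)), Flux.params(x, sodec))

@test gradsnc[x] ≈ gradsc[x] rtol=1e-5
@test gradsnc[sodenc.p] ≈ gradsc[sodec.p] rtol=1e-5
```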

Comment on lines 413 to 424

fastcddudt = FastChain(FastDense(6,50,tanh,numcols=size(xs)[2],precache=true),FastDense(50,2,numcols=size(xs)[2],precache=true))
NeuralCDDE(fastcddudt,(0.0f0,2.0f0),(p,t)->zero(x),(1f-1,2f-1),MethodOfSteps(Tsit5()),saveat=0.1)(x)
dode = NeuralCDDE(fastcddudt,(0.0f0,2.0f0),(p,t)->zero(x),(1f-1,2f-1),MethodOfSteps(Tsit5()),saveat=0.0:0.1:2.0)

grads = Zygote.gradient(()->sum(dode(x)),Flux.params(x,dode))
@test ! iszero(grads[x])
@test ! iszero(grads[dode.p])

@test_broken grads = Zygote.gradient(()->sum(dode(xs)),Flux.params(xs,dode)) isa Tuple
@test_broken ! iszero(grads[xs])
@test ! iszero(grads[dode.p])
Member:
Check against the uncached.

Made corrections to existing tests and added a couple of new tests.
ChrisRackauckas (Member):
Test failure

ba2tro (Contributor Author) commented Feb 5, 2022:

On a simple adjoint calculation we get a ~2x speedup with precaching.
[screenshot of the benchmark output]

ChrisRackauckas merged commit 30274ae into SciML:master on Feb 5, 2022
ChrisRackauckas mentioned this pull request on Feb 5, 2022
ba2tro deleted the precache branch on February 7, 2022 05:52
ba2tro restored the precache branch on February 7, 2022 05:53
ba2tro deleted the precache branch on February 11, 2022 13:42