Add SimpleGMRES implementation #366

Merged · 14 commits into SciML:main · Sep 20, 2023

Conversation

@avik-pal (Member) commented Aug 22, 2023

  • Working version of GMRES
  • If BlockDiagonal, then check if we can use a faster algorithm
    • Uniform Square Case
    • Non-Uniform Square Blocks
  • Fast Dispatch for Structured Block Diagonal Matrices
    • Uniform Square Case
    • Non-Uniform Square Blocks
  • Improve the base GMRES code
    • Performance is really bad compared to Krylov.jl
    • Figure out how to do the inner linsolves in place
    • Support Preconditioners: GMRES is really bad without preconditioners
    • FIXMEs in the code
  • Tests
  • Needs testing on GPU
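
For context, the solver plugs into the usual LinearSolve.jl solve interface. A minimal usage sketch follows (the restart/blocksize keywords and the .u/.resid fields are the ones exercised later in this thread; the test matrix itself is made up for illustration):

using LinearAlgebra, LinearSolve

# Illustrative, well-conditioned dense system.
A = Matrix(10.0I, 100, 100) .+ randn(100, 100) ./ 10
b = randn(100)

prob = LinearProblem(A, b)

# Plain restarted GMRES; pass `blocksize` as well when A is block-diagonal
# with uniform square blocks to hit the fast batched path.
sol = solve(prob, SimpleGMRES(; restart = 20))

sol.u      # solution vector
sol.resid  # residual norm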

@ai-maintainer (bot) left a comment

AI-Maintainer Review for PR - Add SimpleGMRES implementation

Title and Description 👍

The Title and Description are clear and focused
The title and description of the pull request are clear and focused. They effectively communicate the purpose of the changes, which is to add a SimpleGMRES implementation. The description provides a checklist of tasks that need to be completed, indicating the different aspects of the implementation that are being worked on.

Scope of Changes 👍

The changes are narrowly focused
The changes in this pull request appear to be focused on a single issue, which is the addition of the SimpleGMRES implementation. The author is working on various aspects of this implementation, such as handling different types of matrices, improving performance, and adding tests. All these tasks are related to the main goal of implementing SimpleGMRES.

Testing 👎

Testing details are missing
The description does not mention how the author tested the changes. It would be helpful for the author to include information about the testing methodology, such as the specific test cases used, the platforms or environments tested on, and any relevant test results or observations.

Documentation 👎

Docstrings are missing for several functions, classes, or methods
The following newly added functions, classes, or methods do not have docstrings:
  • LinearSolveBlockDiagonalsExt
  • LinearSolveNNlibExt
  • SimpleGMRES
  • SimpleGMRESCache
  • init_cacheval
  • _init_cacheval
  • default_alias_A
  • default_alias_b
  • SciMLBase.solve! (for SimpleGMRESCache)
  • SciMLBase.solve! (for SimpleGMRES)

These entities should have docstrings added to describe their behavior, arguments, and return values.

Suggested Changes

  • Please add docstrings to the functions, classes, or methods listed above. Each docstring should describe the behavior, arguments, and return values of the function, class, or method.
  • Please provide details about how you tested the changes. This could include the specific test cases you used, the platforms or environments you tested on, and any relevant test results or observations.
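
To make the docstring request concrete, a docstring for the solver entry point could look something like the sketch below (the wording and defaults are assumed; the `restart` and `blocksize` keywords are the ones used elsewhere in this thread):

"""
    SimpleGMRES(; restart = 20, blocksize = nothing)

Native Julia implementation of the restarted GMRES method for `LinearProblem`s.

  - `restart`: number of Krylov basis vectors to build before restarting.
  - `blocksize`: size of the uniform square diagonal blocks when the operator is
    block-diagonal; enables a faster batched code path. Leave as `nothing` for
    general operators.
"""
SimpleGMRES  # docstring attached to the existing binding (sketch only)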

Reviewed with AI Maintainer

@codecov (bot) commented Aug 22, 2023

Codecov Report

Merging #366 (38e3401) into main (60ae26a) will increase coverage by 30.51%.
The diff coverage is 85.08%.

@@             Coverage Diff             @@
##             main     #366       +/-   ##
===========================================
+ Coverage   41.83%   72.35%   +30.51%     
===========================================
  Files          20       23        +3     
  Lines        1446     1787      +341     
===========================================
+ Hits          605     1293      +688     
+ Misses        841      494      -347     
Files Changed                               Coverage            Δ
ext/LinearSolveKernelAbstractionsExt.jl      0.00% <0.00%>      (ø)
ext/LinearSolveBlockDiagonalsExt.jl         81.81% <81.81%>     (ø)
src/simplegmres.jl                          88.25% <88.25%>     (ø)
src/LinearSolve.jl                          93.84% <100.00%>    (-1.40%) ⬇️
src/iterative_wrappers.jl                   72.78% <100.00%>    (+16.41%) ⬆️

... and 14 files with indirect coverage changes


@avik-pal (Member Author) commented Aug 23, 2023

Simple Benchmark Script

using LinearAlgebra, LinearSolve, SciMLOperators
using BlockDiagonals, NNlib
using BenchmarkTools  # needed for @belapsed below
using Plots

blocksizes_ = [1, 2, 4, 5, 10, 25, 50]
nblocks = 100 .÷ blocksizes_  # operator assumed: keeps every system at 100 × 100 total
simplegmres_timings = Vector{Float64}(undef, length(blocksizes_))
simplegmres_nobsize_timings = Vector{Float64}(undef, length(blocksizes_))
krylovjlgmres_timings = Vector{Float64}(undef, length(blocksizes_))
for (i, (bsize, N)) in enumerate(zip(blocksizes_, nblocks))
    A_ = Array(BlockDiagonal([randn(bsize, bsize) for _ in 1:N]))
    b_ = randn(size(A_, 1))

    lincache_ = init(LinearProblem(A_, b_), SimpleGMRES(; blocksize=bsize, restart = 20);
        maxiters = 1000)
    lincache_krylov_ = init(LinearProblem(A_, b_), KrylovJL_GMRES(; gmres_restart = 20);
        maxiters = 1000)
    lincache_simple_nobsize = init(LinearProblem(A_, b_), SimpleGMRES(; restart = 20);
        maxiters = 1000)

    simplegmres_timings[i] = @belapsed solve!($(deepcopy(lincache_)))
    krylovjlgmres_timings[i] = @belapsed solve!($(deepcopy(lincache_krylov_)))
    simplegmres_nobsize_timings[i] = @belapsed solve!($(deepcopy(lincache_simple_nobsize)))

    @info "Trial $i" bsize=bsize N=N SimpleGMRES=simplegmres_timings[i] KrylovJL_GMRES=krylovjlgmres_timings[i] SimpleGMRES_nobsize=simplegmres_nobsize_timings[i]
end

begin
    plt = plot(blocksizes_, simplegmres_timings, label = "SimpleGMRES", xlabel = "Block Size",
        ylabel = "Time (s)", title = "SimpleGMRES vs KrylovJL_GMRES", legend = :topleft,
        xscale = :log10, yscale = :log10)
    scatter!(plt, blocksizes_, simplegmres_timings; label = "")
    plot!(plt, blocksizes_, krylovjlgmres_timings, label = "KrylovJL_GMRES")
    scatter!(plt, blocksizes_, krylovjlgmres_timings; label = "")
    plot!(plt, blocksizes_, simplegmres_nobsize_timings, label = "SimpleGMRES (no blocksize)")
    scatter!(plt, blocksizes_, simplegmres_nobsize_timings; label = "")
    savefig(plt, "wip/simplegmres_vs_krylovjl_gmres.png")
end

[Plot: simplegmres_vs_krylovjl_gmres.png — timings for SimpleGMRES, SimpleGMRES (no blocksize), and KrylovJL_GMRES vs. block size]

Takeaways

  • We should exploit the structure: it significantly brings down the computational complexity and should give a better convergence rate (roughly that of a single blocksize × blocksize block).
  • The SimpleGMRES code is pathetically slow compared to KrylovJL_GMRES.

@avik-pal (Member Author):

[Plots: simplegmres_vs_krylovjl_gmres_100 and simplegmres_vs_krylovjl_gmres_1000]

@avik-pal avik-pal changed the title [WIP] Add SimpleGMRES implementation Add SimpleGMRES implementation Aug 25, 2023
@avik-pal (Member Author):

The non-uniform blocks handling is a bit more finicky (not hard, just a bit irritating to handle). Unless we have a concrete use case for it, we can hold off and implement it in a later PR.

SimpleGMRES is mostly a copy of KrylovJL_GMRES, but we should hold off on using it as a default anywhere until we have battle-tested it in a few places.

I will follow up with benchmarks in DEQs.jl and SciMLSensitivity.jl; my initial runs suggest massive improvements in backpropagation (simply due to fewer iterations, and hence fewer VJPs computed).

Again, we can't default to the blocksize version. It would make code really fast for 99% of users, but for the 1% (say, someone using BatchNorm) it would give incorrect results. So I will add a doc entry in SteadyStateAdjoint encouraging its use only if the user is absolutely sure of what is happening here.

@avik-pal (Member Author):

(Don't know why I can't seem to request reviews)

@ChrisRackauckas, this is ready for review

Comment on lines +280 to +294
# Update the QR factorization of Hₖ₊₁.ₖ.
# Apply previous Givens reflections Ωᵢ.
# [cᵢ  sᵢ] [ r̄ᵢ.ₖ ] = [ rᵢ.ₖ ]
# [s̄ᵢ -cᵢ] [rᵢ₊₁.ₖ]   [r̄ᵢ₊₁.ₖ]
for i in 1:(inner_iter - 1)
    Rtmp = c[i] * R[nr + i] + s[i] * R[nr + i + 1]
    R[nr + i + 1] = conj(s[i]) * R[nr + i] - c[i] * R[nr + i + 1]
    R[nr + i] = Rtmp
end

# Compute and apply current Givens reflection Ωₖ.
# [cₖ  sₖ] [ r̄ₖ.ₖ ] = [rₖ.ₖ]
# [s̄ₖ -cₖ] [hₖ₊₁.ₖ]   [ 0  ]
(c[inner_iter], s[inner_iter], R[nr + inner_iter]) = _sym_givens(R[nr + inner_iter],
    Hbis)
Member:

Aren't there Base functions for this?

Member Author:

Seems like there are. I have opened an issue (#375) to track those.

Contributor:

@avik-pal This is a 2x2 Givens reflection and not a 2x2 Givens rotation.
They are not the same thing.
The advantage of the reflections is that they are Hermitian, so it is easier to perform products with Q'.
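
For reference, a standalone real-valued sketch of such a 2x2 symmetric Givens reflection (the (c, s, ρ) output convention mirrors the _sym_givens call in the excerpt above; the function name and the zero-input convention are assumptions):

# The reflection [c s; s -c] is symmetric and orthogonal and maps (a, b) to (ρ, 0).
function sym_givens_real(a::T, b::T) where {T <: AbstractFloat}
    ρ = hypot(a, b)                          # length of the (a, b) vector
    iszero(ρ) && return one(T), zero(T), ρ   # assumed convention for a = b = 0
    return a / ρ, b / ρ, ρ
end

c, s, ρ = sym_givens_real(3.0, 4.0)
@assert c * 3.0 + s * 4.0 ≈ ρ                # first row gives ρ
@assert abs(s * 3.0 - c * 4.0) < 1e-12       # second row annihilates b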

@ChrisRackauckas (Member):

It's at least good enough to merge. I think some things could be simplified but that can be done over time.

@ChrisRackauckas merged commit ac2b5aa into SciML:main on Sep 20, 2023
15 of 17 checks passed
@avik-pal deleted the ap/simplegmres branch on September 20, 2023, 20:30
@amontoison (Contributor) commented Oct 31, 2023

@ChrisRackauckas @avik-pal
Why not just implement an efficient linear operator to perform your matrix-vector products?
You can just create a structure and apply a dense mat-vec product (BLAS gemm) with each block.
You just need to define LinearAlgebra.mul! for your structure.
You can also apply each mat-vec in parallel.
We have been doing that for years in Krylov.jl...
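
For concreteness, here is a minimal sketch of the kind of operator being suggested (the BlockDiagOperator name and the N × N × M storage layout are illustrative assumptions; a Krylov solver only needs mul!, size, and eltype defined for the structure):

using LinearAlgebra

# Hypothetical operator for M uniform N×N diagonal blocks, stored as an N×N×M array.
struct BlockDiagOperator{T, A <: AbstractArray{T, 3}}
    blocks::A   # blocks[:, :, k] is the k-th diagonal block
end

Base.size(op::BlockDiagOperator) = (n = size(op.blocks, 1) * size(op.blocks, 3); (n, n))
Base.size(op::BlockDiagOperator, i::Int) = size(op)[i]
Base.eltype(::BlockDiagOperator{T}) where {T} = T

function LinearAlgebra.mul!(y::AbstractVector, op::BlockDiagOperator, x::AbstractVector)
    N, _, M = size(op.blocks)
    # One dense gemv per block; these could also be batched or run in parallel.
    for k in 1:M
        idx = ((k - 1) * N + 1):(k * N)
        mul!(view(y, idx), view(op.blocks, :, :, k), view(x, idx))
    end
    return y
end

# Usage: A acts like the dense block-diagonal matrix without ever forming it.
A = BlockDiagOperator(randn(8, 8, 100))
x = randn(800)
y = similar(x)
mul!(y, A, x)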

@ChrisRackauckas (Member):

That's fine, but it requires the user to do that. Sometimes doing it for the user is simpler and removes restrictions.

@avik-pal (Member Author):

@amontoison I was testing whether we can use an efficient linear operator to get rid of this. I have a UniformBlockDiagonalMatrix which internally stores an N x N x M array, representing M blocks of size N x N on the diagonal. Even with a really fast matrix multiply (I call the CUBLAS batched gemm routines), the problem arises from the dot operation, which needs to perform a much larger reduction than necessary. See this flamegraph:

[Flamegraph image]

julia> @benchmark CUDA.@sync solve(prob1, KrylovJL_GMRES())
BenchmarkTools.Trial: 11 samples with 1 evaluation.
 Range (min … max):  462.166 ms … 480.612 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     476.150 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   473.402 ms ±   6.727 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁     ▁            ▁▁             ▁           ▁ ▁   ▁      █▁  
  █▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁█▁▁▁█▁▁▁▁▁▁██ ▁
  462 ms           Histogram: frequency by time          481 ms <

 Memory estimate: 1.25 MiB, allocs estimate: 37978.

julia> @benchmark CUDA.@sync solve(prob2, KrylovJL_GMRES())
BenchmarkTools.Trial: 11 samples with 1 evaluation.
 Range (min … max):  467.924 ms … 488.971 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     474.196 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   474.975 ms ±   6.809 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁█ ▁         ▁    ▁▁      ▁           ▁ ▁                   ▁  
  ██▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁██▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  468 ms           Histogram: frequency by time          489 ms <

 Memory estimate: 1.09 MiB, allocs estimate: 36442.

julia> @benchmark CUDA.@sync solve(prob2, SimpleGMRES(; blocksize = size(A1.data, 1)))
BenchmarkTools.Trial: 27 samples with 1 evaluation.
 Range (min … max):  143.097 ms … 641.160 ms  ┊ GC (min … max): 0.00% … 8.66%
 Time  (median):     152.867 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   186.930 ms ± 125.551 ms  ┊ GC (mean ± σ):  2.32% ± 2.52%

  █▆                                                             
  ██▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃ ▁
  143 ms           Histogram: frequency by time          641 ms <

 Memory estimate: 10.75 MiB, allocs estimate: 289135.

prob1 is constructed with the operator, and prob2 with the equivalent Matrix form.

@amontoison (Contributor) commented Mar 11, 2024

@avik-pal
Can you display the output of CUDA.@profile solve(prob1, KrylovJL_GMRES()) and with your SimpleGMRES?

I don't understand why the dot is different between your variant and mine; do you use your own kernel?
The linear system has the same size, so the time spent in dot products should be similar with the generic dispatch.

mul!(w, A, p) # w ← ANvₖ
PlisI || ldiv!(q, Pl, w) # q ← MANvₖ
for i in 1:inner_iter
    sum!(R[nr + i]', __batch(V[i]) .* __batch(q))
Member Author:

@amontoison see this line. For the batched case (N x B input size), the dot only needs to reduce to 1 x B instead of reducing to a scalar.
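
To illustrate the difference in plain Julia (a sketch, not the actual kernel): with the batched N × B layout each of the B columns only needs its own length-N reduction, whereas flattening the batch into one long vector forces a single reduction over all N*B entries.

using LinearAlgebra

N, B = 8, 1000              # block size × number of batched systems
V = randn(N, B)             # one Krylov basis vector per column
q = randn(N, B)

# Generic path: the batch is treated as one long vector, so the dot product
# reduces all N*B entries to a single scalar.
full = dot(vec(V), vec(q))

# Batched path: reduce each column independently, giving a 1 × B row of
# coefficients -- one Hessenberg entry per system in the batch.
batched = sum(V .* q; dims = 1)

@assert sum(batched) ≈ full   # the batched reductions partition the full one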

@avik-pal (Member Author):

> Can you display the output of CUDA.@profile solve(prob1, KrylovJL_GMRES()) and with your SimpleGMRES?

I will share the profile in a couple of hours.

@avik-pal (Member Author) commented Mar 11, 2024

julia> CUDA.@profile solve(prob2, SimpleGMRES(; blocksize=8))
Profiler ran for 5.86 ms, capturing 5269 events.

Host-side activity: calling CUDA APIs took 3.58 ms (61.09% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────────┬─────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                       │ Name                        │
├──────────┼────────────┼───────┼─────────────────────────────────────────┼─────────────────────────────┤
│   25.14% │    1.47 ms │   469 │   3.14 µs ± 26.0    (  0.72 … 535.73)   │ cuMemAllocFromPoolAsync     │
│   16.65% │  976.32 µs │   417 │   2.34 µs ± 0.96    (  1.67 … 14.07)    │ cuLaunchKernel              │
│    5.96% │  349.28 µs │     2 │ 174.64 µs ± 240.57  (  4.53 … 344.75)   │ cuMemcpyDtoDAsync           │
│    3.07% │  179.77 µs │    24 │   7.49 µs ± 0.91    (   6.2 … 10.01)    │ cuMemcpyDtoHAsync           │
│    1.32% │   77.25 µs │     2 │  38.62 µs ± 43.5    (  7.87 … 69.38)    │ cudaMemcpyAsync             │
│    0.56% │    32.9 µs │    11 │   2.99 µs ± 1.13    (  1.91 … 5.01)     │ cudaLaunchKernel            │
│    0.50% │   29.09 µs │    48 │ 605.98 ns ± 230.47  (238.42 … 1192.09)  │ cuStreamSynchronize         │
│    0.11% │    6.68 µs │     1 │                                         │ cuMemsetD32Async            │
│    0.07% │    4.29 µs │     1 │                                         │ cudaFuncGetAttributes       │
│    0.04% │    2.62 µs │    14 │ 187.33 ns ± 166.72  (   0.0 … 476.84)   │ cudaGetLastError            │
│    0.04% │    2.15 µs │     1 │                                         │ cudaStreamSynchronize       │
│    0.02% │    1.43 µs │     5 │  286.1 ns ± 261.17  (   0.0 … 715.26)   │ cudaStreamGetCaptureInfo_v2 │
│    0.02% │    1.43 µs │     1 │                                         │ cudaEventRecord             │
│    0.01% │  476.84 ns │     4 │ 119.21 ns ± 137.65  (   0.0 … 238.42)   │ cuDeviceGet                 │
└──────────┴────────────┴───────┴─────────────────────────────────────────┴─────────────────────────────┘

Device-side activity: GPU was busy for 679.49 µs (11.59% of the trace)
[Kernel table: the time is spread across many small launches (~1–2.5 µs each) of broadcast_kernel, partial_mapreduce_grid, and the gpu___fast_sym_givens_kernel, plus a handful of cuBLAS gemv/nrm2/axpy calls; no single kernel exceeds 1.4% of the trace.]
julia> CUDA.@profile solve(prob2, KrylovJL_GMRES())
Profiler ran for 672.18 ms, capturing 767034 events.

Host-side activity: calling CUDA APIs took 573.04 ms (85.25% of the trace)
┌──────────┬────────────┬────────┬───────────────────────────────────────────────┬─────────────────────────────┐
│ Time (%) │ Total time │  Calls │ Time distribution                             │ Name                        │
├──────────┼────────────┼────────┼───────────────────────────────────────────────┼─────────────────────────────┤
│   32.08% │  215.65 ms │  33412 │   6.45 µs ± 1.86      (  1.91 … 108.48)       │ cudaMemcpyAsync             │
│   30.94% │  207.96 ms │  99716 │   2.09 µs ± 10.1      (  1.43 … 2249.96)      │ cudaLaunchKernel            │
│    7.66% │   51.47 ms │  33154 │   1.55 µs ± 0.94      (  1.19 … 113.49)       │ cudaFuncGetAttributes       │
│    4.98% │   33.45 ms │ 165770 │ 201.78 ns ± 164.18    (   0.0 … 12397.77)     │ cudaStreamGetCaptureInfo_v2 │
│    3.49% │   23.49 ms │  33154 │ 708.47 ns ± 11711.15  (476.84 … 2.13193893e6) │ cudaStreamSynchronize       │
│    2.27% │   15.25 ms │ 166280 │  91.73 ns ± 5416.7    (   0.0 … 2.20823288e6) │ cudaGetLastError            │
│    1.78% │   11.93 ms │  33154 │ 359.97 ns ± 693.56    (238.42 … 117301.94)    │ cudaEventRecord             │
│    1.01% │    6.81 ms │      1 │                                               │ cuMemsetD32Async            │
│    0.16% │    1.07 ms │    262 │   4.07 µs ± 2.01      (  0.72 … 16.21)        │ cuMemAllocFromPoolAsync     │
│    0.12% │  833.75 µs │    278 │    3.0 µs ± 1.41      (  1.43 … 16.69)        │ cuLaunchKernel              │
└──────────┴────────────┴────────┴───────────────────────────────────────────────┴─────────────────────────────┘

Device-side activity: GPU was busy for 163.36 ms (24.30% of the trace)
[Kernel table: dominated by ~33k calls each to the cuBLAS dot_kernel (47.83 ms), reduce_1Block_kernel (44.64 ms), and axpy_kernel_val (37.68 ms), plus 30.74 ms of device-to-pageable copies; gemv, nrm2, and broadcast kernels account for the remainder.]

As a sanity check, both algorithms do converge:

julia> solve(prob1, KrylovJL_GMRES()).resid
0.00023247192f0

julia> solve(prob2, KrylovJL_GMRES()).resid
0.00023807214f0

julia> solve(prob2, SimpleGMRES(; blocksize=8)).resid
1.1054129f-5
