Add choice for triangular solver implementation for Ginkgo #585

fritzgoebel · 2023-01-13T14:43:40Z

This PR adds a parameter ginkgo_trisolve to select the triangular solver used in Ginkgo. The options are sparselib and syncfree, which select the vendor library and our own implementation, respectively.
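
As a rough illustration of how a string-valued option such as ginkgo_trisolve might be mapped to a solver choice, here is a minimal, self-contained sketch. The enum `TriSolve` and the helper `parse_trisolve` are hypothetical names introduced for this example; they are not taken from the HiOp or Ginkgo sources, and the actual implementation in this PR may look quite different.

```cpp
#include <string>

// Hypothetical enum for illustration only; HiOp's internal
// representation of the ginkgo_trisolve option may differ.
enum class TriSolve { SparseLib, SyncFree };

// Maps the option string to a solver choice.
// Returns false (leaving `out` untouched) for unrecognized values,
// so the caller can report an invalid option instead of guessing.
bool parse_trisolve(const std::string& value, TriSolve& out) {
  if (value == "sparselib") { out = TriSolve::SparseLib; return true; }
  if (value == "syncfree")  { out = TriSolve::SyncFree;  return true; }
  return false;
}
```

Rejecting unknown strings explicitly, rather than silently falling back to a default, makes a mistyped option value visible at setup time.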

As discussed with @pelesh it also removes the previously present refinement steps with GMRES.

I also updated the glu_experimental branch in Ginkgo to make the triangular solver choice possible, so you will need to update your CI pipelines.

pelesh · 2023-01-13T15:56:33Z

@cameronrutherford, we need to update our pipelines with the latest head from Ginkgo. See related GitLab issue.

pelesh · 2023-01-17T22:17:17Z

@fritzgoebel, does 2bcae30 require a Ginkgo version with the corresponding updates?

fritzgoebel · 2023-01-17T22:19:23Z

@pelesh no, this only changes hiop code.

barracuda156 · 2023-01-29T23:11:52Z

I also updated the glu_experimental branch in Ginkgo to make the triangular solver choice possible, so you will need to update your CI pipelines.

@fritzgoebel Could this by chance be addressed? ginkgo-project/ginkgo#1258

fritzgoebel · 2023-01-30T21:39:59Z

The glu_experimental branch in Ginkgo is experimental and we are working on replacing it with develop. But until we have all necessary functionality merged, I'm sorry to have to disappoint you.

barracuda156 · 2023-01-30T21:47:51Z

As long as that gets fixed in develop, all is good.

fritzgoebel · 2023-01-30T21:52:32Z

This will surely happen, but since the error seems to originate in third-party code that we used as a starting point for this effort and will not rely on in develop, we do not plan to fix it in this particular branch.

pelesh

This branch is well tested and ready to merge. It might be a good idea to get the updated version of Ginkgo on all CI pipelines first.

cameronrutherford · 2023-02-08T21:14:20Z

Pushed updated modules, hopefully tests pass

pelesh · 2023-02-08T22:21:21Z

This is the error on Marianas. The convex problem is failing with the error below, but the non-convex test case works fine.

@cnpetra @nychiang do you have any idea what is going on here?

22: Test command: /share/apps/openmpi/4.1.0mlx5.0/gcc/10.2.0/bin/mpirun "-n" "1" "/people/svcexasgd/gitlab/113026/build/src/Drivers/Sparse/NlpSparseEx1.exe" "500" "-ginkgo_cuda" "-selfcheck"
22: Test timeout computed to be: 10000000
22: [1675891897.306782] [dl09:11014:0]       ib_iface.c:665  UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory
22: [dl09.local:11014] pml_ucx.c:273  Error: Failed to create UCP worker
22: ===============
22: Hiop SOLVER
22: ===============
22: Using 1 MPI ranks.
22: ---------------
22: Problem Summary
22: ---------------
22: Total number of variables: 500
22:      lower/upper/lower_and_upper bounds: 499 / 1 / 1
22: Total number of equality constraints: 1
22: Total number of inequality constraints: 498
22:      lower/upper/lower_and_upper bounds: 498 / 497 / 497
22: LSQ linear solver --- KKT_SPARSE_XDYcYd linsys: MA57 size 1497 cons 499 nnz 3991 (option 'duals_init_linear_solver_sparse' 'auto')
22: iter    objective     inf_pr     inf_du   lg(mu)  alpha_du   alpha_pr linesrch
22:    0  7.6705009e+00 9.980e+00  1.118e+00  -1.00  0.000e+00  0.000e+00  -(-)
22: Setting up Ginkgo solver ... 
22:    1  7.4398508e+00 9.733e+00  9.690e+00  -1.00  2.424e-03  2.474e-02  1(s)
22:    2  5.6597411e+01 3.144e-12  7.214e+01  -1.00  1.230e-02  1.000e+00  1(s)
22:    3  5.3831008e+01 2.752e-12  5.103e+01  -1.00  9.011e-01  1.246e-01  1(f)
22:    4  1.1479960e-01 1.084e-12  5.715e+01  -1.00  1.354e-01  6.046e-01  1(f)
22:    5  6.5161913e+00 5.418e-13  2.875e+01  -1.00  4.395e-02  5.000e-01  2(f)
22:    6  7.7841332e+00 1.563e-13  5.185e+00  -1.00  9.927e-01  7.884e-01  1(f)
22:    7  7.8320269e+00 1.368e-13  3.665e+00  -1.00  1.000e+00  1.250e-01  4(f)
22:    8  8.3081034e+00 8.882e-16  2.619e-02  -1.00  1.000e+00  1.000e+00  1(f)
22:    9  3.3051737e+00 8.882e-16  2.171e+00  -2.55  7.838e-01  1.000e+00  1(f)
22: iter    objective     inf_pr     inf_du   lg(mu)  alpha_du   alpha_pr linesrch
22:   10  1.2846460e+00 1.776e-15  7.038e-01  -2.55  6.203e-02  1.000e+00  1(f)
22: [Warning] BiCGStab did NOT converged after 9 iters. The solution from iter 9 was returned.
22: 	 - Error code 1
22: 	 - Abs res=-nann	 - Rel res=-nan
22: 	 - ||rhs||_2=1.14224   ||sol||_2=nan
22: [Warning] Requesting additional accuracy and stability from the KKT linear system at iteration 10 (safe mode ON) [2]
22: [Warning] BiCGStab did NOT converged after 9 iters. The solution from iter 9 was returned.
22: 	 - Error code 1
22: 	 - Abs res=nann	 - Rel res=nan
22: 	 - ||rhs||_2=nan   ||sol||_2=-nan
22: Minimum step size reached. The problem may be locally infeasible or the gradient inaccurate. Will try to restore feasibility.
22: NlpSparseEx1.exe: /people/svcexasgd/gitlab/113026/src/Optimization/hiopIterate.cpp:380: virtual bool hiop::hiopIterate::takeStep_duals(const hiop::hiopIterate&, const hiop::hiopIterate&, const double&, const double&): Assertion `zl->matchesPattern(nlp->get_ixl())' failed.
22: [dl09:11014] *** Process received signal ***

nychiang · 2023-02-08T22:38:47Z

@pelesh I tried this problem with MA57 and it seems the outer iterative refinement is not required.
It seems to me that Ginkgo requires iterative refinement and something returns NaN.
Can you set verbosity_level to 7 and show me the log file?
In addition, can you set ir_outer_maxit to 0 and rerun the failed case?

cameronrutherford · 2023-02-08T22:43:11Z

Fixed the newell variables file and CI should at least run there now.

pelesh · 2023-02-09T00:24:03Z

Tests pass with IR turned off. I don't have access to the Marianas and Newell machines, so I'll try to reproduce on my machine. BTW, @nychiang, are you sure HiOp sparse tests don't give you false positives?

nychiang · 2023-02-09T01:11:59Z

I think it happens in this example:

22: iter    objective     inf_pr     inf_du   lg(mu)  alpha_du   alpha_pr linesrch
22:   10  1.2846460e+00 8.882e-16  7.038e-01  -2.55  6.203e-02  1.000e+00  1(f)
22: [Warning] Requesting additional accuracy and stability from the KKT linear system at iteration 10 (safe mode ON) [2]
22: Minimum step size reached. The problem may be locally infeasible or the gradient inaccurate. Will try to restore feasibility.
22:   11            nan 0.000e+00  0.000e+00  -2.55  1.000e+00  5.551e-17  0(R)
22: Successfull termination.
22: Total time 4.836s  
22: Hiop internal time:     total 4.834s      avg iter 0.439s  
22:     internal total std dev across ranks 0.000 percent
22: Fcn/deriv time:     total=0.003s  ( obj=0.001 grad=0.000 cons=0.001 Jac=0.000 Hess=0.000) 
22:     Fcn/deriv total std dev across ranks 0.000 percent
22: Fcn/deriv #: obj 125 grad 12 eq cons 126 ineq cons 126 eq Jac 12 ineq Jac 12
22: Total KKT time 4.851s  
22: 	update init 0.000s     update linsys 0.001s     fact 4.808s  
22: 	solve rhs-manip 0.001s    inner solve 0.013s    resid 0.007s    IR 0.000iters  
22: 
22: selfcheck success (6 digits)
22/43 Test #22: NlpSparse1_6 ......................   Passed    5.39 sec

I think this is a false positive, as hiop converges with inf_pr=inf_du=0.
You can add nlp.options->SetIntegerValue("verbosity_level", 7); to this problem and let it print all the information in the pipeline.

@pelesh @cnpetra

cnpetra · 2023-02-09T17:27:32Z

[edited-removed comment]

cnpetra · 2023-02-09T17:32:06Z

@nychiang is this for MA57?

nychiang · 2023-02-09T18:12:14Z

@nychiang is this for MA57?
No. That is from PNNL CI, in which @pelesh runs the convex problem with Ginkgo and without outer IR. I think this is a corner case where Ginkgo fails on the feasibility restoration (FR) problem (probably in the first iteration) and returns NaN with infeas_pr=infeas_du=0 (the initial values). @cnpetra

pelesh · 2023-02-09T19:04:51Z

@nychiang, @cnpetra: The verbose output is here.

@fritzgoebel, please take a look, as well.

cnpetra · 2023-02-09T19:30:44Z

Judging from the detailed output, HiOp enters FR because of the nans.

Very likely the linear solve fails and gives a NaN search direction, since the NaNs first appear when the residual is computed. I would say that one of the triangular solves fails. BiCGStab IR may make this issue more likely to appear, since it does multiple backsolves (one with each triangular factor) and does so for progressively smaller right-hand sides. @fritzgoebel
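
For illustration, one generic way an unguarded triangular solve can emit NaN is a zero diagonal entry meeting a zero numerator. The sketch below is a plain dense forward substitution written for this example; the names `forward_solve` and `all_finite` are hypothetical, and nothing here is claimed to reproduce what the cuSPARSE or Ginkgo kernels actually do in this failure.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Solves L x = b for a dense lower-triangular L stored row-major.
// Deliberately performs no pivot check, to show how a zero diagonal
// entry propagates into the solution.
std::vector<double> forward_solve(const std::vector<double>& L,
                                  const std::vector<double>& b) {
  const std::size_t n = b.size();
  assert(L.size() == n * n);
  std::vector<double> x(n);
  for (std::size_t i = 0; i < n; ++i) {
    double s = b[i];
    for (std::size_t j = 0; j < i; ++j) {
      s -= L[i * n + j] * x[j];
    }
    x[i] = s / L[i * n + i];  // 0.0 / 0.0 yields NaN under IEEE 754
  }
  return x;
}

// Returns true only if every entry is finite (neither NaN nor infinity).
bool all_finite(const std::vector<double>& v) {
  for (double e : v) {
    if (!std::isfinite(e)) return false;
  }
  return true;
}
```

Calling `all_finite` on the result right after each backsolve is a cheap way to locate the first solve that goes wrong, before the NaN reaches the residual computation.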

pelesh · 2023-02-09T19:43:25Z

BiCGStab IR may make it more likely for this issue to appear since it does multiple backsolves (one with each triangular factor) and it does it for decreasingly small right-hand sides

Actually, it is the comparison with NaN that gives the false positive (see #594). IR seems to keep the numbers finite and then fails once the maximum number of iterations is exceeded, thus avoiding the comparison with NaNs.
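
The mechanism can be shown in a few lines: under IEEE 754 every ordered comparison involving NaN evaluates to false, so a check that fails only when `err > tol` lets a NaN error through. This is a minimal sketch with hypothetical function names (`naive_check`, `robust_check`); it is not the actual HiOp selfcheck code, which is tracked in #594.

```cpp
#include <cmath>

// Naive check: reports failure only when the error exceeds the tolerance.
// Because (NaN > tol) is false, a NaN error is reported as a pass.
bool naive_check(double err, double tol) {
  if (err > tol) return false;
  return true;
}

// Safer check: passes only for a finite error within the tolerance,
// so NaN and infinity are both rejected.
bool robust_check(double err, double tol) {
  return std::isfinite(err) && err <= tol;
}
```

With a NaN error, `naive_check` returns true while `robust_check` returns false. Phrasing the test as "pass only if finite and within tolerance" avoids relying on how NaN behaves in the failure branch.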

cnpetra · 2023-02-10T19:29:21Z

My comment was not about the false positive issue but about the source of the NaNs. I just talked to @nychiang and we're pretty sure it originates in the linear solver.

fritzgoebel · 2023-02-10T21:15:29Z

I rebased onto develop and changed the triangular solver choice to default to the Ginkgo implementation. It still seems to need some regularization but, unlike the cuSPARSE solver, at least does not generate NaNs.
Please confirm whether this fixes the issue.

pelesh · 2023-02-15T19:57:56Z

@fritzgoebel: You need to rebase this branch to trigger newly configured CI pipelines. The old ones don't work anymore. Once you do that and tests pass, we can merge this PR.

pelesh

All CI pipelines pass. This branch is well tested (it was used for recent HiOp profiling) and it is safe to merge.

BUILD.sh

src/Utils/hiopOptions.cpp

src/LinAlg/hiopLinSolverSparseGinkgo.cpp

fritzgoebel requested review from pelesh and nkoukpaizan January 13, 2023 14:43

pelesh mentioned this pull request Jan 29, 2023

Build with ginkgo back-end fails #586

Open

pelesh reviewed Feb 3, 2023

pelesh mentioned this pull request Feb 9, 2023

False positives in HiOp functionality tests #594

Open

fritzgoebel force-pushed the ginkgo_trisolve_choice branch from b0a8b1f to b68ab6e Compare February 10, 2023 21:12

fritzgoebel and others added 2 commits February 15, 2023 15:29

Add choice for triangular solver implementation for Ginkgo, remove iterative refinement inside Ginkgo integration

1e0cabd

Add options to configure Ginkgo GMRES

a6bcc7c

cameronrutherford and others added 8 commits February 15, 2023 15:29

Update ginkgo install on Newell and Deception.

b328027

Remove verbose output from build script and fix newell.

f38c9a7

Turn of outer iterative refinement with Ginkgo

ed8d18d

Increase verbosity for Ginkgo CUDA test and turning off non-sparse tests temporarily.

aecf61c

Remove nvblas.conf from tracked files

ec99cf1

Update GINKGO on Ascent

854eadc

Use Ginkgo triangular solver as default instead of vendor implementation

be6ac6c

Revert temporary changes made for debugging.

ee374e5

fritzgoebel force-pushed the ginkgo_trisolve_choice branch from 592c712 to ee374e5 Compare February 15, 2023 20:29

pelesh approved these changes Feb 15, 2023

Used gcc/9.1.0-based GINKGO module on Ascent for consistency

aa40705

nkoukpaizan approved these changes Feb 17, 2023

cnpetra self-requested a review February 17, 2023 16:59

cnpetra requested changes Feb 17, 2023

Address Review Comments

9e09c99

cnpetra approved these changes Feb 17, 2023

cnpetra merged commit fc91d88 into develop Feb 20, 2023
