Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add choice for triangular solver implementation for Ginkgo #585

Merged
merged 12 commits into from
Feb 20, 2023

Conversation

fritzgoebel
Copy link
Contributor

@fritzgoebel fritzgoebel commented Jan 13, 2023

This PR adds a parameter ginkgo_trisolve to select the triangular solver used in Ginkgo. The options are sparselib and syncfree to distinguish between the vendor library and our own implementation.

As discussed with @pelesh it also removes the previously present refinement steps with GMRES.

I also updated the glu_experimental branch in Ginkgo to make the triangular solver choice possible, so you will need to update your CI pipelines.

@pelesh
Copy link
Collaborator

pelesh commented Jan 13, 2023

@cameronrutherford, we need to update our pipelines with the latest head from Ginkgo. See related GitLab issue.

@pelesh
Copy link
Collaborator

pelesh commented Jan 17, 2023

@fritzgoebel, does 2bcae30 requires Ginkgo version with corresponding updates?

@fritzgoebel
Copy link
Contributor Author

@pelesh no, this only changes hiop code.

@barracuda156
Copy link

I also updated the glu_experimental branch in Ginkgo to make the triangular solver choice possible, so you will need to update your CI pipelines.

@fritzgoebel Could this by chance be addressed? ginkgo-project/ginkgo#1258

@fritzgoebel
Copy link
Contributor Author

The glu_experimental branch in Ginkgo is experimental and we are working on replacing it with develop. But until we have all necessary functionality merged, I'm sorry to have to disappoint you.

@barracuda156
Copy link

The glu_experimental branch in Ginkgo is experimental and we are working on replacing it with develop. But until we have all necessary functionality merged, I'm sorry to have to disappoint you.

As long as that gets fixed in develop, all is good.

@fritzgoebel
Copy link
Contributor Author

This will sure happen, but as the error seems to originate in third-party code we used as a starting point for this effort but will not rely on in develop, we do not plan on fixing it in this particular branch.

Copy link
Collaborator

@pelesh pelesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This branch is well tested and ready to merge. It might be a good idea to get the updated version of Ginkgo on all CI pipelines first.

@cameronrutherford
Copy link
Collaborator

Pushed updated modules, hopefully tests pass

@pelesh
Copy link
Collaborator

pelesh commented Feb 8, 2023

This is the error on Marianas. The convex problem is failing with the error below, but the non-convex test case works fine.

@cnpetra @nychiang do you have any suggestions what is going on here?

22: Test command: /share/apps/openmpi/4.1.0mlx5.0/gcc/10.2.0/bin/mpirun "-n" "1" "/people/svcexasgd/gitlab/113026/build/src/Drivers/Sparse/NlpSparseEx1.exe" "500" "-ginkgo_cuda" "-selfcheck"
22: Test timeout computed to be: 10000000
22: [1675891897.306782] [dl09:11014:0]       ib_iface.c:665  UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory
22: [dl09.local:11014] pml_ucx.c:273  Error: Failed to create UCP worker
22: ===============
22: Hiop SOLVER
22: ===============
22: Using 1 MPI ranks.
22: ---------------
22: Problem Summary
22: ---------------
22: Total number of variables: 500
22:      lower/upper/lower_and_upper bounds: 499 / 1 / 1
22: Total number of equality constraints: 1
22: Total number of inequality constraints: 498
22:      lower/upper/lower_and_upper bounds: 498 / 497 / 497
22: LSQ linear solver --- KKT_SPARSE_XDYcYd linsys: MA57 size 1497 cons 499 nnz 3991 (option 'duals_init_linear_solver_sparse' 'auto')
22: iter    objective     inf_pr     inf_du   lg(mu)  alpha_du   alpha_pr linesrch
22:    0  7.6705009e+00 9.980e+00  1.118e+00  -1.00  0.000e+00  0.000e+00  -(-)
22: Setting up Ginkgo solver ... 
22:    1  7.4398508e+00 9.733e+00  9.690e+00  -1.00  2.424e-03  2.474e-02  1(s)
22:    2  5.6597411e+01 3.144e-12  7.214e+01  -1.00  1.230e-02  1.000e+00  1(s)
22:    3  5.3831008e+01 2.752e-12  5.103e+01  -1.00  9.011e-01  1.246e-01  1(f)
22:    4  1.1479960e-01 1.084e-12  5.715e+01  -1.00  1.354e-01  6.046e-01  1(f)
22:    5  6.5161913e+00 5.418e-13  2.875e+01  -1.00  4.395e-02  5.000e-01  2(f)
22:    6  7.7841332e+00 1.563e-13  5.185e+00  -1.00  9.927e-01  7.884e-01  1(f)
22:    7  7.8320269e+00 1.368e-13  3.665e+00  -1.00  1.000e+00  1.250e-01  4(f)
22:    8  8.3081034e+00 8.882e-16  2.619e-02  -1.00  1.000e+00  1.000e+00  1(f)
22:    9  3.3051737e+00 8.882e-16  2.171e+00  -2.55  7.838e-01  1.000e+00  1(f)
22: iter    objective     inf_pr     inf_du   lg(mu)  alpha_du   alpha_pr linesrch
22:   10  1.2846460e+00 1.776e-15  7.038e-01  -2.55  6.203e-02  1.000e+00  1(f)
22: [Warning] BiCGStab did NOT converged after 9 iters. The solution from iter 9 was returned.
22: 	 - Error code 1
22: 	 - Abs res=-nann	 - Rel res=-nan
22: 	 - ||rhs||_2=1.14224   ||sol||_2=nan
22: [Warning] Requesting additional accuracy and stability from the KKT linear system at iteration 10 (safe mode ON) [2]
22: [Warning] BiCGStab did NOT converged after 9 iters. The solution from iter 9 was returned.
22: 	 - Error code 1
22: 	 - Abs res=nann	 - Rel res=nan
22: 	 - ||rhs||_2=nan   ||sol||_2=-nan
22: Minimum step size reached. The problem may be locally infeasible or the gradient inaccurate. Will try to restore feasibility.
22: NlpSparseEx1.exe: /people/svcexasgd/gitlab/113026/src/Optimization/hiopIterate.cpp:380: virtual bool hiop::hiopIterate::takeStep_duals(const hiop::hiopIterate&, const hiop::hiopIterate&, const double&, const double&): Assertion `zl->matchesPattern(nlp->get_ixl())' failed.
22: [dl09:11014] *** Process received signal ***

@nychiang
Copy link
Collaborator

nychiang commented Feb 8, 2023

@pelesh I tried this problem with MA57 and it seems the outer iterative refinement is not required.
Seems to me ginkgo requires iterative refinement and something returns nan.
Can you set verbosity_level 7 and show me the log file?
In addition, can you set ir_outer_maxit 0 and rerun this failed case?

@cameronrutherford
Copy link
Collaborator

Fixed the newell variables file and CI should at least run there now.

@pelesh
Copy link
Collaborator

pelesh commented Feb 9, 2023

@pelesh I tried this problem with MA57 and it seems the outer iterative refinement is not required. Seems to me ginkgo requires iterative refinement and something returns nan. Can you set verbosity_level 7 and show me the log file? In addition, can you set ir_outer_maxit 0 and rerun this failed case?

Tests pass with IR turned off. I don't have access to marianas and newell machines; I'll try to reproduce on my machine. BTW, @nychiang, are you sure HiOp sparse tests don't give you false positives?

@nychiang
Copy link
Collaborator

nychiang commented Feb 9, 2023

Tests pass with IR turned off. I don't have access to marianas and newell machines; I'll try to reproduce on my machine. BTW, @nychiang, are you sure HiOp sparse tests don't give you false positives?

I think it happens in this example:

22: iter    objective     inf_pr     inf_du   lg(mu)  alpha_du   alpha_pr linesrch
22:   10  1.2846460e+00 8.882e-16  7.038e-01  -2.55  6.203e-02  1.000e+00  1(f)
22: [Warning] Requesting additional accuracy and stability from the KKT linear system at iteration 10 (safe mode ON) [2]
22: Minimum step size reached. The problem may be locally infeasible or the gradient inaccurate. Will try to restore feasibility.
22:   11            nan 0.000e+00  0.000e+00  -2.55  1.000e+00  5.551e-17  0(R)
22: Successfull termination.
22: Total time 4.836s  
22: Hiop internal time:     total 4.834s      avg iter 0.439s  
22:     internal total std dev across ranks 0.000 percent
22: Fcn/deriv time:     total=0.003s  ( obj=0.001 grad=0.000 cons=0.001 Jac=0.000 Hess=0.000) 
22:     Fcn/deriv total std dev across ranks 0.000 percent
22: Fcn/deriv #: obj 125 grad 12 eq cons 126 ineq cons 126 eq Jac 12 ineq Jac 12
22: Total KKT time 4.851s  
22: 	update init 0.000s     update linsys 0.001s     fact 4.808s  
22: 	solve rhs-manip 0.001s    inner solve 0.013s    resid 0.007s    IR 0.000iters  
22: 
22: selfcheck success (6 digits)
22/43 Test #22: NlpSparse1_6 ......................   Passed    5.39 sec

I think this is a false positive, as hiop converges with inf_pr=inf_du=0.
You can add nlp.options->SetIntegerValue("verbosity_level", 7); to this problem, and let it prints all the information in pipeline.

@pelesh @cnpetra

@cnpetra
Copy link
Collaborator

cnpetra commented Feb 9, 2023

@pelesh I tried this problem with MA57 and it seems the outer iterative refinement is not required. Seems to me ginkgo requires iterative refinement and something returns nan. Can you set verbosity_level 7 and show me the log file? In addition, can you set ir_outer_maxit 0 and rerun this failed case?

Tests pass with IR turned off. I don't have access to marianas and newell machines; I'll try to reproduce on my machine. BTW, @nychiang, are you sure HiOp sparse tests don't give you false positives?

[edited-removed comment]

@cnpetra
Copy link
Collaborator

cnpetra commented Feb 9, 2023

Tests pass with IR turned off. I don't have access to marianas and newell machines; I'll try to reproduce on my machine. BTW, @nychiang, are you sure HiOp sparse tests don't give you false positives?

I think it happens in this example:

22: iter    objective     inf_pr     inf_du   lg(mu)  alpha_du   alpha_pr linesrch
22:   10  1.2846460e+00 8.882e-16  7.038e-01  -2.55  6.203e-02  1.000e+00  1(f)
22: [Warning] Requesting additional accuracy and stability from the KKT linear system at iteration 10 (safe mode ON) [2]
22: Minimum step size reached. The problem may be locally infeasible or the gradient inaccurate. Will try to restore feasibility.
22:   11            nan 0.000e+00  0.000e+00  -2.55  1.000e+00  5.551e-17  0(R)
22: Successfull termination.
22: Total time 4.836s  
22: Hiop internal time:     total 4.834s      avg iter 0.439s  
22:     internal total std dev across ranks 0.000 percent
22: Fcn/deriv time:     total=0.003s  ( obj=0.001 grad=0.000 cons=0.001 Jac=0.000 Hess=0.000) 
22:     Fcn/deriv total std dev across ranks 0.000 percent
22: Fcn/deriv #: obj 125 grad 12 eq cons 126 ineq cons 126 eq Jac 12 ineq Jac 12
22: Total KKT time 4.851s  
22: 	update init 0.000s     update linsys 0.001s     fact 4.808s  
22: 	solve rhs-manip 0.001s    inner solve 0.013s    resid 0.007s    IR 0.000iters  
22: 
22: selfcheck success (6 digits)
22/43 Test #22: NlpSparse1_6 ......................   Passed    5.39 sec

I think this is a false positive, as hiop converges with inf_pr=inf_du=0. You can add nlp.options->SetIntegerValue("verbosity_level", 7); to this problem, and let it prints all the information in pipeline.

@pelesh @cnpetra

@nychiang is this for MA57?

@nychiang
Copy link
Collaborator

nychiang commented Feb 9, 2023

@nychiang is this for MA57?
No. That is from PNNL CI, in which @pelesh runs the convex problem with Ginkgo and without outer IR. I think this is a corner case, where Ginkgo fails the FR problem (probably the 1st iter) and returns NAN and infeas_pr=infeas_du=0 (initial value) . @cnpetra

@pelesh
Copy link
Collaborator

pelesh commented Feb 9, 2023

@nychiang is this for MA57?
No. That is from PNNL CI, in which @pelesh runs the convex problem with Ginkgo and without outer IR. I think this is a corner case, where Ginkgo fails the FR problem (probably the 1st iter) and returns NAN and infeas_pr=infeas_du=0 (initial value) . @cnpetra

@nychiang, @cnpetra: The verbose output is here.

@fritzgoebel, please take a look, as well.

@cnpetra
Copy link
Collaborator

cnpetra commented Feb 9, 2023

@nychiang is this for MA57?
No. That is from PNNL CI, in which @pelesh runs the convex problem with Ginkgo and without outer IR. I think this is a corner case, where Ginkgo fails the FR problem (probably the 1st iter) and returns NAN and infeas_pr=infeas_du=0 (initial value) . @cnpetra

Judging from the detailed output, HiOp enters FR because of the nans.

Very likely the linear solve fails and gives nan search direction, since the nans first appear when the residual is computed. I would say that one of the triangular solves fail. BiCGStab IR may make it more likely for this issue to appear since it does multiple backsolves (one with each triangular factor) and it does it for decreasingly small right-hand sides. @fritzgoebel

@pelesh
Copy link
Collaborator

pelesh commented Feb 9, 2023

BiCGStab IR may make it more likely for this issue to appear since it does multiple backsolves (one with each triangular factor) and it does it for decreasingly small right-hand sides

Actually it is comparison with nan that gives false positive (see #594). IR seems to keep the numbers finite and then fails after max number of iterations is exceeded thus avoiding comparison with nans.

@cnpetra
Copy link
Collaborator

cnpetra commented Feb 10, 2023

BiCGStab IR may make it more likely for this issue to appear since it does multiple backsolves (one with each triangular factor) and it does it for decreasingly small right-hand sides

Actually it is comparison with nan that gives false positive (see #594). IR seems to keep the numbers finite and then fails after max number of iterations is exceeded thus avoiding comparison with nans.

my comment was not about the false positive issue, but about the source of nans. Just talked to @nychiang and we're pretty sure it originates in the linear solver.

@fritzgoebel
Copy link
Contributor Author

fritzgoebel commented Feb 10, 2023

I rebased onto develop and changed the triangular solver choice to default to the Ginkgo implementation. It still seems to need some regularization, but contrary to the cusparse solver at least does not generate nans.
Please confirm if this fixes the issue.

@pelesh
Copy link
Collaborator

pelesh commented Feb 15, 2023

@fritzgoebel: You need to rebase this branch to trigger newly configured CI pipelines. The old ones don't work anymore. Once you do that and tests pass, we can merge this PR.

Copy link
Collaborator

@pelesh pelesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All CI pipelines pass. This branch is well tested (it was used for recent HiOp profiling) and it is safe to merge.

BUILD.sh Show resolved Hide resolved
@cnpetra cnpetra self-requested a review February 17, 2023 16:59
src/Utils/hiopOptions.cpp Outdated Show resolved Hide resolved
src/LinAlg/hiopLinSolverSparseGinkgo.cpp Outdated Show resolved Hide resolved
src/LinAlg/hiopLinSolverSparseGinkgo.cpp Show resolved Hide resolved
src/LinAlg/hiopLinSolverSparseGinkgo.cpp Outdated Show resolved Hide resolved
@cnpetra cnpetra merged commit fc91d88 into develop Feb 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants