Constrained optimization episode 2: revenge of the slack variables #303

Closed
wants to merge 40 commits into
base: master
from

Conversation

9 participants
@timholy
Contributor

timholy commented Nov 8, 2016

No matter how the American election turns out, today we have something to celebrate: the beginning of what I hope will be a merge-worthy native Julia constrained optimization framework.

Long ago I said that I would not work on hard-core optimization without a debugger (errors triggered 1 hour into an optimization run are too painful to fix otherwise), but @Keno has delivered, so now it's time for me to pony up. Besides, I've grown tired of maintaining my own fork of Optim and want access to all the goodies and cleanups (especially the cleanups) that others have contributed.

Rather than implementing the whole framework at once, I decided to try to keep things "simple" (if a 1600-2000 line diff can be called simple) by implementing just one method. I picked the interior-point Newton method, since that seems to be the most widely used and/or described method, and for which the iteration is relatively simple. There are also many, many bells and whistles one could add. But I want to get a basic implementation of something merged before going on to other things.

Compared to #50, there are almost too many differences to mention, as this was a clean start. A few of the highlights:

  • it's more in line with the MathProgBase interface (still not a perfect match, but pretty close), for example in how the bounds are specified and how the constraint hessian is requested.
  • it supports both equality and inequality constraints
  • it should exhibit far greater numerical stability near the edges of the feasible region (this has been a big problem with the approach in #50)

It's also worth briefly mentioning that this might have some nice features, though it needs real-world testing for us to know whether these will materialize. For example, unlike Ipopt this is guaranteed to roughly preserve the descent direction of the user's objective function (it will not choose such a high barrier coefficient that it pushes the solution into a different basin).
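To make the "preserve the descent direction" idea concrete, here is a toy sketch (illustrative only, not this PR's actual code; `initial_mu` and `θ` are made-up names): choose μ so the barrier's gradient contribution stays a small fraction of the objective's gradient at the starting point.

```julia
using LinearAlgebra

# Toy sketch: cap the barrier coefficient μ so the barrier gradient cannot
# overwhelm the objective gradient at x0, keeping the first step pointed
# (roughly) along the user's descent direction. θ is an assumed fraction.
function initial_mu(gf::AbstractVector, gbarrier::AbstractVector; θ = 0.1)
    nb = norm(gbarrier)
    nb == 0 && return one(eltype(gf))   # no active barrier terms
    return θ * norm(gf) / nb
end

gf = [4.0, -2.0]     # hypothetical ∇f(x0)
gb = [100.0, 0.0]    # hypothetical barrier gradient near a bound
μ = initial_mu(gf, gb)
@assert norm(μ .* gb) ≤ 0.1 * norm(gf) + eps()
```

With μ capped this way the barrier term perturbs, rather than dominates, the first Newton step, which is the property claimed above.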

Testing is a mixed bag: there are quite a few tests for the low-level mathematics, but overall tests (e.g., ones that involve a call to optimize) are a work in progress. I'm hoping to make use of CUTEst (JuliaSmoothOptimizers/CUTEst.jl#43 (comment)) and do a fairly thorough job, but that will take some time.

It's worth noting a few details:

  • argument order may be somewhat confused; in some places I adopt mutated-first, in others (where I'm trying to match some preexisting API) it's last. It's my understanding that Optim will eventually migrate to mutated-first, but perhaps this is too inconsistent now
  • I needed to customize the line search algorithm to account for constraints, so this goes against the migration to LineSearches.jl and reintroduces a linesearch method to Optim. Aside from simplicity in making updates (I don't want to be bothered with tagging another package just to make tests for an RFC pass), I kept this here because I was also trying to move towards a saner linesearch API, and so it doesn't fit with our classic functions. (It's based on the fact that anonymous functions are fast with Julia 0.5 and higher---if people are unhappy with the overhead this causes on 0.4, tough. We have to write code for the future, not the past.)
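For concreteness, here is a minimal sketch of the closure-based linesearch style mentioned above (names and signature are illustrative, not the PR's actual API): the caller packages the one-dimensional restriction ϕ(α) = f(x + α*d) as an anonymous function, so the linesearch never needs to know about x, d, or the objective's calling convention.

```julia
# Minimal backtracking linesearch over a closure ϕ(α) = f(x + α*d).
# ϕ0 = ϕ(0) and dϕ0 = ϕ'(0) < 0 are supplied by the caller.
function backtracking(ϕ, ϕ0, dϕ0; α = 1.0, ρ = 0.5, c = 1e-4)
    while ϕ(α) > ϕ0 + c * α * dϕ0   # Armijo sufficient-decrease condition
        α *= ρ
    end
    return α
end

f(x) = sum(abs2, x)
x = [1.0, 1.0]
d = [-1.0, -1.0]                     # a descent direction for f at x
ϕ(α) = f(x .+ α .* d)
α = backtracking(ϕ, f(x), -4.0)      # ϕ'(0) = 2x⋅d = -4
@assert f(x .+ α .* d) < f(x)
```

With a constrained method, the same closure style lets the linesearch fold in extra conditions (e.g., shrinking α until x + α*d stays strictly interior) without changing the calling convention.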

CC @Cody-G, @dpo, @abelsiqueira.

timholy added some commits Nov 4, 2016

Implement barrier function and equality-constraints
Computes and tests the gradient, too. The Hessian will come later.
Add a principled initialization for μ and λ
Based on the notion that we want to (largely) preserve the objective function's initial descent direction.
@johnmyleswhite

Contributor

johnmyleswhite commented Nov 8, 2016

Tim Holy for BDFL!

@johnmyleswhite johnmyleswhite self-assigned this Nov 8, 2016

@timholy

Contributor

timholy commented Nov 8, 2016

Noticed that @tkelman doesn't watch this repo, and that he provided very useful feedback on my previous attempt.

@tkelman

tkelman commented Nov 8, 2016

is this a primal algorithm or primal-dual?

@pkofod

Collaborator

pkofod commented Nov 8, 2016

Wonderful. We need more positive emoji-"like"-types I see.

argument order may be somewhat confused; in some places I adopt mutated-first, in others (where I'm trying to match some preexisting API) it's last. It's my understanding that Optim will eventually migrate to mutated-first, but perhaps this is too inconsistent now

Mutated first seems quite standard, so I'll be happy to go through the code and clean it up. Would it be annoying if I did this now?

it supports both equality and inequality constraints
👍

I'll have a closer look one of these days.

@timholy

Contributor

timholy commented Nov 8, 2016

is this a primal algorithm or primal-dual?

I'm so glad you asked, because as I've learned more I've grown increasingly confused about this point. I form the KKT equations and solve them, so I suspect that many would call this primal-dual. However, the matrices in the primal-dual equations are asymmetric (e.g., the equation after eq. 5), and for generic nonlinear optimization that's problematic because you usually want to enforce positive-definiteness somewhere. So, at least some algorithms (e.g., Ipopt and this one) use a step of elimination to solve for certain constraints, which makes the reduced problem symmetric and amenable to being forced to be positive-definite (e.g., see Wächter & Biegler 2006, "On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming," eq. 11). From what I can tell, this basically converts the primal-dual equation back into something that's virtually a primal one---the resulting equation is almost identical to what you'd get if you went the primal route in the first place.

Can you clear this up?

@tkelman

tkelman commented Nov 8, 2016

I think you've partially misread the Ipopt paper. It doesn't force or require anything to be positive definite - if it did, it would do a very poor job of solving non convex problems. It's not really constraints that are being eliminated, but complementarity rows of the Newton step direction of the KKT conditions. The KKT linear system that Ipopt solves is symmetric, but indefinite. It requires using a sparse Bunch Kaufman factorization in order to obtain the inertia of the KKT system at each iteration. It does a perturbation if the inertia does not give n positive and m negative eigenvalues, where n is the number of primal variables and m is the number of constraints.
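The inertia test described here can be illustrated in a few lines. This is a toy sketch, not Ipopt's implementation: real solvers read the inertia off a sparse Bunch-Kaufman/LDLᵀ factorization rather than computing eigenvalues, and they also regularize the constraint block.

```julia
using LinearAlgebra

# Toy inertia correction: perturb the Hessian block by δ*I until the
# symmetric indefinite KKT matrix has n positive and m negative eigenvalues.
function correct_inertia(W, J; δ0 = 1e-4)
    n, m = size(W, 1), size(J, 1)
    δ = 0.0
    while true
        K = [W + δ*I  J'; J  zeros(m, m)]
        λs = eigvals(Symmetric(K))
        count(>(0), λs) == n && count(<(0), λs) == m && return δ
        δ = δ == 0 ? δ0 : 10δ      # grow the perturbation geometrically
    end
end

W = [-2.0 0.0; 0.0 1.0]   # indefinite Hessian block
J = [1.0 1.0]             # one equality constraint, so m = 1
δ = correct_inertia(W, J) # the reduced Hessian needs δ > 0.5 here
@assert δ ≈ 1.0
```

The example makes the earlier point concrete: when the curvature on the constraint null space is genuinely negative (here -0.5), the perturbation that restores the (n, m) inertia is large, comparable to the Hessian itself.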

The main difference between primal and primal-dual interior point, if I'm remembering the literature right, is the treatment of the barrier term for inequalities. If you replace inequalities with primal objective barrier terms before posing the KKT conditions and forming the Newton system, you have a primal method and the barrier term enters the objective Hessian in a way that squares the slack. For a primal dual method, you state the KKT conditions before approximating the inequalities with barrier terms. Instead you relax the complementarity condition to equal the barrier parameter instead of exactly zero. The contribution to the objective Hessian due to the barrier terms (which here aren't actually used in the derivation, it's purely a descriptive analogy) then depends on both the slack and dual variables. This is equivalent at a KKT point, but gives a different path to get there from an initial guess. Typically primal dual follows the central path better, and gives a Newton system that isn't quite as badly conditioned.

@codecov-io

codecov-io commented Nov 8, 2016

Current coverage is 85.64% (diff: 72.41%)

Merging #303 into master will decrease coverage by 4.95%

@@             master       #303   diff @@
==========================================
  Files            27         31     +4   
  Lines          1565       2160   +595   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           1418       1850   +432   
- Misses          147        310   +163   
  Partials          0          0          

Powered by Codecov. Last update ccc51d7...cada264

@timholy

Contributor

timholy commented Nov 9, 2016

Tim Holy for BDFL!

I was planning to decline, but given the results of the last few hours I might take you up on that offer, if it's still open. 😯

@tkelman, thanks for diving in. I'm learning a lot by thinking through your points.

It's not really constraints that are being eliminated, but complementarity rows of the Newton step direction of the KKT conditions.

Agreed. I simply meant it in the sense that you've introduced additional variables, and they too are subject to "constraints" (the perturbed complementarity condition). Also agreed my choice of words could be confusing, since there are also user-supplied constraints that are in a distinct category.

It doesn't force or require anything to be positive definite - if it did, it would do a very poor job of solving non convex problems.
It does a perturbation if the inertia does not give n positive and m negative eigenvalues, where n is the number of primal variables and m is the number of constraints.

OK, let's apply that to a simple non-convex problem, f(x) = -x^2 + x^4 with no constraints, so we have n=1 and m=0. The second derivative (hessian) is negative over the region |x| <= 1/sqrt(6). Ipopt will "perturb" the hessian until it has 1 positive "eigenvalue" (and there's only one eigenvalue in this case). Now, in this case it's very clear that the negativity of the hessian has nothing to do with roundoff error, so the "perturbation" will be large, bigger than the hessian itself. That's what I mean by "forcing nonnegativity." (I've also argued elsewhere that this very common approach to generating nonnegativity---adding a positive diagonal until the matrix is just barely positive-definite---has serious weaknesses, because it throws away useful information about the length scales over which your function is likely to change, effectively swapping "clearly negative" directions in which the objective changes over short length scales with "nearly flat" directions that require big jumps to produce changes in the objective. In my opinion we have a much better approach available in PositiveFactorizations.jl. But that's a separate discussion.)

I agree, though, that you can't impose positive-definiteness to the total problem, and that the equality constraints are what prevent that. In this PR I only impose nonnegativity on the hessian for the objective + terms that come from the inequalities. I don't have a convergence proof yet, and I wonder if I should somehow project out the components of the equality constraints first (see the discussion in the Ipopt paper following eq. 12). That's totally doable (at least for dense problems), but before worrying too much about that, I thought I'd see how the simple version fared in tests.

As far as the relationship between the primal problem and the primal-dual problem, AFAICT here's the difference: adopting the problem and approximately the notation on this page (i.e., temporarily ignoring equality constraints), and writing C = diag(c(x)), Λ = diag(λ), and J for the constraint Jacobian, the Newton equation from following the primal route is

    (W + μ J^T C^{-2} J) Δx = -(∇f - μ J^T C^{-1} 1)

and the one from the dual route, after solving for Δλ and substituting it into the other equation, is

    (W + J^T C^{-1} Λ J) Δx = -(∇f - μ J^T C^{-1} 1)
(Note the wikipedia page is a little misleading in calling W the hessian matrix of B, it's really the coefficient of Δx in the first-order expansion of their eq 5, which is not the same thing.) If you apply the complementarity condition to this, a short calculation shows that these are the same. So AFAICT the dual method is simply a "lagged primal" update, where the complementarity condition is not exactly satisfied at all stages of the update but is instead "frozen" at the base point (and even the base point might not satisfy it exactly). So to the extent that the dual method is better, it arises from the "ability" to not exactly satisfy the perturbed complementarity equation.
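This claimed equivalence is easy to spot-check numerically (the problem data below is made up; C = diag(c(x)) and Λ = diag(λ) as in the discussion above):

```julia
using LinearAlgebra

μ = 0.1
c = [0.5, 2.0]              # inequality constraint values at the base point
J = [1.0 0.0; 0.0 1.0]      # constraint Jacobian (identity for simplicity)
W = [2.0 0.3; 0.3 1.0]      # Hessian of the Lagrangian
C = Diagonal(c)

# With λ satisfying complementarity (λᵢcᵢ = μ), the primal matrix
# W + μ J^T C^{-2} J equals the condensed primal-dual one W + J^T C^{-1} Λ J:
λ = μ ./ c
primal = W + μ * J' * inv(C)^2 * J
dual   = W + J' * inv(C) * Diagonal(λ) * J
@assert primal ≈ dual

# With a "lagged" λ that violates complementarity, the two differ:
λlag = [0.3, 0.01]
@assert !(primal ≈ W + J' * inv(C) * Diagonal(λlag) * J)
```

The second assertion is exactly the "lagged primal" observation: the primal-dual matrix depends on the current λ, and only collapses to the primal one when λ = μ C⁻¹ 1.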

@tkelman

tkelman commented Nov 9, 2016

Ah good point, the interpretation of ensuring positive-definiteness of the Hessian block "projected onto the null space of the constraint Jacobian" works. I keep forgetting about that way of looking at it because that's not how it's calculated, unless you look at the unconstrained case.

In my opinion we have a much better approach available in PositiveFactorizations.jl

With apologies for my skepticism, but I'd personally like to see a convergence proof make it through peer review, and/or demonstrated results across common test sets, before being totally sold on the approach. Pure Julia implementation, abstraction and API design is hard enough work that simultaneous innovation on the algorithm side is a bit beyond my own risk tolerance when comparing to known-quantity existing solvers, but that's coming from someone who isn't in academia any more. You must have a motivating application that you've been developing this for and using it on already, right? Would be interesting if you could point to any details about the kinds of problems where this has been working well, in terms of size, convexity, number of constraints, Hessian and Jacobian structure, etc.

Looks like all the Hessian and Jacobian fields here are dense matrices? I hope that's a planned-to-be-temporary limitation?

@tkelman

tkelman commented Nov 9, 2016

Note the wikipedia page is a little misleading in calling W the hessian matrix of B, it's really the coefficient of Δx in the first-order expansion of their eq 5, which is not the same thing.

How do they differ?

@timholy

Contributor

timholy commented Nov 9, 2016

With apologies for my skepticism, but I'd personally like to see a convergence proof make it through peer review, and/or demonstrated results across common test sets, before being totally sold on the approach.

I want those things too, and hopefully I'll get to it eventually. But surely you understand the problem? After correction, a Hessian [-k 0; 0 ϵ] becomes [ϵ 0; 0 k+2ϵ], which means that you're modeling the function as having strong curvature in x_2 and weak curvature in x_1 whereas the reality is that it has strong curvature in x_1 and weak curvature in x_2. So your Newton step will be huge in x_1, and then you'll be puzzled about why the function is behaving entirely differently from what you expected---now you have a huge backtrack on your hands. There's no world in which that's a good thing. That doesn't prove that my approach is better, of course, just that the common approach has obvious limitations and there's a ton of room for doing better.
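The argument is easy to reproduce numerically. Below is a sketch: the eigenvalue-flip fix stands in for PositiveFactorizations.jl's behavior on this diagonal example and is not its actual algorithm.

```julia
using LinearAlgebra

k, ϵ = 5.0, 1e-3
g = [1.0, 1.0]                       # gradient at the current point
H = [-k 0.0; 0.0 ϵ]                  # the indefinite Hessian from the example

Hshift = H + (k + ϵ) * I             # diagonal shift: becomes [ϵ 0; 0 k+2ϵ]
Hflip  = [k 0.0; 0.0 ϵ]              # flip the sign of the negative eigenvalue

step_shift = -(Hshift \ g)           # ≈ -1/ϵ in x₁: a huge step along the
                                     # direction of genuinely strong curvature
step_flip  = -(Hflip \ g)            # ≈ -1/k in x₁: a modest step there

@assert abs(step_shift[1]) > 100 * abs(step_flip[1])
```

Note the flipped Hessian still takes its big step, but along x₂, the direction that really is nearly flat, which is arguably where a big move belongs.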

How do they differ?

It doesn't yet have the JDJ^T term, that comes about by solving and substituting for Δλ.

@timholy

Contributor

timholy commented Nov 9, 2016

Looks like all the Hessian and Jacobian fields here are dense matrices? I hope that's a planned-to-be-temporary limitation?

AFAICT that's true of all of Optim. Subject of another PR, and it doesn't have to be me. I use L-BFGS, so none of this is relevant anyway. I'm just trying to get something to pass muster, and on #50 people didn't like the fact that it wasn't like the conventional Newton approach. So I'm starting with Newton, even though I have no intention of using it for heavy-duty stuff.

@tkelman

tkelman commented Nov 9, 2016

It doesn't yet have the JDJ^T term, that comes about by solving and substituting for Δλ.

You'll get that term from the gradient of A^T λ since A is a function of x, no?

I use L-BFGS, so none of this is relevant anyway

For constrained problems too?

@timholy

Contributor

timholy commented Nov 9, 2016

You'll get that term from the gradient of A^T λ since A is a function of x, no?

No, that gives you the ∇∇c term.

For constrained problems too?

Yes.

@timholy timholy changed the title from (RFC) Constrained optimization episode 2: revenge of the slack variables to Constrained optimization episode 2: revenge of the slack variables Nov 12, 2016

@timholy

Contributor

timholy commented Nov 12, 2016

OK, I'd say this is no longer WIP: it seems to be working quite well on many test problems. I used CUTEst.jl and selected problems of 100 variables or less (and 100 constraints or less) that had second derivatives. Out of 293 such problems, it errored on 4 (I haven't yet dug into why), failed to converge on 16, and converged on the rest. On 32 problems, it found a better minimum than the one recorded in the CUTEst database, sometimes by quite a lot (possibly an error in the database, of course). Code is https://github.com/timholy/OptimTests.jl. I don't know if you want to incorporate that here, or exactly how you want this tested, but at least now this has been tested pretty thoroughly even if the code isn't part of this PR.

This isn't perfect, but I think it's good enough to be useful to folks.

timholy added some commits Nov 23, 2016

Switch to a predictor algorithm for computing μ
This was motivated by the observation that the deviation-based algorithm doesn't work when there's only one constraint. The predictor algorithm has a little trouble with this case too, but it's not nearly so severe (it *can* increase μ, you just have to prevent it from decreasing it to 0). Moreover, it seems a little more regular in its changes.
@timholy

Contributor

timholy commented Nov 23, 2016

Status update: lots of changes. Here's how we compare in terms of objective value to Ipopt:
[plot: histogram of objective-value differences between this PR and Ipopt]
We agree on ~90% of problems, and each is better than the other on another 5%. Looks like a wash to me.

Our progress on a per-iteration basis compared to Ipopt (plots the cumulative histogram of the fraction of iterations on which we're ahead):
[plot: cumulative histogram of the fraction of iterations on which we're ahead of Ipopt]
Perhaps ever-so-slightly worse than Ipopt, but in the same ballpark.

Total number of iterations:
[plot: total iteration counts for this PR vs Ipopt]
Both were set to a maximum of 3000 iterations. We're similar, but we do have more problems that reached the maximum number of iterations without converging. In cases I've checked, the algorithm actually found the right minimum but it didn't detect that it had converged. Some of these problems also don't converge if you run them in the unconstrained suite, so this isn't very specific to this PR.

While it's been a lot of work to get this far, I'm pretty stoked that with ~1 month of effort it looks like we've got something that (for dense problems) does not look too bad compared to the industry-standard solution. There are several places (particularly the linesearch) where we are also notably simpler than Ipopt, which I think is a good thing. I haven't benchmarked for speed, but I'm not going to worry about that yet. I'll focus on cleanup and test coverage and then merge this, and then work on speed if my own practical problems require it.

@anriseth

Contributor

anriseth commented Dec 13, 2016

Status update on this PR @timholy ?

@pkofod

Collaborator

pkofod commented Dec 16, 2016

I wonder why the number of iterations in fminbox differs between the Linux and Mac builds...

@pkofod

Collaborator

pkofod commented Dec 17, 2016

@timholy is there anything specific you need to clean up? Do tell if you need a hand, so we don't end up in another rebase from hell!

@pkofod

Collaborator

pkofod commented Dec 19, 2016

When you're ready, I suggest you squash some of the 40 commits. This will make handling of the merge conflicts easier, I think. I'll be happy to do the actual rebasing once you've squashed it a bit. The funny thing is that the PR that introduced the conflict is almost orthogonal to what happens here, but I still end up with weird conflicts when trying to rebase. For example, /test/constraints.jl doesn't even exist in master, but I still get merge conflicts when rebasing. So odd.

@timholy timholy referenced this pull request Dec 22, 2016

Merged

Make LinSpace generic #18777

@@ -17,8 +17,13 @@ module Optim
Base.setindex!
export optimize,

@pkofod

pkofod Feb 2, 2017

Collaborator

any reason ConstraintBounds is not exported?

push!(σ, 1)
push!(b, li)
end
ui = u[i]

@anriseth

anriseth Mar 6, 2017

Contributor

It seems like ui is already defined in line 472?

@musm

musm commented May 31, 2017

would be awesome :)

@pkofod

Collaborator

pkofod commented May 31, 2017

would be awesome :)

Unless @timholy decides to finish this himself (ABC, Tim, Always Be Closing!), I wouldn't say it's improbable that I merge some version of this during 2017. We've cleared some older issues recently, and after a more proper loosening of types, I think this is next on the ticket.

@timholy

Contributor

timholy commented May 31, 2017

I'd be happy with that. I haven't merged it because the natural next step is to implement this for something besides Newton (and it's what my lab would actually use). I started looking into that, and decided I didn't have the month to spare to tackle that 😄.

@pkofod

Collaborator

pkofod commented May 31, 2017

I haven't merged it because the natural next step is to implement this for something besides Newton (and it's what my lab would actually use).

What do you mean by the comment in parentheses? That your lab would use it if something besides Newton was implemented, or that your lab is using this as is?

@timholy

Contributor

timholy commented May 31, 2017

No, we're using the old teh/constrained branch, which supports more methods but has a much worse implementation.

@pkofod

Collaborator

pkofod commented May 31, 2017

No, we're using the old teh/constrained branch, which supports more methods but has a much worse implementation.

Alright, I follow. I think the first step will be to resolve conflicts (hopefully not as horrible as teh/constrained!!!), and merge this. Then we can always play around with other versions later.

@pkofod pkofod self-assigned this May 31, 2017

@timholy timholy referenced this pull request Jun 17, 2017

Closed

Manifold optimization #433

@anriseth anriseth referenced this pull request Oct 25, 2017

Open

Citing Optim.jl #478

@anriseth anriseth referenced this pull request Dec 9, 2017

Closed

API for fg! #439

@anriseth anriseth referenced this pull request Feb 7, 2018

Open

GSoC 2018 proposals #523

@anriseth

Contributor

anriseth commented Mar 3, 2018

The functionality from this PR is now available at https://github.com/JuliaNLSolvers/ConstrainedOptim.jl

I encourage everyone who is interested to try it out and let us know how it works for you.

The plan is to add some more constrained optimization code that is lying around to ConstrainedOptim.jl, and then move the code back into Optim at some point in the future.

@pkofod pkofod closed this May 29, 2018
