Getting the docs up and running (#225)
* Initial docs commit.
pkofod committed Jun 21, 2016
1 parent e8e75a8 commit 24ad090
Showing 28 changed files with 878 additions and 1 deletion.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,3 +1,5 @@
benchmarks/graphs/*
*~
*.kate-swp
docs/build/
docs/site/
2 changes: 2 additions & 0 deletions .travis.yml
@@ -13,3 +13,5 @@ script:
- julia -e 'Pkg.clone(pwd()); Pkg.test("Optim", coverage=true)'
after_success:
- julia -e 'cd(Pkg.dir("Optim")); Pkg.add("Coverage"); using Coverage; Codecov.submit(process_folder())'
- julia -e 'Pkg.add("Documenter")'
- julia -e 'cd(Pkg.dir("Optim")); include(joinpath("docs", "make.jl"))'
4 changes: 3 additions & 1 deletion NEWS.md
@@ -1,3 +1,5 @@
# Optim v0.6.0 release notes

* Added NEWS.md
* Added documentation generated by [Documenter.jl](https://github.com/JuliaDocs/Documenter.jl), see PR [225](https://github.com/JuliaOpt/Optim.jl/pull/225).
* Fixed a bug in the ConjugateGradient direction reset step, see issue [209](https://github.com/JuliaOpt/Optim.jl/issues/209).
4 changes: 4 additions & 0 deletions README.md
@@ -1,5 +1,9 @@

[![](https://img.shields.io/badge/docs-latest-blue.svg)](https://juliaopt.github.io/Optim.jl/latest)

*This is the development branch of Optim.jl. Please visit [this branch](https://github.com/JuliaOpt/Optim.jl/tree/v0.4.5) to find the README.md belonging to the latest official release of Optim.jl.*


Optim.jl
========

Binary file added docs/.documenter.enc
Binary file not shown.
12 changes: 12 additions & 0 deletions docs/make.jl
@@ -0,0 +1,12 @@
using Documenter, OptimDoc

# use include("Rosenbrock.jl") etc

# assuming linux.
#run('mv ../LICENSE.md ./LICENSE.md')
#run('mv ../CONTRIBUTING.md ./dev/CONTRIBUTING.md')
makedocs()

deploydocs(
repo = "github.com/JuliaOpt/Optim.jl.git"
)
51 changes: 51 additions & 0 deletions docs/mkdocs.yml
@@ -0,0 +1,51 @@
site_name: Optim.jl
repo_url: https://github.com/JuliaOpt/Optim.jl/
site_description: Pure Julia implementations of optimization algorithms.
site_author: JuliaOpt

theme: readthedocs


extra:
  palette:
    primary: 'indigo'
    accent: 'blue'

markdown_extensions:
- codehilite
- extra
- tables
- fenced_code

extra_javascript:
- https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML
- assets/mathjaxhelper.js

docs_dir: 'build'

pages:
  - Home: 'index.md'
  - General information:
      - Minimizing a function: 'user/minimization.md'
      - Configurable Options: 'user/config.md'
      - Tips and tricks: 'user/tipsandtricks.md'
      - Planned Changes: 'user/planned.md'
  - Algorithms:
      - Solvers:
          - Gradient Free:
              - Nelder Mead: 'algo/nelder_mead.md'
              - Simulated Annealing: 'algo/simulated_annealing.md'
#             - Univariate:
#                 - Brent's Method: 'algo/brent.md'
#                 - Golden Section: 'algo/goldensection.md'
          - Gradient Required:
#             - 'Conjugate Gradient': 'algo/conjugategradient.md'
              - 'Gradient Descent': 'algo/gradientdescent.md'
              - '(L-)BFGS': 'algo/lbfgs.md'
          - Hessian Required:
              - Newton: 'algo/newton.md'
      - Linesearch: 'algo/linesearch.md'
      - Preconditioners: 'algo/precondition.md'
  - 'Contributing':
      - 'Contributing': 'dev/contributing.md'
  - License: 'LICENSE.md'
Empty file added docs/src/.Rhistory
Empty file.
9 changes: 9 additions & 0 deletions docs/src/LICENSE.md
@@ -0,0 +1,9 @@
Optim.jl is licensed under the MIT License:

Copyright (c) 2012: John Myles White and other contributors.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
7 changes: 7 additions & 0 deletions docs/src/algo/brent.md
@@ -0,0 +1,7 @@
# Brent's Method
## Constructor

## Description
## Example
## References
R. P. Brent (2002) Algorithms for Minimization Without Derivatives. Dover edition.
14 changes: 14 additions & 0 deletions docs/src/algo/conjugategradient.md
@@ -0,0 +1,14 @@
# Conjugate Gradient Descent
## Constructor
```julia
ConjugateGradient(; linesearch! = hz_linesearch!,
eta = 0.4,
P = nothing,
precondprep! = (P, x) -> nothing)
```

## Description

## Example
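No example is filled in yet; as a placeholder, here is a minimal sketch of how the solver might be called. It assumes the keyword-style `optimize` call used elsewhere in these docs; the Rosenbrock objective and its hand-coded gradient are illustrative assumptions only:

```julia
using Optim

# Rosenbrock objective and its in-place gradient
f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
function g!(x, storage)
    storage[1] = -2.0 * (1.0 - x[1]) - 400.0 * (x[2] - x[1]^2) * x[1]
    storage[2] = 200.0 * (x[2] - x[1]^2)
end

res = optimize(f, g!, [0.0, 0.0], method = ConjugateGradient())
```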
## References
W. W. Hager and H. Zhang (2006) Algorithm 851: CG_DESCENT, a conjugate gradient method with guaranteed descent. ACM Transactions on Mathematical Software 32: 113-137.
5 changes: 5 additions & 0 deletions docs/src/algo/goldensection.md
@@ -0,0 +1,5 @@
# Golden Section
## Constructor
## Description
## Example
## References
29 changes: 29 additions & 0 deletions docs/src/algo/gradientdescent.md
@@ -0,0 +1,29 @@
# Gradient Descent
## Constructor
```julia
GradientDescent(; linesearch!::Function = hz_linesearch!,
P = nothing,
precondprep! = (P, x) -> nothing)
```
## Description
Gradient Descent is a common name for a quasi-Newton solver. This means that it takes
steps according to

$ x_{n+1} = x_n - P^{-1}\nabla f(x_n)$

where $P$ is a positive definite matrix. If $P$ is the Hessian, we get Newton's method.
In Gradient Descent, $P$ is simply an appropriately dimensioned identity matrix,
such that we go in the exact opposite direction of the gradient. This means
that we do not use the curvature information from the Hessian, or an approximation
of it. While it does seem quite logical to go in the opposite direction of the fastest
increase in objective value, the procedure can be very slow if the problem is ill-conditioned.
See the section on preconditioners for ways to remedy this when using Gradient Descent.

As with the other quasi-Newton solvers in this package, a scalar $\alpha$ is introduced
as follows

$ x_{n+1} = x_n - \alpha P^{-1}\nabla f(x_n)$

and is chosen by a linesearch algorithm such that each step gives sufficient descent.
## Example
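A minimal sketch, assuming the keyword-style `optimize` call used elsewhere in these docs; the Rosenbrock objective is only an illustration:

```julia
using Optim

f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
function g!(x, storage)
    storage[1] = -2.0 * (1.0 - x[1]) - 400.0 * (x[2] - x[1]^2) * x[1]
    storage[2] = 200.0 * (x[2] - x[1]^2)
end

# Gradient Descent can be slow on ill-conditioned problems such as
# Rosenbrock, so a generous iteration limit is set here.
res = optimize(f, g!, [0.0, 0.0], method = GradientDescent(), iterations = 10_000)
```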
## References
1 change: 1 addition & 0 deletions docs/src/algo/index.md
@@ -0,0 +1 @@
# Solvers
43 changes: 43 additions & 0 deletions docs/src/algo/lbfgs.md
@@ -0,0 +1,43 @@
# (L-)BFGS
This page contains information about BFGS and its limited memory version L-BFGS.
## Constructors
```julia
BFGS(; linesearch! = hz_linesearch!,
P = nothing,
precondprep! = (P, x) -> nothing)
```

```julia
LBFGS(; m = 10,
linesearch! = hz_linesearch!,
P = nothing,
precondprep! = (P, x) -> nothing)
```
## Description
BFGS and L-BFGS are quasi-Newton solvers. This means that they take steps according to

$ x_{n+1} = x_n - P^{-1}\nabla f(x_n)$

where $P$ is a positive definite matrix. If $P$ is the Hessian, we get Newton's method.
In (L-)BFGS, the matrix is an approximation to the Hessian built using differences
in the gradient across iterations. As long as the initial matrix is positive definite
it is possible to show that all the following matrices will be as well. The starting
matrix could simply be the identity matrix, such that the first step is identical
to the Gradient Descent algorithm, or even the actual Hessian.

There are two versions of BFGS in the package: BFGS, and L-BFGS. The latter is different
from the former because it doesn't use a complete history of the iterative procedure to
construct $P$, but rather only the latest $m$ steps. It doesn't actually build the Hessian
approximation matrix either, but computes the direction directly. This makes it more suitable for
large scale problems, as the memory requirement to store the relevant vectors will
grow quickly in large problems.

As with the other quasi-Newton solvers in this package, a scalar $\alpha$ is introduced
as follows

$ x_{n+1} = x_n - \alpha P^{-1}\nabla f(x_n)$

and is chosen by a linesearch algorithm such that each step gives sufficient descent.
## Example
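A minimal sketch of calling both variants, assuming the keyword-style `optimize` call used elsewhere in these docs; the Rosenbrock objective is only an illustration:

```julia
using Optim

f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
function g!(x, storage)
    storage[1] = -2.0 * (1.0 - x[1]) - 400.0 * (x[2] - x[1]^2) * x[1]
    storage[2] = 200.0 * (x[2] - x[1]^2)
end

res_bfgs  = optimize(f, g!, [0.0, 0.0], method = BFGS())
# keep only the 5 most recent steps instead of the default m = 10
res_lbfgs = optimize(f, g!, [0.0, 0.0], method = LBFGS(m = 5))
```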
## References
Wright, Stephen, and Jorge Nocedal (2006) "Numerical optimization." Springer
17 changes: 17 additions & 0 deletions docs/src/algo/linesearch.md
@@ -0,0 +1,17 @@
# Line search
## Description

### Available line search algorithms

* `hz_linesearch!` , the default line search algorithm
* `backtracking_linesearch!`
* `interpolating_linesearch!`
* `mt_linesearch!`

The default line search algorithm is taken from the Conjugate Gradient implementation
by Hager and Zhang (HZ).

## Example
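A line search is selected through the `linesearch!` keyword of a solver constructor. A hedged sketch, assuming the constructors shown on the solver pages; the Rosenbrock objective is only an illustration:

```julia
using Optim

f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
function g!(x, storage)
    storage[1] = -2.0 * (1.0 - x[1]) - 400.0 * (x[2] - x[1]^2) * x[1]
    storage[2] = 200.0 * (x[2] - x[1]^2)
end

# swap the default hz_linesearch! for backtracking
res = optimize(f, g!, [0.0, 0.0],
               method = GradientDescent(linesearch! = backtracking_linesearch!))
```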
## References
W. W. Hager and H. Zhang (2006) "Algorithm 851: CG_DESCENT, a conjugate gradient method with guaranteed descent." ACM Transactions on Mathematical Software 32: 113-137.
Wright, Stephen, and Jorge Nocedal (2006) "Numerical optimization." Springer
38 changes: 38 additions & 0 deletions docs/src/algo/nelder_mead.md
@@ -0,0 +1,38 @@
# Nelder-Mead
Nelder-Mead is currently the standard algorithm when no derivatives are provided.
## Constructor
```julia
NelderMead(; a = 1.0,
g = 2.0,
b = 0.5)
```
## Description
Our current implementation of the Nelder-Mead algorithm follows the original implementation
very closely, see Nelder and Mead (1965). This means that there is scope for improvement, but
also that it should be quite clear what is going on in the code relative to the original paper.

Instead of using gradient information, we keep track of the function value at a number
of points in the search space. Together, the points form a simplex. Given a simplex,
we can perform one of four actions: reflect, expand, contract, or shrink. Basically,
the goal is to iteratively replace the worst point with a better point. More information
can be found in Nelder and Mead (1965) or Gao and Han (2010).

The stopping rule is the same as in the original paper, and is basically the standard
error of the function values at the vertices. To set the tolerance level for this
convergence criterion, set the `g_tol` level as described in the Configurable Options
section.

When the solver finishes, we return a minimizer which is either the centroid or one of the vertices.
The function value at the centroid adds a function evaluation, as we need to evaluate the objective
at the centroid to choose the smallest function value. However, even if the function value at the centroid can be returned
as the minimum, we do not trace it during the optimization iterations. This avoids
too many evaluations of the objective function, which can be computationally expensive.
Typically, there should be no more than twice as many `f_calls` as `iterations`,
and adding an evaluation at the centroid when tracing could considerably increase the total
run-time of the algorithm.

## Example
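A minimal sketch, assuming the keyword-style `optimize` call used elsewhere in these docs; the Rosenbrock objective is only an illustration:

```julia
using Optim

# no gradient is needed; only objective values are used
f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
res = optimize(f, [0.0, 0.0], method = NelderMead())
```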

## References
Nelder, John A. and R. Mead (1965). "A simplex method for function minimization". Computer Journal 7: 308–313. doi:10.1093/comjnl/7.4.308.
Gao, Fuchang and Lixing Han (2010). "Implementing the Nelder-Mead simplex algorithm with adaptive parameters". Computational Optimization and Applications [DOI 10.1007/s10589-010-9329-3]
59 changes: 59 additions & 0 deletions docs/src/algo/newton.md
@@ -0,0 +1,59 @@
# Newton's Method
## Constructor
```julia
Newton(; linesearch! = hz_linesearch!)
```

The constructor takes one keyword

* `linesearch!`, a function performing line search with signature `(d, x, p, x_new, g_new, lsr, c, mayterminate)`; see the line search section.

## Description
Newton's method for optimization has a long history, and is in some sense the
gold standard in unconstrained optimization of smooth functions, at least from a theoretical viewpoint.
The main benefit is that it has a quadratic rate of convergence near a local optimum. The main
disadvantage is that the user has to provide a Hessian. This can be difficult, complicated, or simply annoying.
It can also be computationally expensive to calculate it.

Newton's method for optimization consists of applying Newton's method for solving
systems of equations, where the equations are the first order conditions, saying
that the gradient should equal the zero vector.

$ \nabla f(x) = 0 $

A second order Taylor expansion of the left-hand side leads to the iterative scheme

$ x_{n+1} = x_n - H(x_n)^{-1}\nabla f(x_n)$

where the inverse is not calculated directly; instead, the step $\textbf{s}$ is obtained by solving

$ H(x_n) \textbf{s} = \nabla f(x_n) $.

This is equivalent to minimizing a quadratic model $m_n$ around the current $x_n$

$ m_n(s) = f(x_n) + \nabla f(x_n)^\top \textbf{s} + \frac{1}{2} \textbf{s}^\top H(x_n) \textbf{s} $

For functions where $H(x_n)$ is difficult, or computationally expensive to obtain, we might
replace the Hessian with another positive definite matrix that approximates it.
Such methods are called Quasi-Newton methods; see (L-)BFGS and Gradient Descent.

In a sufficiently small neighborhood around the minimizer, Newton's method has
quadratic convergence, but globally it might have slower convergence, or it might
even diverge. To ensure convergence, a line search is performed for each $\textbf{s}$.
This amounts to replacing the step formula above with

$ x_{n+1} = x_n - \alpha \textbf{s}$

and finding a scalar $\alpha$ such that we get sufficient descent; see the line search section for more information.

Additionally, if the function is locally
concave, the step taken in the formulas above will go in a direction of ascent,
as the Hessian will not be positive (semi)definite.
To avoid this, we use a specialized method to calculate the step direction. If
the Hessian is positive semidefinite then the method used is standard, but if
it is not, a correction is made using the functionality in [PositiveFactorizations.jl](https://github.com/timholy/PositiveFactorizations.jl).

## Example
show the example from the issue
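Until that example is added, here is a minimal hedged sketch, assuming the keyword-style `optimize` call used elsewhere in these docs; the Rosenbrock objective with hand-coded gradient and Hessian is illustrative only:

```julia
using Optim

f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
function g!(x, storage)
    storage[1] = -2.0 * (1.0 - x[1]) - 400.0 * (x[2] - x[1]^2) * x[1]
    storage[2] = 200.0 * (x[2] - x[1]^2)
end
function h!(x, storage)
    storage[1, 1] = 2.0 - 400.0 * x[2] + 1200.0 * x[1]^2
    storage[1, 2] = -400.0 * x[1]
    storage[2, 1] = -400.0 * x[1]
    storage[2, 2] = 200.0
end

res = optimize(f, g!, h!, [0.0, 0.0], method = Newton())
```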

## References
Binary file added docs/src/algo/plap.png
Binary file not shown.
58 changes: 58 additions & 0 deletions docs/src/algo/precondition.md
@@ -0,0 +1,58 @@
# Preconditioning

The `GradientDescent`, `ConjugateGradient` and `LBFGS` methods support preconditioning. A preconditioner
can be thought of as a change of coordinates under which the Hessian is better conditioned. With a
good preconditioner substantially improved convergence is possible.

A preconditioner `P` can be of any type as long as the following two methods are
implemented:

* `A_ldiv_B!(pgr, P, gr)` : apply `P` to a vector `gr` and store in `pgr`
(intuitively, `pgr = P \ gr`)
* `dot(x, P, y)` : the inner product induced by `P`
(intuitively, `dot(x, P * y)`)

Precisely what these operations mean depends on how `P` is stored. Commonly, we store a matrix `P` which
approximates the Hessian in some vague sense. In this case,

* `A_ldiv_B!(pgr, P, gr) = copy!(pgr, P \ gr)`
* `dot(x, P, y) = dot(x, P * y)`

Finally, it is possible to update the preconditioner as the state variable `x`
changes. This is done through `precondprep!`, which is passed to the
optimizers as a keyword argument, e.g.,
```jl
method=ConjugateGradient(P = precond(100), precondprep! = precond(100))
```
though in this case it would always return the same matrix.
(See `fminbox.jl` for a more natural example.)

Apart from preconditioning with matrices, `Optim.jl` provides
a type `InverseDiagonal`, which represents a diagonal matrix by
its inverse elements.

## Example
Below, we see an example where a function is minimized without and with a preconditioner
applied.
```jl
using ForwardDiff
plap(U; n = length(U)) = (n-1)*sum((0.1 + diff(U).^2).^2 ) - sum(U) / (n-1)
plap1 = ForwardDiff.gradient(plap)
precond(n) = spdiagm((-ones(n-1), 2*ones(n), -ones(n-1)), (-1,0,1), n, n)*(n+1)
df = DifferentiableFunction(x -> plap([0; x; 0]),
                            (x, g) -> copy!(g, (plap1([0; x; 0]))[2:end-1]))
result = Optim.optimize(df, zeros(100), method = ConjugateGradient(P = nothing))
result = Optim.optimize(df, zeros(100), method = ConjugateGradient(P = precond(100)))
```
The former optimize call converges at a slower rate than the latter. Looking at a
plot of the 2D version of the function shows the problem.

![plap](./plap.png)

The contours are shaped like ellipsoids, but we would prefer them to be circles.
Using the preconditioner effectively changes the coordinates such that the contours
become less ellipsoid-like. Benchmarking shows that using preconditioning provides
an approximate speed-up factor of 15 in this 100-dimensional case.


## References
