Getting the docs up and running (#225)
* Initial docs commit.
pkofod committed Jun 21, 2016
1 parent e8e75a8 commit 24ad090
Showing 28 changed files with 878 additions and 1 deletion.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,3 +1,5 @@
benchmarks/graphs/*
*~
*.kate-swp
docs/build/
docs/site/
2 changes: 2 additions & 0 deletions .travis.yml
@@ -13,3 +13,5 @@ script:
- julia -e 'Pkg.clone(pwd()); Pkg.test("Optim", coverage=true)'
after_success:
- julia -e 'cd(Pkg.dir("Optim")); Pkg.add("Coverage"); using Coverage; Codecov.submit(process_folder())'
- julia -e 'Pkg.add("Documenter")'
- julia -e 'cd(Pkg.dir("Optim")); include(joinpath("docs", "make.jl"))'
4 changes: 3 additions & 1 deletion NEWS.md
@@ -1,3 +1,5 @@
# Optim v0.6.0 release notes

* Added NEWS.md
* Added documentation generated by [Documenter.jl](https://github.com/JuliaDocs/Documenter.jl), see PR [225](https://github.com/JuliaOpt/Optim.jl/pull/225).
* Fixed a bug in the ConjugateGradient direction reset step, see issue [209](https://github.com/JuliaOpt/Optim.jl/issues/209).
4 changes: 4 additions & 0 deletions README.md
@@ -1,5 +1,9 @@

[![](https://img.shields.io/badge/docs-latest-blue.svg)](https://juliaopt.github.io/Optim.jl/latest)

*This is the development branch of Optim.jl. Please visit [this branch](https://github.com/JuliaOpt/Optim.jl/tree/v0.4.5) to find the README.md belonging to the latest official release of Optim.jl.*


Optim.jl
========

Binary file added docs/.documenter.enc
Binary file not shown.
12 changes: 12 additions & 0 deletions docs/make.jl
@@ -0,0 +1,12 @@
using Documenter, OptimDoc

# use include("Rosenbrock.jl") etc

# assuming linux.
#run('mv ../LICENSE.md ./LICENSE.md')
#run('mv ../CONTRIBUTING.md ./dev/CONTRIBUTING.md')
makedocs()

deploydocs(
repo = "github.com/JuliaOpt/Optim.jl.git"
)
51 changes: 51 additions & 0 deletions docs/mkdocs.yml
@@ -0,0 +1,51 @@
site_name: Optim.jl
repo_url: https://github.com/JuliaOpt/Optim.jl/
site_description: Pure Julia implementations of optimization algorithms.
site_author: JuliaOpt

theme: readthedocs


extra:
  palette:
    primary: 'indigo'
    accent: 'blue'

markdown_extensions:
- codehilite
- extra
- tables
- fenced_code

extra_javascript:
- https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML
- assets/mathjaxhelper.js

docs_dir: 'build'

pages:
  - Home: 'index.md'
  - General information:
      - Minimizing a function: 'user/minimization.md'
      - Configurable Options: 'user/config.md'
      - Tips and tricks: 'user/tipsandtricks.md'
      - Planned Changes: 'user/planned.md'
  - Algorithms:
      - Solvers:
          - Gradient Free:
              - Nelder Mead: 'algo/nelder_mead.md'
              - Simulated Annealing: 'algo/simulated_annealing.md'
#             - Univariate:
#                 - Brent's Method: 'algo/brent.md'
#                 - Golden Section: 'algo/goldensection.md'
          - Gradient Required:
#             - 'Conjugate Gradient': 'algo/conjugategradient.md'
              - 'Gradient Descent': 'algo/gradientdescent.md'
              - '(L-)BFGS': 'algo/lbfgs.md'
          - Hessian Required:
              - Newton: 'algo/newton.md'
      - Linesearch: 'algo/linesearch.md'
      - Preconditioners: 'algo/precondition.md'
  - 'Contributing':
      - 'Contributing': 'dev/contributing.md'
  - License: 'LICENSE.md'
Empty file added docs/src/.Rhistory
Empty file.
9 changes: 9 additions & 0 deletions docs/src/LICENSE.md
@@ -0,0 +1,9 @@
Optim.jl is licensed under the MIT License:

Copyright (c) 2012: John Myles White and other contributors.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
7 changes: 7 additions & 0 deletions docs/src/algo/brent.md
@@ -0,0 +1,7 @@
# Brent's Method
## Constructor

## Description
## Example
## References
R. P. Brent (2002) Algorithms for Minimization Without Derivatives. Dover edition.
14 changes: 14 additions & 0 deletions docs/src/algo/conjugategradient.md
@@ -0,0 +1,14 @@
# Conjugate Gradient Descent
## Constructor
```julia
ConjugateGradient(; linesearch! = hz_linesearch!,
eta = 0.4,
P = nothing,
precondprep! = (P, x) -> nothing)
```

## Description

## Example
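No example is filled in yet; as a placeholder, here is a minimal sketch of how the solver might be called. It assumes the keyword-style `optimize` call used elsewhere in these docs; the Rosenbrock objective and its hand-coded gradient are illustrative assumptions only:

```julia
using Optim

# Rosenbrock objective and its in-place gradient
f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
function g!(x, storage)
    storage[1] = -2.0 * (1.0 - x[1]) - 400.0 * (x[2] - x[1]^2) * x[1]
    storage[2] = 200.0 * (x[2] - x[1]^2)
end

res = optimize(f, g!, [0.0, 0.0], method = ConjugateGradient())
```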
## References
W. W. Hager and H. Zhang (2006) Algorithm 851: CG_DESCENT, a conjugate gradient method with guaranteed descent. ACM Transactions on Mathematical Software 32: 113-137.
5 changes: 5 additions & 0 deletions docs/src/algo/goldensection.md
@@ -0,0 +1,5 @@
# Golden Section
## Constructor
## Description
## Example
## References
29 changes: 29 additions & 0 deletions docs/src/algo/gradientdescent.md
@@ -0,0 +1,29 @@
# Gradient Descent
## Constructor
```julia
GradientDescent(; linesearch!::Function = hz_linesearch!,
P = nothing,
precondprep! = (P, x) -> nothing)
```
## Description
Gradient Descent is a common name for a quasi-Newton solver. This means that it takes
steps according to

$ x_{n+1} = x_n - P^{-1}\nabla f(x_n)$

where $P$ is a positive definite matrix. If $P$ is the Hessian, we get Newton's method.
In Gradient Descent, $P$ is simply an appropriately dimensioned identity matrix,
such that we go in the exact opposite direction of the gradient. This means
that we do not use the curvature information from the Hessian, or an approximation
of it. While it does seem quite logical to go in the opposite direction of the fastest
increase in objective value, the procedure can be very slow if the problem is ill-conditioned.
See the section on preconditioners for ways to remedy this when using Gradient Descent.

As with the other quasi-Newton solvers in this package, a scalar $\alpha$ is introduced
as follows

$ x_{n+1} = x_n - \alpha P^{-1}\nabla f(x_n)$

and is chosen by a linesearch algorithm such that each step gives sufficient descent.
## Example
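A minimal sketch, assuming the keyword-style `optimize` call used elsewhere in these docs; the Rosenbrock objective is only an illustration:

```julia
using Optim

f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
function g!(x, storage)
    storage[1] = -2.0 * (1.0 - x[1]) - 400.0 * (x[2] - x[1]^2) * x[1]
    storage[2] = 200.0 * (x[2] - x[1]^2)
end

# Gradient Descent can be slow on ill-conditioned problems such as
# Rosenbrock, so a generous iteration limit is set here.
res = optimize(f, g!, [0.0, 0.0], method = GradientDescent(), iterations = 10_000)
```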
## References
1 change: 1 addition & 0 deletions docs/src/algo/index.md
@@ -0,0 +1 @@
# Solvers
43 changes: 43 additions & 0 deletions docs/src/algo/lbfgs.md
@@ -0,0 +1,43 @@
# (L-)BFGS
This page contains information about BFGS and its limited memory version L-BFGS.
## Constructors
```julia
BFGS(; linesearch! = hz_linesearch!,
P = nothing,
precondprep! = (P, x) -> nothing)
```

```julia
LBFGS(; m = 10,
linesearch! = hz_linesearch!,
P = nothing,
precondprep! = (P, x) -> nothing)
```
## Description
BFGS and L-BFGS are quasi-Newton solvers. This means that they take steps according to

$ x_{n+1} = x_n - P^{-1}\nabla f(x_n)$

where $P$ is a positive definite matrix. If $P$ is the Hessian, we get Newton's method.
In (L-)BFGS, the matrix is an approximation to the Hessian built using differences
in the gradient across iterations. As long as the initial matrix is positive definite
it is possible to show that all the following matrices will be as well. The starting
matrix could simply be the identity matrix, such that the first step is identical
to the Gradient Descent algorithm, or even the actual Hessian.

There are two versions of BFGS in the package: BFGS, and L-BFGS. The latter is different
from the former because it doesn't use a complete history of the iterative procedure to
construct $P$, but rather only the latest $m$ steps. It doesn't actually build the Hessian
approximation matrix either, but computes the direction directly. This makes it more suitable for
large scale problems, as the memory requirement to store the relevant vectors will
grow quickly in large problems.

As with the other quasi-Newton solvers in this package, a scalar $\alpha$ is introduced
as follows

$ x_{n+1} = x_n - \alpha P^{-1}\nabla f(x_n)$

and is chosen by a linesearch algorithm such that each step gives sufficient descent.
## Example
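A minimal sketch of calling both variants, assuming the keyword-style `optimize` call used elsewhere in these docs; the Rosenbrock objective is only an illustration:

```julia
using Optim

f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
function g!(x, storage)
    storage[1] = -2.0 * (1.0 - x[1]) - 400.0 * (x[2] - x[1]^2) * x[1]
    storage[2] = 200.0 * (x[2] - x[1]^2)
end

res_bfgs  = optimize(f, g!, [0.0, 0.0], method = BFGS())
# keep only the 5 most recent steps instead of the default m = 10
res_lbfgs = optimize(f, g!, [0.0, 0.0], method = LBFGS(m = 5))
```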
## References
Wright, Stephen, and Jorge Nocedal (2006) "Numerical optimization." Springer
17 changes: 17 additions & 0 deletions docs/src/algo/linesearch.md
@@ -0,0 +1,17 @@
# Line search
## Description

### Available line search algorithms

* `hz_linesearch!` , the default line search algorithm
* `backtracking_linesearch!`
* `interpolating_linesearch!`
* `mt_linesearch!`

The default line search algorithm is taken from the Conjugate Gradient implementation
by Hager and Zhang (HZ).

## Example
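A line search is selected through the `linesearch!` keyword of a solver constructor. A hedged sketch, assuming the constructors shown on the solver pages; the Rosenbrock objective is only an illustration:

```julia
using Optim

f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
function g!(x, storage)
    storage[1] = -2.0 * (1.0 - x[1]) - 400.0 * (x[2] - x[1]^2) * x[1]
    storage[2] = 200.0 * (x[2] - x[1]^2)
end

# swap the default hz_linesearch! for backtracking
res = optimize(f, g!, [0.0, 0.0],
               method = GradientDescent(linesearch! = backtracking_linesearch!))
```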
## References
W. W. Hager and H. Zhang (2006) "Algorithm 851: CG_DESCENT, a conjugate gradient method with guaranteed descent." ACM Transactions on Mathematical Software 32: 113-137.
Wright, Stephen, and Jorge Nocedal (2006) "Numerical optimization." Springer
38 changes: 38 additions & 0 deletions docs/src/algo/nelder_mead.md
@@ -0,0 +1,38 @@
# Nelder-Mead
Nelder-Mead is currently the standard algorithm when no derivatives are provided.
## Constructor
```julia
NelderMead(; a = 1.0,
g = 2.0,
b = 0.5)
```
## Description
Our current implementation of the Nelder-Mead algorithm follows the original implementation
very closely, see Nelder and Mead (1965). This means that there is scope for improvement, but
also that it should be quite clear what is going on in the code relative to the original paper.

Instead of using gradient information, we keep track of the function value at a number
of points in the search space. Together, the points form a simplex. Given a simplex,
we can perform one of four actions: reflect, expand, contract, or shrink. Basically,
the goal is to iteratively replace the worst point with a better point. More information
can be found in Nelder and Mead (1965) or Gao and Han (2010).

The stopping rule is the same as in the original paper, and is basically the standard
error of the function values at the vertices. To set the tolerance level for this
convergence criterion, set the `g_tol` level as described in the Configurable Options
section.

When the solver finishes, we return a minimizer which is either the centroid or one of the vertices.
The function value at the centroid adds a function evaluation, as we need to evaluate the objective
at the centroid to choose the smallest function value. However, even if the function value at the centroid can be returned
as the minimum, we do not trace it during the optimization iterations. This avoids
too many evaluations of the objective function, which can be computationally expensive.
Typically, there should be no more than twice as many `f_calls` as `iterations`,
and adding an evaluation at the centroid when tracing could considerably increase the total
run-time of the algorithm.

## Example
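A minimal sketch, assuming the keyword-style `optimize` call used elsewhere in these docs; the Rosenbrock objective is only an illustration:

```julia
using Optim

# no gradient is needed; only objective values are used
f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
res = optimize(f, [0.0, 0.0], method = NelderMead())
```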

## References
Nelder, John A. and R. Mead (1965). "A simplex method for function minimization". Computer Journal 7: 308–313. doi:10.1093/comjnl/7.4.308.
Gao, Fuchang and Lixing Han (2010). "Implementing the Nelder-Mead simplex algorithm with adaptive parameters". Computational Optimization and Applications [DOI 10.1007/s10589-010-9329-3]
59 changes: 59 additions & 0 deletions docs/src/algo/newton.md
@@ -0,0 +1,59 @@
# Newton's Method
## Constructor
```julia
Newton(; linesearch! = hz_linesearch!)
```

The constructor takes one keyword

* `linesearch!`, a function performing line search with signature `(d, x, p, x_new, g_new, lsr, c, mayterminate)`; see the line search section.

## Description
Newton's method for optimization has a long history, and is in some sense the
gold standard in unconstrained optimization of smooth functions, at least from a theoretical viewpoint.
The main benefit is that it has a quadratic rate of convergence near a local optimum. The main
disadvantage is that the user has to provide a Hessian. This can be difficult, complicated, or simply annoying.
It can also be computationally expensive to calculate it.

Newton's method for optimization consists of applying Newton's method for solving
systems of equations, where the equations are the first order conditions, saying
that the gradient should equal the zero vector.

$ \nabla f(x) = 0 $

A second order Taylor expansion of the left-hand side leads to the iterative scheme

$ x_{n+1} = x_n - H(x_n)^{-1}\nabla f(x_n)$

where the inverse is not calculated directly; instead, the step $\textbf{s}$ is obtained by solving

$ H(x_n) \textbf{s} = \nabla f(x_n) $.

This is equivalent to minimizing a quadratic model $m_n$ around the current $x_n$

$ m_n(s) = f(x_n) + \nabla f(x_n)^\top \textbf{s} + \frac{1}{2} \textbf{s}^\top H(x_n) \textbf{s} $

For functions where $H(x_n)$ is difficult, or computationally expensive to obtain, we might
replace the Hessian with another positive definite matrix that approximates it.
Such methods are called Quasi-Newton methods; see (L-)BFGS and Gradient Descent.

In a sufficiently small neighborhood around the minimizer, Newton's method has
quadratic convergence, but globally it might have slower convergence, or it might
even diverge. To ensure convergence, a line search is performed for each $\textbf{s}$.
This amounts to replacing the step formula above with

$ x_{n+1} = x_n - \alpha \textbf{s}$

and finding a scalar $\alpha$ such that we get sufficient descent; see the line search section for more information.

Additionally, if the function is locally
concave, the step taken in the formulas above will go in a direction of ascent,
as the Hessian will not be positive (semi)definite.
To avoid this, we use a specialized method to calculate the step direction. If
the Hessian is positive semidefinite then the method used is standard, but if
it is not, a correction is made using the functionality in [PositiveFactorizations.jl](https://github.com/timholy/PositiveFactorizations.jl).

## Example
show the example from the issue
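Until that example is added, here is a minimal hedged sketch, assuming the keyword-style `optimize` call used elsewhere in these docs; the Rosenbrock objective with hand-coded gradient and Hessian is illustrative only:

```julia
using Optim

f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
function g!(x, storage)
    storage[1] = -2.0 * (1.0 - x[1]) - 400.0 * (x[2] - x[1]^2) * x[1]
    storage[2] = 200.0 * (x[2] - x[1]^2)
end
function h!(x, storage)
    storage[1, 1] = 2.0 - 400.0 * x[2] + 1200.0 * x[1]^2
    storage[1, 2] = -400.0 * x[1]
    storage[2, 1] = -400.0 * x[1]
    storage[2, 2] = 200.0
end

res = optimize(f, g!, h!, [0.0, 0.0], method = Newton())
```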

## References
Binary file added docs/src/algo/plap.png
Binary file not shown.
58 changes: 58 additions & 0 deletions docs/src/algo/precondition.md
@@ -0,0 +1,58 @@
# Preconditioning

The `GradientDescent`, `ConjugateGradient` and `LBFGS` methods support preconditioning. A preconditioner
can be thought of as a change of coordinates under which the Hessian is better conditioned. With a
good preconditioner substantially improved convergence is possible.

A preconditioner `P` can be of any type as long as the following two methods are
implemented:

* `A_ldiv_B!(pgr, P, gr)` : apply `P` to a vector `gr` and store in `pgr`
(intuitively, `pgr = P \ gr`)
* `dot(x, P, y)` : the inner product induced by `P`
(intuitively, `dot(x, P * y)`)

Precisely what these operations mean depends on how `P` is stored. Commonly, we store a matrix `P` which
approximates the Hessian in some vague sense. In this case,

* `A_ldiv_B!(pgr, P, gr) = copy!(pgr, P \ gr)`
* `dot(x, P, y) = dot(x, P * y)`

Finally, it is possible to update the preconditioner as the state variable `x`
changes. This is done through `precondprep!`, which is passed to the
optimizers as a keyword argument, e.g.,
```jl
method=ConjugateGradient(P = precond(100), precondprep! = precond(100))
```
though in this case it would always return the same matrix.
(See `fminbox.jl` for a more natural example.)

Apart from preconditioning with matrices, `Optim.jl` provides
a type `InverseDiagonal`, which represents a diagonal matrix by
its inverse elements.

## Example
Below, we see an example where a function is minimized without and with a preconditioner
applied.
```jl
using ForwardDiff
plap(U; n = length(U)) = (n-1)*sum((0.1 + diff(U).^2).^2 ) - sum(U) / (n-1)
plap1 = ForwardDiff.gradient(plap)
precond(n) = spdiagm((-ones(n-1), 2*ones(n), -ones(n-1)), (-1,0,1), n, n)*(n+1)
df = DifferentiableFunction(x -> plap([0; x; 0]),
                            (x, g) -> copy!(g, (plap1([0; x; 0]))[2:end-1]))
result = Optim.optimize(df, zeros(100), method = ConjugateGradient(P = nothing))
result = Optim.optimize(df, zeros(100), method = ConjugateGradient(P = precond(100)))
```
The former optimize call converges at a slower rate than the latter. Looking at a
plot of the 2D version of the function shows the problem.

![plap](./plap.png)

The contours are shaped like ellipsoids, but we would prefer them to be circles.
Using the preconditioner effectively changes the coordinates such that the contours
become less ellipsoid-like. Benchmarking shows that using preconditioning provides
an approximate speed-up factor of 15 in this 100-dimensional case.


## References
