search_index.json
{"config":{"lang":["en"],"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Optim.jl Univariate and multivariate optimization in Julia. Optim.jl is part of the JuliaNLSolvers family. Source PackageEvaluator Build Status Social References to cite What Optim is a Julia package for optimizing functions of various kinds. While there is some support for box constrained and Riemannian optimization, most of the solvers try to find an $x$ that minimizes a function $f(x)$ without any constraints. Thus, the main focus is on unconstrained optimization. The provided solvers, under certain conditions, will converge to a local minimum. In the case where a global minimum is desired, global optimization techniques should be employed instead (see e.g. BlackBoxOptim ). Why There are many solvers available from both free and commercial sources, and many of them are accessible from Julia. Few of them are written in Julia. Performance-wise this is rarely a problem, as they are often written in either Fortran or C. However, solvers written directly in Julia do come with some advantages. When writing Julia software (packages) that requires something to be optimized, the programmer can either choose to write their own optimization routine, or use one of the many available solvers. For example, this could be something from the NLOpt suite. This means adding a dependency which is not written in Julia, and more assumptions have to be made as to the environment the user is in. Does the user have the proper compilers? Is it possible to use GPL'ed code in the project? Optim is released under the MIT license, and installation is a simple Pkg.add , so it really doesn't get much freer, easier, or more lightweight than that. It is also true that using a solver written in C or Fortran makes it impossible to leverage one of the main benefits of Julia: multiple dispatch. 
Since Optim is entirely written in Julia, we can currently use the dispatch system to ease the use of custom preconditioners. A planned feature along these lines is to allow for user-controlled choice of solvers for various steps in the algorithm, entirely based on dispatch, and not predefined possibilities chosen by the developers of Optim. Being a Julia package also means that Optim has access to the automatic differentiation features through the packages in JuliaDiff . How Optim is registered in METADATA.jl . This means that all you need to do to install Optim is to run Pkg.add(\"Optim\")","title":"Home"},{"location":"#optimjl","text":"Univariate and multivariate optimization in Julia. Optim.jl is part of the JuliaNLSolvers family. Source PackageEvaluator Build Status Social References to cite","title":"Optim.jl"},{"location":"#what","text":"Optim is a Julia package for optimizing functions of various kinds. While there is some support for box constrained and Riemannian optimization, most of the solvers try to find an $x$ that minimizes a function $f(x)$ without any constraints. Thus, the main focus is on unconstrained optimization. The provided solvers, under certain conditions, will converge to a local minimum. In the case where a global minimum is desired, global optimization techniques should be employed instead (see e.g. BlackBoxOptim ).","title":"What"},{"location":"#why","text":"There are many solvers available from both free and commercial sources, and many of them are accessible from Julia. Few of them are written in Julia. Performance-wise this is rarely a problem, as they are often written in either Fortran or C. However, solvers written directly in Julia do come with some advantages. When writing Julia software (packages) that requires something to be optimized, the programmer can either choose to write their own optimization routine, or use one of the many available solvers. For example, this could be something from the NLOpt suite. 
This means adding a dependency which is not written in Julia, and more assumptions have to be made as to the environment the user is in. Does the user have the proper compilers? Is it possible to use GPL'ed code in the project? Optim is released under the MIT license, and installation is a simple Pkg.add , so it really doesn't get much freer, easier, or more lightweight than that. It is also true that using a solver written in C or Fortran makes it impossible to leverage one of the main benefits of Julia: multiple dispatch. Since Optim is entirely written in Julia, we can currently use the dispatch system to ease the use of custom preconditioners. A planned feature along these lines is to allow for user-controlled choice of solvers for various steps in the algorithm, entirely based on dispatch, and not predefined possibilities chosen by the developers of Optim. Being a Julia package also means that Optim has access to the automatic differentiation features through the packages in JuliaDiff .","title":"Why"},{"location":"#how","text":"Optim is registered in METADATA.jl . This means that all you need to do to install Optim is to run Pkg.add(\"Optim\")","title":"How"},{"location":"LICENSE/","text":"Optim.jl is licensed under the MIT License: Copyright (c) 2012: John Myles White, Tim Holy, and other contributors. Copyright (c) 2016: Patrick Kofod Mogensen, John Myles White, Tim Holy, and other contributors. Copyright (c) 2017: Patrick Kofod Mogensen, Asbj\u00f8rn Nilsen Riseth, John Myles White, Tim Holy, and other contributors. 
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.","title":"License"},{"location":"algo/","text":"Solvers","title":"Home"},{"location":"algo/#solvers","text":"","title":"Solvers"},{"location":"algo/brent/","text":"Brent's Method Constructor Description Example References R. P. Brent (2002) Algorithms for Minimization Without Derivatives. Dover edition.","title":"Brent"},{"location":"algo/brent/#brents-method","text":"","title":"Brent's Method"},{"location":"algo/brent/#constructor","text":"","title":"Constructor"},{"location":"algo/brent/#description","text":"","title":"Description"},{"location":"algo/brent/#example","text":"","title":"Example"},{"location":"algo/brent/#references","text":"R. P. Brent (2002) Algorithms for Minimization Without Derivatives. 
Dover edition.","title":"References"},{"location":"algo/cg/","text":"Conjugate Gradient Descent Constructor ConjugateGradient(; alphaguess = LineSearches.InitialHagerZhang(), linesearch = LineSearches.HagerZhang(), eta = 0.4, P = nothing, precondprep = (P, x) -> nothing) Description The ConjugateGradient method implements Hager and Zhang (2006) and elements from Hager and Zhang (2013). Note that the default linesearch is HagerZhang from LineSearches.jl. This line search is exactly the one proposed in Hager and Zhang (2006). The constant $eta$ is used in determining the next step direction, and the default here deviates from the one used in the original paper ($0.01$). It needs to be a strictly positive number. Example Let's optimize the 2D Rosenbrock function. The function and gradient are given by f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2 function g!(storage, x) storage[1] = -2.0 * (1.0 - x[1]) - 400.0 * (x[2] - x[1]^2) * x[1] storage[2] = 200.0 * (x[2] - x[1]^2) end we can then try to optimize this function from x=[0.0, 0.0] julia> optimize(f, g!, zeros(2), ConjugateGradient()) Results of Optimization Algorithm * Algorithm: Conjugate Gradient * Starting Point: [0.0,0.0] * Minimizer: [1.000000002262018,1.0000000045408348] * Minimum: 5.144946e-18 * Iterations: 21 * Convergence: true * |x - x'| \u2264 0.0e+00: false |x - x'| = 2.09e-10 * |f(x) - f(x')| \u2264 0.0e+00 |f(x)|: false |f(x) - f(x')| = 1.55e+00 |f(x)| * |g(x)| \u2264 1.0e-08: true |g(x)| = 3.36e-09 * stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 54 * Gradient Calls: 39 We can compare this to the default first order solver in Optim.jl julia> optimize(f, g!, zeros(2)) Results of Optimization Algorithm * Algorithm: L-BFGS * Starting Point: [0.0,0.0] * Minimizer: [0.9999999999373614,0.999999999868622] * Minimum: 7.645684e-21 * Iterations: 16 * Convergence: true * |x - x'| \u2264 0.0e+00: false |x - x'| = 3.48e-07 * |f(x) - f(x')| \u2264 
0.0e+00 |f(x)|: false |f(x) - f(x')| = 9.03e+06 |f(x)| * |g(x)| \u2264 1.0e-08: true |g(x)| = 2.32e-09 * stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 53 * Gradient Calls: 53 We see that for this objective and starting point, ConjugateGradient() requires fewer gradient evaluations to reach convergence. References W. W. Hager and H. Zhang (2006) Algorithm 851: CG_DESCENT, a conjugate gradient method with guaranteed descent. ACM Transactions on Mathematical Software 32: 113-137. W. W. Hager and H. Zhang (2013), The Limited Memory Conjugate Gradient Method. SIAM Journal on Optimization, 23, pp. 2150-2168.","title":"Conjugate Gradient"},{"location":"algo/cg/#conjugate-gradient-descent","text":"","title":"Conjugate Gradient Descent"},{"location":"algo/cg/#constructor","text":"ConjugateGradient(; alphaguess = LineSearches.InitialHagerZhang(), linesearch = LineSearches.HagerZhang(), eta = 0.4, P = nothing, precondprep = (P, x) -> nothing)","title":"Constructor"},{"location":"algo/cg/#description","text":"The ConjugateGradient method implements Hager and Zhang (2006) and elements from Hager and Zhang (2013). Note that the default linesearch is HagerZhang from LineSearches.jl. This line search is exactly the one proposed in Hager and Zhang (2006). The constant $eta$ is used in determining the next step direction, and the default here deviates from the one used in the original paper ($0.01$). It needs to be a strictly positive number.","title":"Description"},{"location":"algo/cg/#example","text":"Let's optimize the 2D Rosenbrock function. 
The function and gradient are given by f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2 function g!(storage, x) storage[1] = -2.0 * (1.0 - x[1]) - 400.0 * (x[2] - x[1]^2) * x[1] storage[2] = 200.0 * (x[2] - x[1]^2) end we can then try to optimize this function from x=[0.0, 0.0] julia> optimize(f, g!, zeros(2), ConjugateGradient()) Results of Optimization Algorithm * Algorithm: Conjugate Gradient * Starting Point: [0.0,0.0] * Minimizer: [1.000000002262018,1.0000000045408348] * Minimum: 5.144946e-18 * Iterations: 21 * Convergence: true * |x - x'| \u2264 0.0e+00: false |x - x'| = 2.09e-10 * |f(x) - f(x')| \u2264 0.0e+00 |f(x)|: false |f(x) - f(x')| = 1.55e+00 |f(x)| * |g(x)| \u2264 1.0e-08: true |g(x)| = 3.36e-09 * stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 54 * Gradient Calls: 39 We can compare this to the default first order solver in Optim.jl julia> optimize(f, g!, zeros(2)) Results of Optimization Algorithm * Algorithm: L-BFGS * Starting Point: [0.0,0.0] * Minimizer: [0.9999999999373614,0.999999999868622] * Minimum: 7.645684e-21 * Iterations: 16 * Convergence: true * |x - x'| \u2264 0.0e+00: false |x - x'| = 3.48e-07 * |f(x) - f(x')| \u2264 0.0e+00 |f(x)|: false |f(x) - f(x')| = 9.03e+06 |f(x)| * |g(x)| \u2264 1.0e-08: true |g(x)| = 2.32e-09 * stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 53 * Gradient Calls: 53 We see that for this objective and starting point, ConjugateGradient() requires fewer gradient evaluations to reach convergence.","title":"Example"},{"location":"algo/cg/#references","text":"W. W. Hager and H. Zhang (2006) Algorithm 851: CG_DESCENT, a conjugate gradient method with guaranteed descent. ACM Transactions on Mathematical Software 32: 113-137. W. W. Hager and H. Zhang (2013), The Limited Memory Conjugate Gradient Method. SIAM Journal on Optimization, 23, pp. 
2150-2168.","title":"References"},{"location":"algo/complex/","text":"Complex optimization Optimization of functions defined on complex inputs ($\\mathbb{C}^n \\to \\mathbb{R}$) is supported by simply passing a complex $x$ as input. The algorithms supported are all those which can naturally be extended to work with complex numbers: simulated annealing and all the first-order methods. The gradient of a complex-to-real function is defined as the only vector $g$ such that f(x+h) = f(x) + \\mbox{Re}(g' * h) + \\mathcal{O}(h^2). This is sometimes written g = \\frac{df}{d(z*)} = \\frac{df}{d(\\mbox{Re}(z))} + i \\frac{df}{d(\\mbox{Im(z)})}. The gradient of a $\\mathbb{C}^n \\to \\mathbb{R}$ function is a $\\mathbb{C}^n \\to \\mathbb{C}^n$ map. Even if it is differentiable when seen as a function of $\\mathbb{R}^{2n}$ to $\\mathbb{R}^{2n}$, it might not be complex-differentiable. For instance, take $f(z) = \\mbox{Re}(z)^2$. Then $g(z) = 2 \\mbox{Re}(z)$, which is not complex-differentiable (holomorphic). Therefore, the Hessian of a $\\mathbb{C}^n \\to \\mathbb{R}$ function is in general not well-defined as a $n \\times n$ complex matrix (only as a $2n \\times 2n$ real matrix), and therefore second-order optimization algorithms are not applicable directly. To use second-order optimization, convert to real variables. Examples We show how to minimize a quadratic plus quartic function with the LBFGS optimization algorithm. using Random Random.seed!(0) # Set the seed for reproducibility # \u03bc is the strength of the quartic. 
\u03bc = 0 is just a quadratic problem n = 4 A = randn(n,n) + im*randn(n,n) A = A'A + I b = randn(n) + im*randn(n) \u03bc = 1.0 fcomplex(x) = real(dot(x,A*x)/2 - dot(b,x)) + \u03bc*sum(abs.(x).^4) gcomplex(x) = A*x-b + 4\u03bc*(abs.(x).^2).*x gcomplex!(stor,x) = copyto!(stor,gcomplex(x)) x0 = randn(n)+im*randn(n) res = optimize(fcomplex, gcomplex!, x0, LBFGS()) The output of the optimization is Results of Optimization Algorithm * Algorithm: L-BFGS * Starting Point: [0.48155603952425174 - 1.477880724921868im,-0.3219431528959694 - 0.18542418173298963im, ...] * Minimizer: [0.14163543901272568 - 0.034929496785515886im,-0.1208600058040362 - 0.6125620908171383im, ...] * Minimum: -1.568997e+00 * Iterations: 16 * Convergence: true * |x - x'| \u2264 0.0e+00: false |x - x'| = 3.28e-09 * |f(x) - f(x')| \u2264 0.0e+00 |f(x)|: false |f(x) - f(x')| = -4.25e-16 |f(x)| * |g(x)| \u2264 1.0e-08: true |g(x)| = 6.33e-11 * Stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 48 * Gradient Calls: 48 Similarly, with ConjugateGradient . res = optimize(fcomplex, gcomplex!, x0, ConjugateGradient()) Results of Optimization Algorithm * Algorithm: Conjugate Gradient * Starting Point: [0.48155603952425174 - 1.477880724921868im,-0.3219431528959694 - 0.18542418173298963im, ...] * Minimizer: [0.1416354378490425 - 0.034929499492595516im,-0.12086000949769983 - 0.6125620892675705im, ...] * Minimum: -1.568997e+00 * Iterations: 23 * Convergence: false * |x - x'| \u2264 0.0e+00: false |x - x'| = 8.54e-10 * |f(x) - f(x')| \u2264 0.0e+00 |f(x)|: false |f(x) - f(x')| = -4.25e-16 |f(x)| * |g(x)| \u2264 1.0e-08: false |g(x)| = 3.72e-08 * Stopped by an increasing objective: true * Reached Maximum Number of Iterations: false * Objective Calls: 51 * Gradient Calls: 29 Differentiation The finite difference methods used by Optim support real functions with complex inputs. 
res = optimize(fcomplex, x0, LBFGS()) Results of Optimization Algorithm * Algorithm: L-BFGS * Starting Point: [0.48155603952425174 - 1.477880724921868im,-0.3219431528959694 - 0.18542418173298963im, ...] * Minimizer: [0.1416354390108624 - 0.034929496786122484im,-0.12086000580073922 - 0.6125620908025359im, ...] * Minimum: -1.568997e+00 * Iterations: 16 * Convergence: true * |x - x'| \u2264 0.0e+00: false |x - x'| = 3.28e-09 * |f(x) - f(x')| \u2264 0.0e+00 |f(x)|: true |f(x) - f(x')| = 0.00e+00 |f(x)| * |g(x)| \u2264 1.0e-08: true |g(x)| = 1.04e-10 * Stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 48 * Gradient Calls: 48 Automatic differentiation support for complex inputs may come when Cassette.jl is ready. References Sorber, L., Barel, M. V., Lathauwer, L. D. (2012). Unconstrained optimization of real functions in complex variables. SIAM Journal on Optimization, 22(3), 879-898. Kreutz-Delgado, K. (2009). The complex gradient operator and the CR-calculus. arXiv preprint arXiv:0906.4835.","title":"Complex optimization"},{"location":"algo/complex/#complex-optimization","text":"Optimization of functions defined on complex inputs ($\\mathbb{C}^n \\to \\mathbb{R}$) is supported by simply passing a complex $x$ as input. The algorithms supported are all those which can naturally be extended to work with complex numbers: simulated annealing and all the first-order methods. The gradient of a complex-to-real function is defined as the only vector $g$ such that f(x+h) = f(x) + \\mbox{Re}(g' * h) + \\mathcal{O}(h^2). This is sometimes written g = \\frac{df}{d(z*)} = \\frac{df}{d(\\mbox{Re}(z))} + i \\frac{df}{d(\\mbox{Im(z)})}. The gradient of a $\\mathbb{C}^n \\to \\mathbb{R}$ function is a $\\mathbb{C}^n \\to \\mathbb{C}^n$ map. Even if it is differentiable when seen as a function of $\\mathbb{R}^{2n}$ to $\\mathbb{R}^{2n}$, it might not be complex-differentiable. For instance, take $f(z) = \\mbox{Re}(z)^2$. 
Then $g(z) = 2 \\mbox{Re}(z)$, which is not complex-differentiable (holomorphic). Therefore, the Hessian of a $\\mathbb{C}^n \\to \\mathbb{R}$ function is in general not well-defined as a $n \\times n$ complex matrix (only as a $2n \\times 2n$ real matrix), and therefore second-order optimization algorithms are not applicable directly. To use second-order optimization, convert to real variables.","title":"Complex optimization"},{"location":"algo/complex/#examples","text":"We show how to minimize a quadratic plus quartic function with the LBFGS optimization algorithm. using Random Random.seed!(0) # Set the seed for reproducibility # \u03bc is the strength of the quartic. \u03bc = 0 is just a quadratic problem n = 4 A = randn(n,n) + im*randn(n,n) A = A'A + I b = randn(n) + im*randn(n) \u03bc = 1.0 fcomplex(x) = real(dot(x,A*x)/2 - dot(b,x)) + \u03bc*sum(abs.(x).^4) gcomplex(x) = A*x-b + 4\u03bc*(abs.(x).^2).*x gcomplex!(stor,x) = copyto!(stor,gcomplex(x)) x0 = randn(n)+im*randn(n) res = optimize(fcomplex, gcomplex!, x0, LBFGS()) The output of the optimization is Results of Optimization Algorithm * Algorithm: L-BFGS * Starting Point: [0.48155603952425174 - 1.477880724921868im,-0.3219431528959694 - 0.18542418173298963im, ...] * Minimizer: [0.14163543901272568 - 0.034929496785515886im,-0.1208600058040362 - 0.6125620908171383im, ...] * Minimum: -1.568997e+00 * Iterations: 16 * Convergence: true * |x - x'| \u2264 0.0e+00: false |x - x'| = 3.28e-09 * |f(x) - f(x')| \u2264 0.0e+00 |f(x)|: false |f(x) - f(x')| = -4.25e-16 |f(x)| * |g(x)| \u2264 1.0e-08: true |g(x)| = 6.33e-11 * Stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 48 * Gradient Calls: 48 Similarly, with ConjugateGradient . res = optimize(fcomplex, gcomplex!, x0, ConjugateGradient()) Results of Optimization Algorithm * Algorithm: Conjugate Gradient * Starting Point: [0.48155603952425174 - 1.477880724921868im,-0.3219431528959694 - 0.18542418173298963im, ...] 
* Minimizer: [0.1416354378490425 - 0.034929499492595516im,-0.12086000949769983 - 0.6125620892675705im, ...] * Minimum: -1.568997e+00 * Iterations: 23 * Convergence: false * |x - x'| \u2264 0.0e+00: false |x - x'| = 8.54e-10 * |f(x) - f(x')| \u2264 0.0e+00 |f(x)|: false |f(x) - f(x')| = -4.25e-16 |f(x)| * |g(x)| \u2264 1.0e-08: false |g(x)| = 3.72e-08 * Stopped by an increasing objective: true * Reached Maximum Number of Iterations: false * Objective Calls: 51 * Gradient Calls: 29","title":"Examples"},{"location":"algo/complex/#differentation","text":"The finite difference methods used by Optim support real functions with complex inputs. res = optimize(fcomplex, x0, LBFGS()) Results of Optimization Algorithm * Algorithm: L-BFGS * Starting Point: [0.48155603952425174 - 1.477880724921868im,-0.3219431528959694 - 0.18542418173298963im, ...] * Minimizer: [0.1416354390108624 - 0.034929496786122484im,-0.12086000580073922 - 0.6125620908025359im, ...] * Minimum: -1.568997e+00 * Iterations: 16 * Convergence: true * |x - x'| \u2264 0.0e+00: false |x - x'| = 3.28e-09 * |f(x) - f(x')| \u2264 0.0e+00 |f(x)|: true |f(x) - f(x')| = 0.00e+00 |f(x)| * |g(x)| \u2264 1.0e-08: true |g(x)| = 1.04e-10 * Stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 48 * Gradient Calls: 48 Automatic differentiation support for complex inputs may come when Cassette.jl is ready.","title":"Differentiation"},{"location":"algo/complex/#references","text":"Sorber, L., Barel, M. V., Lathauwer, L. D. (2012). Unconstrained optimization of real functions in complex variables. SIAM Journal on Optimization, 22(3), 879-898. Kreutz-Delgado, K. (2009). The complex gradient operator and the CR-calculus. 
arXiv preprint arXiv:0906.4835.","title":"References"},{"location":"algo/goldensection/","text":"Golden Section Constructor Description Example References","title":"Goldensection"},{"location":"algo/goldensection/#golden-section","text":"","title":"Golden Section"},{"location":"algo/goldensection/#constructor","text":"","title":"Constructor"},{"location":"algo/goldensection/#description","text":"","title":"Description"},{"location":"algo/goldensection/#example","text":"","title":"Example"},{"location":"algo/goldensection/#references","text":"","title":"References"},{"location":"algo/gradientdescent/","text":"Gradient Descent Constructor GradientDescent(; alphaguess = LineSearches.InitialPrevious(), linesearch = LineSearches.HagerZhang(), P = nothing, precondprep = (P, x) -> nothing) Description Gradient Descent is a common name for a quasi-Newton solver. This means that it takes steps according to x_{n+1} = x_n - P^{-1}\\nabla f(x_n) where $P$ is a positive definite matrix. If $P$ is the Hessian, we get Newton's method. In Gradient Descent, $P$ is simply an appropriately dimensioned identity matrix, such that we go in the exact opposite direction of the gradient. This means that we do not use the curvature information from the Hessian, or an approximation of it. While it does seem quite logical to go in the opposite direction of the fastest increase in objective value, the procedure can be very slow if the problem is ill-conditioned. See the section on preconditioners for ways to remedy this when using Gradient Descent. As with the other quasi-Newton solvers in this package, a scalar $\\alpha$ is introduced as follows x_{n+1} = x_n - \\alpha P^{-1}\\nabla f(x_n) and is chosen by a linesearch algorithm such that each step gives sufficient descent. 
Example References","title":"Gradient Descent"},{"location":"algo/gradientdescent/#gradient-descent","text":"","title":"Gradient Descent"},{"location":"algo/gradientdescent/#constructor","text":"GradientDescent(; alphaguess = LineSearches.InitialPrevious(), linesearch = LineSearches.HagerZhang(), P = nothing, precondprep = (P, x) -> nothing)","title":"Constructor"},{"location":"algo/gradientdescent/#description","text":"Gradient Descent is a common name for a quasi-Newton solver. This means that it takes steps according to x_{n+1} = x_n - P^{-1}\\nabla f(x_n) where $P$ is a positive definite matrix. If $P$ is the Hessian, we get Newton's method. In Gradient Descent, $P$ is simply an appropriately dimensioned identity matrix, such that we go in the exact opposite direction of the gradient. This means that we do not use the curvature information from the Hessian, or an approximation of it. While it does seem quite logical to go in the opposite direction of the fastest increase in objective value, the procedure can be very slow if the problem is ill-conditioned. See the section on preconditioners for ways to remedy this when using Gradient Descent. As with the other quasi-Newton solvers in this package, a scalar $\\alpha$ is introduced as follows x_{n+1} = x_n - \\alpha P^{-1}\\nabla f(x_n) and is chosen by a linesearch algorithm such that each step gives sufficient descent.","title":"Description"},{"location":"algo/gradientdescent/#example","text":"","title":"Example"},{"location":"algo/gradientdescent/#references","text":"","title":"References"},{"location":"algo/ipnewton/","text":"Interior point Newton method # Optim.IPNewton Type . Interior-point Newton Constructor IPNewton(; linesearch::Function = Optim.backtrack_constrained_grad, \u03bc0::Union{Symbol,Number} = :auto, show_linesearch::Bool = false) The initial barrier penalty coefficient \u03bc0 can be chosen as a number, or set to :auto to let the algorithm decide its value, see initialize_\u03bc_\u03bb! . 
Note : For constrained optimization problems, we recommend always enabling allow_f_increases and successive_f_tol in the options passed to optimize . The default is set to Optim.Options(allow_f_increases = true, successive_f_tol = 2) . As of February 2018, the line search algorithm is specialised for constrained interior-point methods. In future we hope to support more algorithms from LineSearches.jl . Description The IPNewton method implements an interior-point primal-dual Newton algorithm for solving nonlinear, constrained optimization problems. See Nocedal and Wright (Ch. 19, 2006) for a discussion of interior-point methods for constrained optimization. References The algorithm was originally written by Tim Holy (@timholy, tim.holy@gmail.com). J Nocedal, SJ Wright (2006), Numerical optimization, second edition. Springer. A W\u00e4chter, LT Biegler (2006), On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming 106 (1), 25-57. Examples Nonlinear constrained optimization in Optim","title":"Interior point Newton"},{"location":"algo/ipnewton/#interior-point-newton-method","text":"# Optim.IPNewton Type . Interior-point Newton Constructor IPNewton(; linesearch::Function = Optim.backtrack_constrained_grad, \u03bc0::Union{Symbol,Number} = :auto, show_linesearch::Bool = false) The initial barrier penalty coefficient \u03bc0 can be chosen as a number, or set to :auto to let the algorithm decide its value, see initialize_\u03bc_\u03bb! . Note : For constrained optimization problems, we recommend always enabling allow_f_increases and successive_f_tol in the options passed to optimize . The default is set to Optim.Options(allow_f_increases = true, successive_f_tol = 2) . As of February 2018, the line search algorithm is specialised for constrained interior-point methods. In future we hope to support more algorithms from LineSearches.jl . 
Description The IPNewton method implements an interior-point primal-dual Newton algorithm for solving nonlinear, constrained optimization problems. See Nocedal and Wright (Ch. 19, 2006) for a discussion of interior-point methods for constrained optimization. References The algorithm was originally written by Tim Holy (@timholy, tim.holy@gmail.com). J Nocedal, SJ Wright (2006), Numerical optimization, second edition. Springer. A W\u00e4chter, LT Biegler (2006), On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming 106 (1), 25-57.","title":"Interior point Newton method"},{"location":"algo/ipnewton/#examples","text":"Nonlinear constrained optimization in Optim","title":"Examples"},{"location":"algo/lbfgs/","text":"(L-)BFGS This page contains information about BFGS and its limited memory version L-BFGS. Constructors BFGS(; alphaguess = LineSearches.InitialStatic(), linesearch = LineSearches.HagerZhang(), initial_invH = nothing, initial_stepnorm = nothing, manifold = Flat()) initial_invH has a default value of nothing . If the user has a specific initial matrix they want to supply, it should be supplied as a function of an array similar to the initial point x0 . If initial_stepnorm is set to a number z , the initial matrix will be the identity matrix scaled by z times the sup-norm of the gradient at the initial point x0 . LBFGS(; m = 10, alphaguess = LineSearches.InitialStatic(), linesearch = LineSearches.HagerZhang(), P = nothing, precondprep = (P, x) -> nothing, manifold = Flat(), scaleinvH0::Bool = true && (typeof(P) <: Nothing)) Description This means that it takes steps according to x_{n+1} = x_n - P^{-1}\\nabla f(x_n) where $P$ is a positive definite matrix. If $P$ is the Hessian, we get Newton's method. In (L-)BFGS, the matrix is an approximation to the Hessian built using differences in the gradient across iterations. 
As long as the initial matrix is positive definite it is possible to show that all the following matrices will be as well. The starting matrix could simply be the identity matrix, such that the first step is identical to the Gradient Descent algorithm, or even the actual Hessian. There are two versions of BFGS in the package: BFGS, and L-BFGS. The latter is different from the former because it doesn't use a complete history of the iterative procedure to construct $P$, but rather only the latest $m$ steps. It doesn't actually build the Hessian approximation matrix either, but computes the direction directly. This makes it more suitable for large scale problems, where the memory required to store a full Hessian approximation would grow quickly. As with the other quasi-Newton solvers in this package, a scalar $\\alpha$ is introduced as follows x_{n+1} = x_n - \\alpha P^{-1}\\nabla f(x_n) and is chosen by a linesearch algorithm such that each step gives sufficient descent. Example References Wright, Stephen, and Jorge Nocedal (2006) \"Numerical optimization.\" Springer","title":"(L-)BFGS"},{"location":"algo/lbfgs/#l-bfgs","text":"This page contains information about BFGS and its limited memory version L-BFGS.","title":"(L-)BFGS"},{"location":"algo/lbfgs/#constructors","text":"BFGS(; alphaguess = LineSearches.InitialStatic(), linesearch = LineSearches.HagerZhang(), initial_invH = nothing, initial_stepnorm = nothing, manifold = Flat()) initial_invH has a default value of nothing . If the user has a specific initial matrix they want to supply, it should be supplied as a function of an array similar to the initial point x0 . If initial_stepnorm is set to a number z , the initial matrix will be the identity matrix scaled by z times the sup-norm of the gradient at the initial point x0 . 
LBFGS(; m = 10, alphaguess = LineSearches.InitialStatic(), linesearch = LineSearches.HagerZhang(), P = nothing, precondprep = (P, x) -> nothing, manifold = Flat(), scaleinvH0::Bool = true && (typeof(P) <: Nothing))","title":"Constructors"},{"location":"algo/lbfgs/#description","text":"This means that it takes steps according to x_{n+1} = x_n - P^{-1}\\nabla f(x_n) where $P$ is a positive definite matrix. If $P$ is the Hessian, we get Newton's method. In (L-)BFGS, the matrix is an approximation to the Hessian built using differences in the gradient across iterations. As long as the initial matrix is positive definite it is possible to show that all the following matrices will be as well. The starting matrix could simply be the identity matrix, such that the first step is identical to the Gradient Descent algorithm, or even the actual Hessian. There are two versions of BFGS in the package: BFGS, and L-BFGS. The latter is different from the former because it doesn't use a complete history of the iterative procedure to construct $P$, but rather only the latest $m$ steps. It doesn't actually build the Hessian approximation matrix either, but computes the direction directly. This makes it more suitable for large scale problems, where the memory required to store a full Hessian approximation would grow quickly. As with the other quasi-Newton solvers in this package, a scalar $\\alpha$ is introduced as follows x_{n+1} = x_n - \\alpha P^{-1}\\nabla f(x_n) and is chosen by a linesearch algorithm such that each step gives sufficient descent.","title":"Description"},{"location":"algo/lbfgs/#example","text":"","title":"Example"},{"location":"algo/lbfgs/#references","text":"Wright, Stephen, and Jorge Nocedal (2006) \"Numerical optimization.\" Springer","title":"References"},{"location":"algo/linesearch/","text":"Line search Description The line search functionality has been moved to LineSearches.jl . 
Line search is used to decide the step length along the direction computed by an optimization algorithm. The following Optim algorithms use line search: Accelerated Gradient Descent (L-)BFGS Conjugate Gradient Gradient Descent Momentum Gradient Descent Newton By default Optim calls the line search algorithm HagerZhang() provided by LineSearches . Different line search algorithms can be assigned with the linesearch keyword argument to the given algorithm. LineSearches also allows the user to decide how the initial step length for the line search algorithm is chosen. This is set with the alphaguess keyword argument for the Optim algorithm. The default procedure varies. Example This example compares two different line search algorithms on the Rosenbrock problem. First, run Newton with the default line search algorithm: using Optim, LineSearches prob = Optim.UnconstrainedProblems.examples[\"Rosenbrock\"] algo_hz = Newton(;alphaguess = LineSearches.InitialStatic(), linesearch = LineSearches.HagerZhang()) res_hz = Optim.optimize(prob.f, prob.g!, prob.h!, prob.initial_x, method=algo_hz) This gives the result * Algorithm: Newton's Method * Starting Point: [0.0,0.0] * Minimizer: [0.9999999999999994,0.9999999999999989] * Minimum: 3.081488e-31 * Iterations: 14 * Convergence: true * |x - x'| \u2264 0.0e+00: false |x - x'| = 3.06e-09 * |f(x) - f(x')| \u2264 0.0e+00 |f(x)|: false |f(x) - f(x')| = 2.94e+13 |f(x)| * |g(x)| \u2264 1.0e-08: true |g(x)| = 1.11e-15 * stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 44 * Gradient Calls: 44 * Hessian Calls: 14 Now we can try Newton with the More-Thuente line search: algo_mt = Newton(;alphaguess = LineSearches.InitialStatic(), linesearch = LineSearches.MoreThuente()) res_mt = Optim.optimize(prob.f, prob.g!, prob.h!, prob.initial_x, method=algo_mt) This gives the following result, reducing the number of function and gradient calls: Results of Optimization Algorithm * Algorithm: 
Newton's Method * Starting Point: [0.0,0.0] * Minimizer: [0.9999999999999992,0.999999999999998] * Minimum: 2.032549e-29 * Iterations: 14 * Convergence: true * |x - x'| \u2264 0.0e+00: false |x - x'| = 3.67e-08 * |f(x) - f(x')| \u2264 0.0e00 |f(x)|: false |f(x) - f(x')| = 1.66e+13 |f(x)| * |g(x)| \u2264 1.0e-08: true |g(x)| = 1.76e-13 * stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 17 * Gradient Calls: 17 * Hessian Calls: 14 References","title":"Linesearch"},{"location":"algo/linesearch/#line-search","text":"","title":"Line search"},{"location":"algo/linesearch/#description","text":"The line search functionality has been moved to LineSearches.jl . Line search is used to decide the step length along the direction computed by an optimization algorithm. The following Optim algorithms use line search: Accelerated Gradient Descent (L-)BFGS Conjugate Gradient Gradient Descent Momentum Gradient Descent Newton By default Optim calls the line search algorithm HagerZhang() provided by LineSearches . Different line search algorithms can be assigned with the linesearch keyword argument to the given algorithm. LineSearches also allows the user to decide how the initial step length for the line search algorithm is chosen. This is set with the alphaguess keyword argument for the Optim algorithm. The default procedure varies.","title":"Description"},{"location":"algo/linesearch/#example","text":"This example compares two different line search algorithms on the Rosenbrock problem. 
First, run Newton with the default line search algorithm: using Optim, LineSearches prob = Optim.UnconstrainedProblems.examples[\"Rosenbrock\"] algo_hz = Newton(;alphaguess = LineSearches.InitialStatic(), linesearch = LineSearches.HagerZhang()) res_hz = Optim.optimize(prob.f, prob.g!, prob.h!, prob.initial_x, method=algo_hz) This gives the result * Algorithm: Newton's Method * Starting Point: [0.0,0.0] * Minimizer: [0.9999999999999994,0.9999999999999989] * Minimum: 3.081488e-31 * Iterations: 14 * Convergence: true * |x - x'| \u2264 0.0e+00: false |x - x'| = 3.06e-09 * |f(x) - f(x')| \u2264 0.0e+00 |f(x)|: false |f(x) - f(x')| = 2.94e+13 |f(x)| * |g(x)| \u2264 1.0e-08: true |g(x)| = 1.11e-15 * stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 44 * Gradient Calls: 44 * Hessian Calls: 14 Now we can try Newton with the More-Thuente line search: algo_mt = Newton(;alphaguess = LineSearches.InitialStatic(), linesearch = LineSearches.MoreThuente()) res_mt = Optim.optimize(prob.f, prob.g!, prob.h!, prob.initial_x, method=algo_mt) This gives the following result, reducing the number of function and gradient calls: Results of Optimization Algorithm * Algorithm: Newton's Method * Starting Point: [0.0,0.0] * Minimizer: [0.9999999999999992,0.999999999999998] * Minimum: 2.032549e-29 * Iterations: 14 * Convergence: true * |x - x'| \u2264 0.0e+00: false |x - x'| = 3.67e-08 * |f(x) - f(x')| \u2264 0.0e00 |f(x)|: false |f(x) - f(x')| = 1.66e+13 |f(x)| * |g(x)| \u2264 1.0e-08: true |g(x)| = 1.76e-13 * stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 17 * Gradient Calls: 17 * Hessian Calls: 14","title":"Example"},{"location":"algo/linesearch/#references","text":"","title":"References"},{"location":"algo/manifolds/","text":"Manifold optimization Optim.jl supports the minimization of functions defined on Riemannian manifolds, i.e. 
with simple constraints such as normalization and orthogonality. The basic idea of such algorithms is to project back (\"retract\") each iterate of an unconstrained minimization method onto the manifold. This is used by passing a manifold keyword argument to the optimizer. Howto Here is a simple test case where we minimize the Rayleigh quotient <x, A x> of a symmetric matrix A under the constraint ||x|| = 1 , finding an eigenvector associated with the lowest eigenvalue of A . n = 10 A = Diagonal(range(1, stop=2, length=n)) f(x) = dot(x,A*x)/2 g(x) = A*x g!(stor,x) = copyto!(stor,g(x)) x0 = randn(n) manif = Optim.Sphere() Optim.optimize(f, g!, x0, Optim.ConjugateGradient(manifold=manif)) Supported solvers and manifolds All first-order optimization methods are supported. The following manifolds are currently supported: Flat: Euclidean space, default. Standard unconstrained optimization. Sphere: spherical constraint ||x|| = 1 Stiefel: Stiefel manifold of N by n matrices with orthogonal columns, i.e. X'*X = I The following meta-manifolds construct manifolds out of pre-existing ones: PowerManifold: identical copies of a specified manifold ProductManifold: product of two (potentially different) manifolds See test/multivariate/manifolds.jl for usage examples. Implementing new manifolds is as simple as adding methods project_tangent!(M::YourManifold,x) and retract!(M::YourManifold,g,x) . If you implement another manifold or optimization method, please contribute a PR! References The Geometry of Algorithms with Orthogonality Constraints, Alan Edelman, Tom\u00e1s A. Arias, Steven T. Smith, SIAM. J. Matrix Anal. Appl., 20(2), 303\u2013353 Optimization Algorithms on Matrix Manifolds, P.-A. Absil, R. Mahony, R. Sepulchre, Princeton University Press, 2008","title":"Manifolds"},{"location":"algo/manifolds/#manifold-optimization","text":"Optim.jl supports the minimization of functions defined on Riemannian manifolds, i.e. 
with simple constraints such as normalization and orthogonality. The basic idea of such algorithms is to project back (\"retract\") each iterate of an unconstrained minimization method onto the manifold. This is used by passing a manifold keyword argument to the optimizer.","title":"Manifold optimization"},{"location":"algo/manifolds/#howto","text":"Here is a simple test case where we minimize the Rayleigh quotient <x, A x> of a symmetric matrix A under the constraint ||x|| = 1 , finding an eigenvector associated with the lowest eigenvalue of A . n = 10 A = Diagonal(range(1, stop=2, length=n)) f(x) = dot(x,A*x)/2 g(x) = A*x g!(stor,x) = copyto!(stor,g(x)) x0 = randn(n) manif = Optim.Sphere() Optim.optimize(f, g!, x0, Optim.ConjugateGradient(manifold=manif))","title":"Howto"},{"location":"algo/manifolds/#supported-solvers-and-manifolds","text":"All first-order optimization methods are supported. The following manifolds are currently supported: Flat: Euclidean space, default. Standard unconstrained optimization. Sphere: spherical constraint ||x|| = 1 Stiefel: Stiefel manifold of N by n matrices with orthogonal columns, i.e. X'*X = I The following meta-manifolds construct manifolds out of pre-existing ones: PowerManifold: identical copies of a specified manifold ProductManifold: product of two (potentially different) manifolds See test/multivariate/manifolds.jl for usage examples. Implementing new manifolds is as simple as adding methods project_tangent!(M::YourManifold,x) and retract!(M::YourManifold,g,x) . If you implement another manifold or optimization method, please contribute a PR!","title":"Supported solvers and manifolds"},{"location":"algo/manifolds/#references","text":"The Geometry of Algorithms with Orthogonality Constraints, Alan Edelman, Tom\u00e1s A. Arias, Steven T. Smith, SIAM. J. Matrix Anal. Appl., 20(2), 303\u2013353 Optimization Algorithms on Matrix Manifolds, P.-A. Absil, R. Mahony, R. 
Sepulchre, Princeton University Press, 2008","title":"References"},{"location":"algo/nelder_mead/","text":"Nelder-Mead Nelder-Mead is currently the standard algorithm when no derivatives are provided. Constructor NelderMead(; parameters = AdaptiveParameters(), initial_simplex = AffineSimplexer()) The keywords in the constructor are used to control the following parts of the solver: parameters is an instance of either AdaptiveParameters or FixedParameters , and is used to generate parameters for the Nelder-Mead Algorithm. initial_simplex is an instance of AffineSimplexer . See more details below. Description Our current implementation of the Nelder-Mead algorithm is based on Nelder and Mead (1965) and Gao and Han (2010). Gradient free methods can be a bit sensitive to starting values and tuning parameters, so it is a good idea to be careful with the defaults provided in Optim. Instead of using gradient information, Nelder-Mead is a direct search method. It keeps track of the function value at a number of points in the search space. Together, the points form a simplex. Given a simplex, we can perform one of four actions: reflect, expand, contract, or shrink. Basically, the goal is to iteratively replace the worst point with a better point. More information can be found in Nelder and Mead (1965), Lagarias, et al (1998) or Gao and Han (2010). The stopping rule is the same as in the original paper, and is the standard error of the function values at the vertices. To set the tolerance level for this convergence criterion, set the g_tol level as described in the Configurable Options section. When the solver finishes, we return a minimizer which is either the centroid or one of the vertices. The function value at the centroid adds a function evaluation, as we need to evaluate the objective at the centroid to choose the smallest function value. 
However, even if the function value at the centroid can be returned as the minimum, we do not trace it during the optimization iterations. This is to avoid too many evaluations of the objective function which can be computationally expensive. Typically, there should be no more than twice as many f_calls as iterations . Adding an evaluation at the centroid when tracing could considerably increase the total run-time of the algorithm. Specifying the initial simplex The default choice of initial_simplex is AffineSimplexer() . A simplex is represented by an $(n+1)$-dimensional vector of $n$-dimensional vectors. It is used together with the initial x to create the initial simplex. To construct the $i$th vertex, it simply multiplies entry $i$ in the initial vector with a constant b , and adds a constant a . This means that the $i$th of the $n$ additional vertices is of the form (x_0^1, x_0^2, \\ldots, x_0^i, \\ldots, 0,0) + (0, 0, \\ldots, x_0^i\\cdot b+a,\\ldots, 0,0) If an $x_0^i$ is zero, we need the $a$ to make sure all vertices are unique. Generally, it is advised to start with a relatively large simplex. If a specific simplex is wanted, it is possible to construct the $(n+1)$-vector of $n$-dimensional vectors, and pass it to the solver using a new type definition and a new method for the function simplexer . For example, let us minimize the two-dimensional Rosenbrock function, and choose three vertices that have elements that are simply standard uniform draws. using Optim struct MySimplexer <: Optim.Simplexer end Optim.simplexer(S::MySimplexer, initial_x) = [rand(length(initial_x)) for i = 1:length(initial_x)+1] f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2 optimize(f, [.0, .0], NelderMead(initial_simplex = MySimplexer())) Say we want to implement the initial simplex as in Matlab's fminsearch . This is very close to the AffineSimplexer above, but with a small twist. Instead of always adding the a , a constant is only added to entries that are zero. 
If the entry is non-zero, five percent of the level is added. This might be implemented (by the user) as struct MatlabSimplexer{T} <: Optim.Simplexer a::T b::T end MatlabSimplexer(;a = 0.00025, b = 0.05) = MatlabSimplexer(a, b) function Optim.simplexer(S::MatlabSimplexer, initial_x::AbstractArray{T, N}) where {T, N} n = length(initial_x) initial_simplex = Array{T, N}[copy(initial_x) for i = 1:n+1] for j = 1:n initial_simplex[j+1][j] += initial_simplex[j+1][j] == zero(T) ? S.a : S.b * initial_simplex[j+1][j] end initial_simplex end The parameters of Nelder-Mead The different types of steps in the algorithm are governed by four parameters: $\\alpha$ for the reflection, $\\beta$ for the expansion, $\\gamma$ for the contraction, and $\\delta$ for the shrink step. We default to the adaptive parameters scheme in Gao and Han (2010). These are based on the dimensionality of the problem, and are given by \\alpha = 1, \\quad \\beta = 1+2/n,\\quad \\gamma = 0.75 - 1/(2n),\\quad \\delta = 1-1/n It is also possible to specify the original parameters from Nelder and Mead (1965) \\alpha = 1,\\quad \\beta = 2, \\quad\\gamma = 1/2, \\quad\\delta = 1/2 by specifying parameters = Optim.FixedParameters() . For specifying custom values, parameters = Optim.FixedParameters(\u03b1 = a, \u03b2 = b, \u03b3 = g, \u03b4 = d) is used, where a, b, g, d are the chosen values. If another parameter specification is wanted, it is possible to create a custom sub-type of Optim.NMParameters , and add a method to the parameters function. It should take the new type as the first positional argument, and the dimensionality of x as the second positional argument, and return a 4-tuple of parameters. However, it will often be easier to simply supply the wanted parameters to FixedParameters . References Nelder, John A. and R. Mead (1965). \"A simplex method for function minimization\". Computer Journal 7: 308\u2013313. doi:10.1093/comjnl/7.4.308. Lagarias, Jeffrey C., et al. 
\"Convergence properties of the Nelder\u2013Mead simplex method in low dimensions.\" SIAM Journal on optimization 9.1 (1998): 112-147. Gao, Fuchang and Lixing Han (2010). \"Implementing the Nelder-Mead simplex algorithm with adaptive parameters\". Computational Optimization and Applications [DOI 10.1007/s10589-010-9329-3]","title":"Nelder Mead"},{"location":"algo/nelder_mead/#nelder-mead","text":"Nelder-Mead is currently the standard algorithm when no derivatives are provided.","title":"Nelder-Mead"},{"location":"algo/nelder_mead/#constructor","text":"NelderMead(; parameters = AdaptiveParameters(), initial_simplex = AffineSimplexer()) The keywords in the constructor are used to control the following parts of the solver: parameters is an instance of either AdaptiveParameters or FixedParameters , and is used to generate parameters for the Nelder-Mead Algorithm. initial_simplex is an instance of AffineSimplexer . See more details below.","title":"Constructor"},{"location":"algo/nelder_mead/#description","text":"Our current implementation of the Nelder-Mead algorithm is based on Nelder and Mead (1965) and Gao and Han (2010). Gradient free methods can be a bit sensitive to starting values and tuning parameters, so it is a good idea to be careful with the defaults provided in Optim. Instead of using gradient information, Nelder-Mead is a direct search method. It keeps track of the function value at a number of points in the search space. Together, the points form a simplex. Given a simplex, we can perform one of four actions: reflect, expand, contract, or shrink. Basically, the goal is to iteratively replace the worst point with a better point. More information can be found in Nelder and Mead (1965), Lagarias, et al (1998) or Gao and Han (2010). The stopping rule is the same as in the original paper, and is the standard error of the function values at the vertices. 
To set the tolerance level for this convergence criterion, set the g_tol level as described in the Configurable Options section. When the solver finishes, we return a minimizer which is either the centroid or one of the vertices. The function value at the centroid adds a function evaluation, as we need to evaluate the objective at the centroid to choose the smallest function value. However, even if the function value at the centroid can be returned as the minimum, we do not trace it during the optimization iterations. This is to avoid too many evaluations of the objective function which can be computationally expensive. Typically, there should be no more than twice as many f_calls as iterations . Adding an evaluation at the centroid when tracing could considerably increase the total run-time of the algorithm.","title":"Description"},{"location":"algo/nelder_mead/#specifying-the-initial-simplex","text":"The default choice of initial_simplex is AffineSimplexer() . A simplex is represented by an $(n+1)$-dimensional vector of $n$-dimensional vectors. It is used together with the initial x to create the initial simplex. To construct the $i$th vertex, it simply multiplies entry $i$ in the initial vector with a constant b , and adds a constant a . This means that the $i$th of the $n$ additional vertices is of the form (x_0^1, x_0^2, \\ldots, x_0^i, \\ldots, 0,0) + (0, 0, \\ldots, x_0^i\\cdot b+a,\\ldots, 0,0) If an $x_0^i$ is zero, we need the $a$ to make sure all vertices are unique. Generally, it is advised to start with a relatively large simplex. If a specific simplex is wanted, it is possible to construct the $(n+1)$-vector of $n$-dimensional vectors, and pass it to the solver using a new type definition and a new method for the function simplexer . For example, let us minimize the two-dimensional Rosenbrock function, and choose three vertices that have elements that are simply standard uniform draws. 
using Optim struct MySimplexer <: Optim.Simplexer end Optim.simplexer(S::MySimplexer, initial_x) = [rand(length(initial_x)) for i = 1:length(initial_x)+1] f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2 optimize(f, [.0, .0], NelderMead(initial_simplex = MySimplexer())) Say we want to implement the initial simplex as in Matlab's fminsearch . This is very close to the AffineSimplexer above, but with a small twist. Instead of always adding the a , a constant is only added to entries that are zero. If the entry is non-zero, five percent of the level is added. This might be implemented (by the user) as struct MatlabSimplexer{T} <: Optim.Simplexer a::T b::T end MatlabSimplexer(;a = 0.00025, b = 0.05) = MatlabSimplexer(a, b) function Optim.simplexer(S::MatlabSimplexer, initial_x::AbstractArray{T, N}) where {T, N} n = length(initial_x) initial_simplex = Array{T, N}[copy(initial_x) for i = 1:n+1] for j = 1:n initial_simplex[j+1][j] += initial_simplex[j+1][j] == zero(T) ? S.a : S.b * initial_simplex[j+1][j] end initial_simplex end","title":"Specifying the initial simplex"},{"location":"algo/nelder_mead/#the-parameters-of-nelder-mead","text":"The different types of steps in the algorithm are governed by four parameters: $\\alpha$ for the reflection, $\\beta$ for the expansion, $\\gamma$ for the contraction, and $\\delta$ for the shrink step. We default to the adaptive parameters scheme in Gao and Han (2010). These are based on the dimensionality of the problem, and are given by \\alpha = 1, \\quad \\beta = 1+2/n,\\quad \\gamma = 0.75 - 1/(2n),\\quad \\delta = 1-1/n It is also possible to specify the original parameters from Nelder and Mead (1965) \\alpha = 1,\\quad \\beta = 2, \\quad\\gamma = 1/2, \\quad\\delta = 1/2 by specifying parameters = Optim.FixedParameters() . For specifying custom values, parameters = Optim.FixedParameters(\u03b1 = a, \u03b2 = b, \u03b3 = g, \u03b4 = d) is used, where a, b, g, d are the chosen values. 
If another parameter specification is wanted, it is possible to create a custom sub-type of Optim.NMParameters , and add a method to the parameters function. It should take the new type as the first positional argument, and the dimensionality of x as the second positional argument, and return a 4-tuple of parameters. However, it will often be easier to simply supply the wanted parameters to FixedParameters .","title":"The parameters of Nelder-Mead"},{"location":"algo/nelder_mead/#references","text":"Nelder, John A. and R. Mead (1965). \"A simplex method for function minimization\". Computer Journal 7: 308\u2013313. doi:10.1093/comjnl/7.4.308. Lagarias, Jeffrey C., et al. \"Convergence properties of the Nelder\u2013Mead simplex method in low dimensions.\" SIAM Journal on optimization 9.1 (1998): 112-147. Gao, Fuchang and Lixing Han (2010). \"Implementing the Nelder-Mead simplex algorithm with adaptive parameters\". Computational Optimization and Applications [DOI 10.1007/s10589-010-9329-3]","title":"References"},{"location":"algo/newton/","text":"Newton's Method Constructor Newton(; alphaguess = LineSearches.InitialStatic(), linesearch = LineSearches.HagerZhang()) The constructor takes two keywords: linesearch = a(d, x, p, x_new, g_new, phi0, dphi0, c) , a function performing line search, see the line search section. alphaguess = a(state, dphi0, d) , a function for setting the initial guess for the line search algorithm, see the line search section. Description Newton's method for optimization has a long history, and is in some sense the gold standard in unconstrained optimization of smooth functions, at least from a theoretical viewpoint. The main benefit is that it has a quadratic rate of convergence near a local optimum. The main disadvantage is that the user has to provide a Hessian. This can be difficult, complicated, or simply annoying. It can also be computationally expensive to calculate it. 
Newton's method for optimization consists of applying Newton's method for solving systems of equations, where the equations are the first order conditions, saying that the gradient should equal the zero vector. \\nabla f(x) = 0 A second order Taylor expansion of the left-hand side leads to the iterative scheme x_{n+1} = x_n - H(x_n)^{-1}\\nabla f(x_n) where the inverse is not calculated directly, but the step $\\textbf{s}$ is instead calculated by solving H(x_n) \\textbf{s} = \\nabla f(x_n). This is equivalent to minimizing a quadratic model, $m_k$ around the current $x_n$ m_k(s) = f(x_n) + \\nabla f(x_n)^\\top \\textbf{s} + \\frac{1}{2} \\textbf{s}^\\top H(x_n) \\textbf{s} For functions where $H(x_n)$ is difficult, or computationally expensive to obtain, we might replace the Hessian with another positive definite matrix that approximates it. Such methods are called Quasi-Newton methods; see (L-)BFGS and Gradient Descent. In a sufficiently small neighborhood around the minimizer, Newton's method has quadratic convergence, but globally it might have slower convergence, or it might even diverge. To ensure convergence, a line search is performed for each $\\textbf{s}$. This amounts to replacing the step formula above with x_{n+1} = x_n - \\alpha \\textbf{s} and finding a scalar $\\alpha$ such that we get sufficient descent; see the line search section for more information. Additionally, if the function is locally concave, the step taken in the formulas above will go in a direction of ascent, as the Hessian will not be positive (semi)definite. To avoid this, we use a specialized method to calculate the step direction. If the Hessian is positive semidefinite then the method used is standard, but if it is not, a correction is made using the functionality in PositiveFactorizations.jl . 
Example show the example from the issue References","title":"Newton"},{"location":"algo/newton/#newtons-method","text":"","title":"Newton's Method"},{"location":"algo/newton/#constructor","text":"Newton(; alphaguess = LineSearches.InitialStatic(), linesearch = LineSearches.HagerZhang()) The constructor takes two keywords: linesearch = a(d, x, p, x_new, g_new, phi0, dphi0, c) , a function performing line search, see the line search section. alphaguess = a(state, dphi0, d) , a function for setting the initial guess for the line search algorithm, see the line search section.","title":"Constructor"},{"location":"algo/newton/#description","text":"Newton's method for optimization has a long history, and is in some sense the gold standard in unconstrained optimization of smooth functions, at least from a theoretical viewpoint. The main benefit is that it has a quadratic rate of convergence near a local optimum. The main disadvantage is that the user has to provide a Hessian. This can be difficult, complicated, or simply annoying. It can also be computationally expensive to calculate it. Newton's method for optimization consists of applying Newton's method for solving systems of equations, where the equations are the first order conditions, saying that the gradient should equal the zero vector. \\nabla f(x) = 0 A second order Taylor expansion of the left-hand side leads to the iterative scheme x_{n+1} = x_n - H(x_n)^{-1}\\nabla f(x_n) where the inverse is not calculated directly, but the step $\\textbf{s}$ is instead calculated by solving H(x_n) \\textbf{s} = \\nabla f(x_n). This is equivalent to minimizing a quadratic model, $m_k$ around the current $x_n$ m_k(s) = f(x_n) + \\nabla f(x_n)^\\top \\textbf{s} + \\frac{1}{2} \\textbf{s}^\\top H(x_n) \\textbf{s} For functions where $H(x_n)$ is difficult, or computationally expensive to obtain, we might replace the Hessian with another positive definite matrix that approximates it. 
Such methods are called Quasi-Newton methods; see (L-)BFGS and Gradient Descent. In a sufficiently small neighborhood around the minimizer, Newton's method has quadratic convergence, but globally it might have slower convergence, or it might even diverge. To ensure convergence, a line search is performed for each $\\textbf{s}$. This amounts to replacing the step formula above with x_{n+1} = x_n - \\alpha \\textbf{s} and finding a scalar $\\alpha$ such that we get sufficient descent; see the line search section for more information. Additionally, if the function is locally concave, the step taken in the formulas above will go in a direction of ascent, as the Hessian will not be positive (semi)definite. To avoid this, we use a specialized method to calculate the step direction. If the Hessian is positive semidefinite then the method used is standard, but if it is not, a correction is made using the functionality in PositiveFactorizations.jl .","title":"Description"},{"location":"algo/newton/#example","text":"show the example from the issue","title":"Example"},{"location":"algo/newton/#references","text":"","title":"References"},{"location":"algo/newton_trust_region/","text":"Newton's Method With a Trust Region Constructor NewtonTrustRegion(; initial_delta = 1.0, delta_hat = 100.0, eta = 0.1, rho_lower = 0.25, rho_upper = 0.75) The constructor takes keywords that determine the initial and maximal size of the trust region, when to grow and shrink the region, and how close the function should be to the quadratic approximation. The notation follows chapter four of Numerical Optimization. Below, rho $=\\rho$ refers to the ratio of the actual function change to the change in the quadratic approximation for a given step. initial_delta: The starting trust region radius delta_hat: The largest allowable trust region radius eta: When rho is at least eta , accept the step. rho_lower: When rho is less than rho_lower , shrink the trust region. 
rho_upper: When rho is greater than rho_upper , grow the trust region (though no greater than delta_hat ). Description Newton's method with a trust region is designed to take advantage of the second-order information in a function's Hessian, but with more stability than Newton's method when functions are not globally well-approximated by a quadratic. This is achieved by repeatedly minimizing quadratic approximations within a dynamically-sized \"trust region\" in which the function is assumed to be locally quadratic [1]. Newton's method optimizes a quadratic approximation to a function. When a function is well approximated by a quadratic (for example, near an optimum), Newton's method converges very quickly by exploiting the second-order information in the Hessian matrix. However, when the function is not well-approximated by a quadratic, either because the starting point is far from the optimum or the function has a more irregular shape, Newton steps can be erratically large, leading to distant, irrelevant areas of the space. Trust region methods use second-order information but restrict the steps to be within a \"trust region\" where the function is believed to be approximately quadratic. At iteration $k$, a trust region method chooses a step $p$ to minimize a quadratic approximation to the objective such that the step size is no larger than a given trust region size, $\\Delta_k$. \\underset{p\\in\\mathbb{R}^n}\\min m_k(p) = f_k + g_k^T p + \\frac{1}{2}p^T B_k p \\quad\\textrm{such that } ||p||\\le \\Delta_k Here, $p$ is the step to take at iteration $k$, so that $x_{k+1} = x_k + p$. In the definition of $m_k(p)$, $f_k = f(x_k)$ is the value at the previous location, $g_k=\\nabla f(x_k)$ is the gradient at the previous location, $B_k = \\nabla^2 f(x_k)$ is the Hessian matrix at the previous iterate, and $||\\cdot||$ is the Euclidean norm. 
If the trust region size, $\\Delta_k$, is large enough that the minimizer of the quadratic approximation $m_k(p)$ has $||p|| \\le \\Delta_k$, then the step is the same as an ordinary Newton step. However, if the unconstrained quadratic minimizer lies outside the trust region, then the minimizer to the constrained problem will occur on the boundary, i.e. we will have $||p|| = \\Delta_k$. It turns out that when the Cholesky decomposition of $B_k$ can be computed, the optimal $p$ can be found numerically with relative ease. ([1], section 4.3) This is the method currently used in Optim. It makes sense to adapt the trust region size, $\\Delta_k$, as one moves through the space and assesses the quality of the quadratic fit. This adaptation is controlled by the parameters $\\eta$, $\\rho_{lower}$, and $\\rho_{upper}$, which are parameters to the NewtonTrustRegion optimization method. For each step, we calculate \\rho_k := \\frac{f(x_{k+1}) - f(x_k)}{m_k(p) - m_k(0)} Intuitively, $\\rho_k$ measures the quality of the quadratic approximation: if $\\rho_k \\approx 1$, then our quadratic approximation is reasonable. If $p$ was on the boundary and $\\rho_k > \\rho_{upper}$, then perhaps we can benefit from larger steps. In this case, for the next iteration we grow the trust region geometrically up to a maximum of $\\hat\\Delta$: \\rho_k > \\rho_{upper} \\Rightarrow \\Delta_{k+1} = \\min(2 \\Delta_k, \\hat\\Delta). Conversely, if $\\rho_k < \\rho_{lower}$, then we shrink the trust region geometrically: $\\rho_k < \\rho_{lower} \\Rightarrow \\Delta_{k+1} = 0.25 \\Delta_k$. Finally, we only accept a point if its decrease is appreciable compared to the quadratic approximation. Specifically, a step is only accepted if $\\rho_k > \\eta$. As long as we choose $\\eta$ to be less than $\\rho_{lower}$, we will shrink the trust region whenever we reject a step. 
Eventually, if the objective function is locally quadratic, $\\Delta_k$ will become small enough that a quadratic approximation will be accurate enough to make progress again. Example using Optim, OptimTestProblems prob = OptimTestProblems.UnconstrainedProblems.examples[\"Rosenbrock\"]; res = Optim.optimize(prob.f, prob.g!, prob.h!, prob.initial_x, NewtonTrustRegion()) References [1] Nocedal, Jorge, and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.","title":"Newton with Trust Region"},{"location":"algo/newton_trust_region/#newtons-method-with-a-trust-region","text":"","title":"Newton's Method With a Trust Region"},{"location":"algo/newton_trust_region/#constructor","text":"NewtonTrustRegion(; initial_delta = 1.0, delta_hat = 100.0, eta = 0.1, rho_lower = 0.25, rho_upper = 0.75) The constructor takes keywords that determine the initial and maximal size of the trust region, when to grow and shrink the region, and how close the function should be to the quadratic approximation. The notation follows chapter four of Numerical Optimization. Below, rho $=\\rho$ refers to the ratio of the actual function change to the change in the quadratic approximation for a given step. initial_delta: The starting trust region radius delta_hat: The largest allowable trust region radius eta: When rho is at least eta , accept the step. rho_lower: When rho is less than rho_lower , shrink the trust region. rho_upper: When rho is greater than rho_upper , grow the trust region (though no greater than delta_hat ).","title":"Constructor"},{"location":"algo/newton_trust_region/#description","text":"Newton's method with a trust region is designed to take advantage of the second-order information in a function's Hessian, but with more stability than Newton's method when functions are not globally well-approximated by a quadratic. 
This is achieved by repeatedly minimizing quadratic approximations within a dynamically-sized \"trust region\" in which the function is assumed to be locally quadratic [1]. Newton's method optimizes a quadratic approximation to a function. When a function is well approximated by a quadratic (for example, near an optimum), Newton's method converges very quickly by exploiting the second-order information in the Hessian matrix. However, when the function is not well-approximated by a quadratic, either because the starting point is far from the optimum or the function has a more irregular shape, Newton steps can be erratically large, leading to distant, irrelevant areas of the space. Trust region methods use second-order information but restrict the steps to be within a \"trust region\" where the function is believed to be approximately quadratic. At iteration $k$, a trust region method chooses a step $p$ to minimize a quadratic approximation to the objective such that the step size is no larger than a given trust region size, $\\Delta_k$. \\underset{p\\in\\mathbb{R}^n}\\min m_k(p) = f_k + g_k^T p + \\frac{1}{2}p^T B_k p \\quad\\textrm{such that } ||p||\\le \\Delta_k Here, $p$ is the step to take at iteration $k$, so that $x_{k+1} = x_k + p$. In the definition of $m_k(p)$, $f_k = f(x_k)$ is the value at the previous location, $g_k=\\nabla f(x_k)$ is the gradient at the previous location, $B_k = \\nabla^2 f(x_k)$ is the Hessian matrix at the previous iterate, and $||\\cdot||$ is the Euclidean norm. If the trust region size, $\\Delta_k$, is large enough that the minimizer of the quadratic approximation $m_k(p)$ has $||p|| \\le \\Delta_k$, then the step is the same as an ordinary Newton step. However, if the unconstrained quadratic minimizer lies outside the trust region, then the minimizer to the constrained problem will occur on the boundary, i.e. we will have $||p|| = \\Delta_k$. 
It turns out that when the Cholesky decomposition of $B_k$ can be computed, the optimal $p$ can be found numerically with relative ease ([1], section 4.3). This is the method currently used in Optim. It makes sense to adapt the trust region size, $\\Delta_k$, as one moves through the space and assesses the quality of the quadratic fit. This adaptation is controlled by the parameters $\\eta$, $\\rho_{lower}$, and $\\rho_{upper}$, which are parameters to the NewtonTrustRegion optimization method. For each step, we calculate \\rho_k := \\frac{f(x_{k+1}) - f(x_k)}{m_k(p) - m_k(0)} Intuitively, $\\rho_k$ measures the quality of the quadratic approximation: if $\\rho_k \\approx 1$, then our quadratic approximation is reasonable. If $p$ was on the boundary and $\\rho_k > \\rho_{upper}$, then perhaps we can benefit from larger steps. In this case, for the next iteration we grow the trust region geometrically up to a maximum of $\\hat\\Delta$: \\rho_k > \\rho_{upper} \\Rightarrow \\Delta_{k+1} = \\min(2 \\Delta_k, \\hat\\Delta). Conversely, if $\\rho_k < \\rho_{lower}$, then we shrink the trust region geometrically: $\\rho_k < \\rho_{lower} \\Rightarrow \\Delta_{k+1} = 0.25 \\Delta_k$. Finally, we only accept a point if its decrease is appreciable compared to the quadratic approximation. Specifically, a step is only accepted if $\\rho_k > \\eta$. As long as we choose $\\eta$ to be less than $\\rho_{lower}$, we will shrink the trust region whenever we reject a step. 
Eventually, if the objective function is locally quadratic, $\\Delta_k$ will become small enough that a quadratic approximation will be accurate enough to make progress again.","title":"Description"},{"location":"algo/newton_trust_region/#example","text":"using Optim, OptimTestProblems prob = OptimTestProblems.UnconstrainedProblems.examples[\"Rosenbrock\"]; res = Optim.optimize(prob.f, prob.g!, prob.h!, prob.initial_x, NewtonTrustRegion())","title":"Example"},{"location":"algo/newton_trust_region/#references","text":"[1] Nocedal, Jorge, and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.","title":"References"},{"location":"algo/ngmres/","text":"Acceleration methods: N-GMRES and O-ACCEL Constructors NGMRES(; alphaguess = LineSearches.InitialStatic(), linesearch = LineSearches.HagerZhang(), manifold = Flat(), wmax::Int = 10, \u03f50 = 1e-12, nlprecon = GradientDescent( alphaguess = LineSearches.InitialStatic(alpha=1e-4,scaled=true), linesearch = LineSearches.Static(), manifold = manifold), nlpreconopts = Options(iterations = 1, allow_f_increases = true), ) OACCEL(;manifold::Manifold = Flat(), alphaguess = LineSearches.InitialStatic(), linesearch = LineSearches.HagerZhang(), nlprecon = GradientDescent( alphaguess = LineSearches.InitialStatic(alpha=1e-4,scaled=true), linesearch = LineSearches.Static(), manifold = manifold), nlpreconopts = Options(iterations = 1, allow_f_increases = true), \u03f50 = 1e-12, wmax::Int = 10) Description These algorithms take a step given by the nonlinear preconditioner nlprecon and propose an accelerated step on a subspace spanned by the previous wmax iterates. N-GMRES accelerates based on a minimization of an approximation to the $\\ell_2$ norm of the gradient. O-ACCEL accelerates based on a minimization of an approximation to the objective. N-GMRES was originally developed for solving nonlinear systems [1], and reduces to GMRES for linear problems. 
Application of the algorithm to optimization is covered, for example, in [2]. A description of O-ACCEL and its connection to N-GMRES can be found in [3]. We recommend trying LBFGS on your problem before N-GMRES or O-ACCEL. All three algorithms have similar computational cost and memory requirements; however, L-BFGS is more efficient for many problems. Example This example shows how to accelerate GradientDescent on the Extended Rosenbrock problem. First, we try to optimize using GradientDescent . using Optim, OptimTestProblems UP = OptimTestProblems.UnconstrainedProblems prob = UP.examples[\"Extended Rosenbrock\"] optimize(UP.objective(prob), UP.gradient(prob), prob.initial_x, GradientDescent()) The algorithm does not converge within 1000 iterations. Results of Optimization Algorithm * Algorithm: Gradient Descent * Starting Point: [-1.2,1.0, ...] * Minimizer: [0.8923389282461412,0.7961268644300445, ...] * Minimum: 2.898230e-01 * Iterations: 1000 * Convergence: false * |x - x'| \u2264 0.0e+00: false |x - x'| = 4.02e-04 * |f(x) - f(x')| \u2264 0.0e+00 |f(x)|: false |f(x) - f(x')| = 2.38e-03 |f(x)| * |g(x)| \u2264 1.0e-08: false |g(x)| = 8.23e-02 * Stopped by an increasing objective: false * Reached Maximum Number of Iterations: true * Objective Calls: 2525 * Gradient Calls: 2525 Now, we use OACCEL to accelerate GradientDescent . # Default nonlinear preconditioner for `OACCEL` nlprecon = GradientDescent(alphaguess=LineSearches.InitialStatic(alpha=1e-4,scaled=true), linesearch=LineSearches.Static()) # Default size of subspace that OACCEL accelerates over is `wmax = 10` oacc10 = OACCEL(nlprecon=nlprecon, wmax=10) optimize(UP.objective(prob), UP.gradient(prob), prob.initial_x, oacc10) This drastically improves the GradientDescent algorithm, converging in 87 iterations. Results of Optimization Algorithm * Algorithm: O-ACCEL preconditioned with Gradient Descent * Starting Point: [-1.2,1.0, ...] * Minimizer: [1.0000000011361219,1.0000000022828495, ...] 
* Minimum: 3.255053e-17 * Iterations: 87 * Convergence: true * |x - x'| \u2264 0.0e+00: false |x - x'| = 6.51e-08 * |f(x) - f(x')| \u2264 0.0e+00 |f(x)|: false |f(x) - f(x')| = 7.56e+02 |f(x)| * |g(x)| \u2264 1.0e-08: true |g(x)| = 1.06e-09 * Stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 285 * Gradient Calls: 285 We can improve the acceleration further by changing the acceleration subspace size wmax . oacc5 = OACCEL(nlprecon=nlprecon, wmax=5) optimize(UP.objective(prob), UP.gradient(prob), prob.initial_x, oacc5) Now, the O-ACCEL algorithm has accelerated GradientDescent to converge in 50 iterations. Results of Optimization Algorithm * Algorithm: O-ACCEL preconditioned with Gradient Descent * Starting Point: [-1.2,1.0, ...] * Minimizer: [0.9999999999392858,0.9999999998784691, ...] * Minimum: 9.218164e-20 * Iterations: 50 * Convergence: true * |x - x'| \u2264 0.0e+00: false |x - x'| = 2.76e-07 * |f(x) - f(x')| \u2264 0.0e+00 |f(x)|: false |f(x) - f(x')| = 5.18e+06 |f(x)| * |g(x)| \u2264 1.0e-08: true |g(x)| = 4.02e-11 * Stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 181 * Gradient Calls: 181 As a final comparison, we can do the same with N-GMRES. ngmres5 = NGMRES(nlprecon=nlprecon, wmax=5) optimize(UP.objective(prob), UP.gradient(prob), prob.initial_x, ngmres5) Again, this significantly improves the GradientDescent algorithm, and converges in 63 iterations. Results of Optimization Algorithm * Algorithm: Nonlinear GMRES preconditioned with Gradient Descent * Starting Point: [-1.2,1.0, ...] * Minimizer: [0.9999999998534468,0.9999999997063993, ...] 
* Minimum: 5.375569e-19 * Iterations: 63 * Convergence: true * |x - x'| \u2264 0.0e+00: false |x - x'| = 9.94e-09 * |f(x) - f(x')| \u2264 0.0e+00 |f(x)|: false |f(x) - f(x')| = 1.29e+03 |f(x)| * |g(x)| \u2264 1.0e-08: true |g(x)| = 4.94e-11 * Stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 222 * Gradient Calls: 222 References [1] De Sterck. Steepest descent preconditioning for nonlinear GMRES optimization. NLAA, 2013. [2] Washio and Oosterlee. Krylov subspace acceleration for nonlinear multigrid schemes. ETNA, 1997. [3] Riseth. Objective acceleration for unconstrained optimization. 2018.","title":"Acceleration"},{"location":"algo/ngmres/#acceleration-methods-n-gmres-and-o-accel","text":"","title":"Acceleration methods: N-GMRES and O-ACCEL"},{"location":"algo/ngmres/#constructors","text":"NGMRES(; alphaguess = LineSearches.InitialStatic(), linesearch = LineSearches.HagerZhang(), manifold = Flat(), wmax::Int = 10, \u03f50 = 1e-12, nlprecon = GradientDescent( alphaguess = LineSearches.InitialStatic(alpha=1e-4,scaled=true), linesearch = LineSearches.Static(), manifold = manifold), nlpreconopts = Options(iterations = 1, allow_f_increases = true), ) OACCEL(;manifold::Manifold = Flat(), alphaguess = LineSearches.InitialStatic(), linesearch = LineSearches.HagerZhang(), nlprecon = GradientDescent( alphaguess = LineSearches.InitialStatic(alpha=1e-4,scaled=true), linesearch = LineSearches.Static(), manifold = manifold), nlpreconopts = Options(iterations = 1, allow_f_increases = true), \u03f50 = 1e-12, wmax::Int = 10)","title":"Constructors"},{"location":"algo/ngmres/#description","text":"These algorithms take a step given by the nonlinear preconditioner nlprecon and proposes an accelerated step on a subspace spanned by the previous wmax iterates. N-GMRES accelerates based on a minimization of an approximation to the $\\ell_2$ norm of the gradient. 
O-ACCEL accelerates based on a minimization of an approximation to the objective. N-GMRES was originally developed for solving nonlinear systems [1], and reduces to GMRES for linear problems. Application of the algorithm to optimization is covered, for example, in [2]. A description of O-ACCEL and its connection to N-GMRES can be found in [3]. We recommend trying LBFGS on your problem before N-GMRES or O-ACCEL. All three algorithms have similar computational cost and memory requirements; however, L-BFGS is more efficient for many problems.","title":"Description"},{"location":"algo/ngmres/#example","text":"This example shows how to accelerate GradientDescent on the Extended Rosenbrock problem. First, we try to optimize using GradientDescent . using Optim, OptimTestProblems UP = OptimTestProblems.UnconstrainedProblems prob = UP.examples[\"Extended Rosenbrock\"] optimize(UP.objective(prob), UP.gradient(prob), prob.initial_x, GradientDescent()) The algorithm does not converge within 1000 iterations. Results of Optimization Algorithm * Algorithm: Gradient Descent * Starting Point: [-1.2,1.0, ...] * Minimizer: [0.8923389282461412,0.7961268644300445, ...] * Minimum: 2.898230e-01 * Iterations: 1000 * Convergence: false * |x - x'| \u2264 0.0e+00: false |x - x'| = 4.02e-04 * |f(x) - f(x')| \u2264 0.0e+00 |f(x)|: false |f(x) - f(x')| = 2.38e-03 |f(x)| * |g(x)| \u2264 1.0e-08: false |g(x)| = 8.23e-02 * Stopped by an increasing objective: false * Reached Maximum Number of Iterations: true * Objective Calls: 2525 * Gradient Calls: 2525 Now, we use OACCEL to accelerate GradientDescent . 
# Default nonlinear preconditioner for `OACCEL` nlprecon = GradientDescent(alphaguess=LineSearches.InitialStatic(alpha=1e-4,scaled=true), linesearch=LineSearches.Static()) # Default size of subspace that OACCEL accelerates over is `wmax = 10` oacc10 = OACCEL(nlprecon=nlprecon, wmax=10) optimize(UP.objective(prob), UP.gradient(prob), prob.initial_x, oacc10) This drastically improves the GradientDescent algorithm, converging in 87 iterations. Results of Optimization Algorithm * Algorithm: O-ACCEL preconditioned with Gradient Descent * Starting Point: [-1.2,1.0, ...] * Minimizer: [1.0000000011361219,1.0000000022828495, ...] * Minimum: 3.255053e-17 * Iterations: 87 * Convergence: true * |x - x'| \u2264 0.0e+00: false |x - x'| = 6.51e-08 * |f(x) - f(x')| \u2264 0.0e+00 |f(x)|: false |f(x) - f(x')| = 7.56e+02 |f(x)| * |g(x)| \u2264 1.0e-08: true |g(x)| = 1.06e-09 * Stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 285 * Gradient Calls: 285 We can improve the acceleration further by changing the acceleration subspace size wmax . oacc5 = OACCEL(nlprecon=nlprecon, wmax=5) optimize(UP.objective(prob), UP.gradient(prob), prob.initial_x, oacc5) Now, the O-ACCEL algorithm has accelerated GradientDescent to converge in 50 iterations. Results of Optimization Algorithm * Algorithm: O-ACCEL preconditioned with Gradient Descent * Starting Point: [-1.2,1.0, ...] * Minimizer: [0.9999999999392858,0.9999999998784691, ...] * Minimum: 9.218164e-20 * Iterations: 50 * Convergence: true * |x - x'| \u2264 0.0e+00: false |x - x'| = 2.76e-07 * |f(x) - f(x')| \u2264 0.0e+00 |f(x)|: false |f(x) - f(x')| = 5.18e+06 |f(x)| * |g(x)| \u2264 1.0e-08: true |g(x)| = 4.02e-11 * Stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 181 * Gradient Calls: 181 As a final comparison, we can do the same with N-GMRES. 
ngmres5 = NGMRES(nlprecon=nlprecon, wmax=5) optimize(UP.objective(prob), UP.gradient(prob), prob.initial_x, ngmres5) Again, this significantly improves the GradientDescent algorithm, and converges in 63 iterations. Results of Optimization Algorithm * Algorithm: Nonlinear GMRES preconditioned with Gradient Descent * Starting Point: [-1.2,1.0, ...] * Minimizer: [0.9999999998534468,0.9999999997063993, ...] * Minimum: 5.375569e-19 * Iterations: 63 * Convergence: true * |x - x'| \u2264 0.0e+00: false |x - x'| = 9.94e-09 * |f(x) - f(x')| \u2264 0.0e+00 |f(x)|: false |f(x) - f(x')| = 1.29e+03 |f(x)| * |g(x)| \u2264 1.0e-08: true |g(x)| = 4.94e-11 * Stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 222 * Gradient Calls: 222","title":"Example"},{"location":"algo/ngmres/#references","text":"[1] De Sterck. Steepest descent preconditioning for nonlinear GMRES optimization. NLAA, 2013. [2] Washio and Oosterlee. Krylov subspace acceleration for nonlinear multigrid schemes. ETNA, 1997. [3] Riseth. Objective acceleration for unconstrained optimization. 2018.","title":"References"},{"location":"algo/particle_swarm/","text":"Particle Swarm Constructor ParticleSwarm(; lower = [], upper = [], n_particles = 0) The constructor takes three keywords: lower = [] , a vector of lower bounds, unbounded below if empty or Inf 's upper = [] , a vector of upper bounds, unbounded above if empty or Inf 's n_particles = 0 , number of particles in the swarm, defaults to least three Description The Particle Swarm implementation in Optim.jl is the so-called Adaptive Particle Swarm algorithm in [1]. It attempts to improve global coverage and convergence by switching between four evolutionary states: exploration, exploitation, convergence, and jumping out. In the jumping out state it intentially tries to take the best particle and move it away from its (potentially and probably) local optimum, to improve the ability to find a global optimum. 
Of course, this comes at the cost of slower convergence, but hopefully converges to the global optimum as a result. References [1] Zhan, Zhang, and Chung. Adaptive particle swarm optimization, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Volume 39, Issue 6, 2009, Pages 1362-1381.","title":"Particle Swarm"},{"location":"algo/particle_swarm/#particle-swarm","text":"","title":"Particle Swarm"},{"location":"algo/particle_swarm/#constructor","text":"ParticleSwarm(; lower = [], upper = [], n_particles = 0) The constructor takes three keywords: lower = [] , a vector of lower bounds, unbounded below if empty or Inf 's upper = [] , a vector of upper bounds, unbounded above if empty or Inf 's n_particles = 0 , number of particles in the swarm, defaults to at least three","title":"Constructor"},{"location":"algo/particle_swarm/#description","text":"The Particle Swarm implementation in Optim.jl is the so-called Adaptive Particle Swarm algorithm in [1]. It attempts to improve global coverage and convergence by switching between four evolutionary states: exploration, exploitation, convergence, and jumping out. In the jumping out state it intentionally tries to take the best particle and move it away from its (potentially and probably) local optimum, to improve the ability to find a global optimum. Of course, this comes at the cost of slower convergence, but hopefully converges to the global optimum as a result.","title":"Description"},{"location":"algo/particle_swarm/#references","text":"[1] Zhan, Zhang, and Chung. Adaptive particle swarm optimization, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Volume 39, Issue 6, 2009, Pages 1362-1381.","title":"References"},{"location":"algo/precondition/","text":"Preconditioning The GradientDescent , ConjugateGradient and LBFGS methods support preconditioning. A preconditioner can be thought of as a change of coordinates under which the Hessian is better conditioned. 
With a good preconditioner, substantially improved convergence is possible. A preconditioner P can be of any type as long as the following two methods are implemented: A_ldiv_B!(pgr, P, gr) : apply P to a vector gr and store in pgr (intuitively, pgr = P \\ gr ) dot(x, P, y) : the inner product induced by P (intuitively, dot(x, P * y) ) Precisely what these operations mean depends on how P is stored. Commonly, we store a matrix P which approximates the Hessian in some vague sense. In this case, A_ldiv_B!(pgr, P, gr) = copyto!(pgr, P \\ gr) dot(x, P, y) = dot(x, P * y) Finally, it is possible to update the preconditioner as the state variable x changes. This is done through precondprep! which is passed to the optimizers as a keyword argument, e.g., method=ConjugateGradient(P = precond(100), precondprep! = precond(100)) though in this case it would always return the same matrix. (See fminbox.jl for a more natural example.) Apart from preconditioning with matrices, Optim.jl provides a type InverseDiagonal , which represents a diagonal matrix by its inverse elements. Example Below, we see an example where a function is minimized without and with a preconditioner applied. using ForwardDiff, Optim, SparseArrays initial_x = zeros(100) plap(U; n = length(U)) = (n-1)*sum((0.1 .+ diff(U).^2).^2 ) - sum(U) / (n-1) plap1(x) = ForwardDiff.gradient(plap,x) precond(n) = spdiagm(-1 => -ones(n-1), 0 => 2ones(n), 1 => -ones(n-1)) * (n+1) f(x) = plap([0; x; 0]) g!(G, x) = copyto!(G, (plap1([0; x; 0]))[2:end-1]) result = Optim.optimize(f, g!, initial_x, method = ConjugateGradient(P = nothing)) result = Optim.optimize(f, g!, initial_x, method = ConjugateGradient(P = precond(100))) The former optimize call converges at a slower rate than the latter. Looking at a plot of the 2D version of the function shows the problem. The contours are shaped like ellipsoids, but we would rather want them to be circles. 
Using the preconditioner effectively changes the coordinates such that the contours become less ellipsoid-like. Benchmarking shows that using preconditioning provides an approximate speed-up factor of 15 in this 100-dimensional case. References","title":"Preconditioners"},{"location":"algo/precondition/#preconditioning","text":"The GradientDescent , ConjugateGradient and LBFGS methods support preconditioning. A preconditioner can be thought of as a change of coordinates under which the Hessian is better conditioned. With a good preconditioner, substantially improved convergence is possible. A preconditioner P can be of any type as long as the following two methods are implemented: A_ldiv_B!(pgr, P, gr) : apply P to a vector gr and store in pgr (intuitively, pgr = P \\ gr ) dot(x, P, y) : the inner product induced by P (intuitively, dot(x, P * y) ) Precisely what these operations mean depends on how P is stored. Commonly, we store a matrix P which approximates the Hessian in some vague sense. In this case, A_ldiv_B!(pgr, P, gr) = copyto!(pgr, P \\ gr) dot(x, P, y) = dot(x, P * y) Finally, it is possible to update the preconditioner as the state variable x changes. This is done through precondprep! which is passed to the optimizers as a keyword argument, e.g., method=ConjugateGradient(P = precond(100), precondprep! = precond(100)) though in this case it would always return the same matrix. (See fminbox.jl for a more natural example.) Apart from preconditioning with matrices, Optim.jl provides a type InverseDiagonal , which represents a diagonal matrix by its inverse elements.","title":"Preconditioning"},{"location":"algo/precondition/#example","text":"Below, we see an example where a function is minimized without and with a preconditioner applied. 
using ForwardDiff, Optim, SparseArrays initial_x = zeros(100) plap(U; n = length(U)) = (n-1)*sum((0.1 .+ diff(U).^2).^2 ) - sum(U) / (n-1) plap1(x) = ForwardDiff.gradient(plap,x) precond(n) = spdiagm(-1 => -ones(n-1), 0 => 2ones(n), 1 => -ones(n-1)) * (n+1) f(x) = plap([0; x; 0]) g!(G, x) = copyto!(G, (plap1([0; x; 0]))[2:end-1]) result = Optim.optimize(f, g!, initial_x, method = ConjugateGradient(P = nothing)) result = Optim.optimize(f, g!, initial_x, method = ConjugateGradient(P = precond(100))) The former optimize call converges at a slower rate than the latter. Looking at a plot of the 2D version of the function shows the problem. The contours are shaped like ellipsoids, but we would rather want them to be circles. Using the preconditioner effectively changes the coordinates such that the contours become less ellipsoid-like. Benchmarking shows that using preconditioning provides an approximate speed-up factor of 15 in this 100-dimensional case.","title":"Example"},{"location":"algo/precondition/#references","text":"","title":"References"},{"location":"algo/samin/","text":"SAMIN Constructor SAMIN(; nt::Int = 5 # reduce temperature every nt*ns*dim(x_init) evaluations ns::Int = 5 # adjust bounds every ns*dim(x_init) evaluations rt::T = 0.9 # geometric temperature reduction factor: when temp changes, new temp is t=rt*t neps::Int = 5 # number of previous best values the final result is compared to f_tol::T = 1e-12 # the required tolerance level for function value comparisons x_tol::T = 1e-6 # the required tolerance level for x coverage_ok::Bool = false, # if false, increase temperature until initial parameter space is covered verbosity::Int = 0) # scalar: 0, 1, 2 or 3 (default = 0). Description The SAMIN method implements the Simulated Annealing algorithm for problems with bound constraints as described in Goffe et al. (1994) and Goffe (1996). A key control parameter is rt, the geometric temperature reduction rate, which should be between zero and one. 
Setting rt lower will cause the algorithm to contract the search space more quickly, reducing the run time. Setting rt too low will cause the algorithm to narrow the search too quickly, and the true minimizer may be skipped over. If possible, run the algorithm multiple times to verify that the same solution is found each time. If this is not the case, increase rt. When in doubt, start with a conservative rt, for example, rt=0.95, and allow for a generous iteration limit. The algorithm requires lower and upper bounds on the parameters, although these bounds are often set rather wide, and are not necessarily meant to reflect constraints in the model, but rather bounds that enclose the parameter space. If the final x s are very close to the boundary (which can be checked by setting verbosity=1), it is a good idea to restart the optimizer with wider bounds, unless the bounds actually reflect hard constraints on x . Example This example shows a successful minimization: julia> using Optim, OptimTestProblems julia> prob = OptimTestProblems.UnconstrainedProblems.examples[\"Rosenbrock\"]; julia> res = Optim.optimize(prob.f, fill(-100.0, 2), fill(100.0, 2), prob.initial_x, SAMIN(), Optim.Options(iterations=10^6)) ================================================================================ SAMIN results == Normal convergence == total number of objective function evaluations: 23701 Obj. 
value: 0.0000000000 parameter search width 1.00000 0.00000 1.00000 0.00000 ================================================================================ Results of Optimization Algorithm * Algorithm: SAMIN * Starting Point: [-1.2,1.0] * Minimizer: [0.9999999893140956,0.9999999765350857] * Minimum: 5.522977e-16 * Iterations: 23701 * Convergence: false * |x - x'| \u2264 0.0e+00: false |x - x'| = NaN * |f(x) - f(x')| \u2264 0.0e+00 |f(x)|: false |f(x) - f(x')| = NaN |f(x)| * |g(x)| \u2264 0.0e+00: false |g(x)| = NaN * Stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 23701 * Gradient Calls: 0 Example This example shows an unsuccessful minimization, because the cooling rate, rt=0.5, is too rapid: julia using Optim, OptimTestProblems julia prob = OptimTestProblems.UnconstrainedProblems.examples[ Rosenbrock ]; julia res = Optim.optimize(prob.f, fill(-100.0, 2), fill(100.0, 2), prob.initial_x, SAMIN(rt=0.5), Optim.Options(iterations=10^6)) ================================================================================ SAMIN results == Normal convergence == total number of objective function evaluations: 12051 Obj. value: 0.0011613045 parameter search width 0.96592 0.00000 0.93301 0.00000 ================================================================================ Results of Optimization Algorithm * Algorithm: SAMIN * Starting Point: [-1.2,1.0] * Minimizer: [0.9659220825756248,0.9330054696322896] * Minimum: 1.161304e-03 * Iterations: 12051 * Convergence: false * |x - x'| \u2264 0.0e+00: false |x - x'| = NaN * |f(x) - f(x')| \u2264 0.0e+00 |f(x)|: false |f(x) - f(x')| = NaN |f(x)| * |g(x)| \u2264 0.0e+00: false |g(x)| = NaN * Stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 12051 * Gradient Calls: 0 References Goffe, et. al. (1994) \"Global Optimization of Statistical Functions with Simulated Annealing\", Journal of Econometrics, V. 60, N. 1/2. 
Goffe, William L. (1996) \"SIMANN: A Global Optimization Algorithm using Simulated Annealing\", Studies in Nonlinear Dynamics & Econometrics, Oct96, Vol. 1, Issue 3.","title":"Simulated Annealing w/ bounds"},{"location":"algo/samin/#samin","text":"","title":"SAMIN"},{"location":"algo/samin/#constructor","text":"SAMIN(; nt::Int = 5 # reduce temperature every nt*ns*dim(x_init) evaluations ns::Int = 5 # adjust bounds every ns*dim(x_init) evaluations rt::T = 0.9 # geometric temperature reduction factor: when temp changes, new temp is t=rt*t neps::Int = 5 # number of previous best values the final result is compared to f_tol::T = 1e-12 # the required tolerance level for function value comparisons x_tol::T = 1e-6 # the required tolerance level for x coverage_ok::Bool = false, # if false, increase temperature until initial parameter space is covered verbosity::Int = 0) # scalar: 0, 1, 2 or 3 (default = 0).","title":"Constructor"},{"location":"algo/samin/#description","text":"The SAMIN method implements the Simulated Annealing algorithm for problems with bound constraints as described in Goffe et al. (1994) and Goffe (1996). A key control parameter is rt, the geometric temperature reduction rate, which should be between zero and one. Setting rt lower will cause the algorithm to contract the search space more quickly, reducing the run time. Setting rt too low will cause the algorithm to narrow the search too quickly, and the true minimizer may be skipped over. If possible, run the algorithm multiple times to verify that the same solution is found each time. If this is not the case, increase rt. When in doubt, start with a conservative rt, for example, rt=0.95, and allow for a generous iteration limit. The algorithm requires lower and upper bounds on the parameters, although these bounds are often set rather wide, and are not necessarily meant to reflect constraints in the model, but rather bounds that enclose the parameter space. 
If the final x s are very close to the boundary (which can be checked by setting verbosity=1), it is a good idea to restart the optimizer with wider bounds, unless the bounds actually reflect hard constraints on x .","title":"Description"},{"location":"algo/samin/#example","text":"This example shows a successful minimization: julia> using Optim, OptimTestProblems julia> prob = OptimTestProblems.UnconstrainedProblems.examples[\"Rosenbrock\"]; julia> res = Optim.optimize(prob.f, fill(-100.0, 2), fill(100.0, 2), prob.initial_x, SAMIN(), Optim.Options(iterations=10^6)) ================================================================================ SAMIN results == Normal convergence == total number of objective function evaluations: 23701 Obj. value: 0.0000000000 parameter search width 1.00000 0.00000 1.00000 0.00000 ================================================================================ Results of Optimization Algorithm * Algorithm: SAMIN * Starting Point: [-1.2,1.0] * Minimizer: [0.9999999893140956,0.9999999765350857] * Minimum: 5.522977e-16 * Iterations: 23701 * Convergence: false * |x - x'| \u2264 0.0e+00: false |x - x'| = NaN * |f(x) - f(x')| \u2264 0.0e+00 |f(x)|: false |f(x) - f(x')| = NaN |f(x)| * |g(x)| \u2264 0.0e+00: false |g(x)| = NaN * Stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 23701 * Gradient Calls: 0","title":"Example"},{"location":"algo/samin/#example_1","text":"This example shows an unsuccessful minimization, because the cooling rate, rt=0.5, is too rapid: julia> using Optim, OptimTestProblems julia> prob = OptimTestProblems.UnconstrainedProblems.examples[\"Rosenbrock\"]; julia> res = Optim.optimize(prob.f, fill(-100.0, 2), fill(100.0, 2), prob.initial_x, SAMIN(rt=0.5), Optim.Options(iterations=10^6)) ================================================================================ SAMIN results == Normal convergence == total number of objective function evaluations: 12051 Obj. 
value: 0.0011613045 parameter search width 0.96592 0.00000 0.93301 0.00000 ================================================================================ Results of Optimization Algorithm * Algorithm: SAMIN * Starting Point: [-1.2,1.0] * Minimizer: [0.9659220825756248,0.9330054696322896] * Minimum: 1.161304e-03 * Iterations: 12051 * Convergence: false * |x - x'| \u2264 0.0e+00: false |x - x'| = NaN * |f(x) - f(x')| \u2264 0.0e+00 |f(x)|: false |f(x) - f(x')| = NaN |f(x)| * |g(x)| \u2264 0.0e+00: false |g(x)| = NaN * Stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 12051 * Gradient Calls: 0","title":"Example"},{"location":"algo/samin/#references","text":"Goffe et al. (1994) \"Global Optimization of Statistical Functions with Simulated Annealing\", Journal of Econometrics, V. 60, N. 1/2. Goffe, William L. (1996) \"SIMANN: A Global Optimization Algorithm using Simulated Annealing\", Studies in Nonlinear Dynamics & Econometrics, Oct96, Vol. 1 Issue 3.","title":"References"},{"location":"algo/simulated_annealing/","text":"Simulated Annealing Constructor SimulatedAnnealing(; neighbor = default_neighbor!, T = default_temperature, p = kirkpatrick) The constructor takes three keywords: neighbor = a!(x_proposed, x_current) , a mutating function of the current x, and the proposed x T = b(iteration) , a function of the current iteration that returns a temperature p = c(f_proposal, f_current, T) , a function of the current temperature, current function value and proposed function value that returns an acceptance probability Description Simulated Annealing is a derivative-free method for optimization. It is based on the Metropolis-Hastings algorithm, which was originally used to generate samples from a thermodynamic system and is often used to generate draws from a posterior when doing Bayesian inference. As such, it is a probabilistic method for finding the minimum of a function, often over a quite large domain. 
For the historical reasons given above, the algorithm uses terms such as cooling, temperature, and acceptance probabilities. As the constructor shows, a simulated annealing implementation is characterized by a temperature, a neighbor function, and an acceptance probability. The temperature controls how volatile the changes in minimizer candidates are allowed to be, as it enters the acceptance probability. For example, the original Kirkpatrick et al. acceptance probability function can be written as follows: p(f_proposal, f_current, T) = exp(-(f_proposal - f_current)/T) A high temperature makes it more likely that a draw is accepted, by pushing the acceptance probability towards 1. As in the Metropolis-Hastings algorithm, we always accept a smaller function value, but we also sometimes accept a larger value. As the temperature decreases, we become more and more likely to accept only candidates x that lower the function value. To obtain a new f_proposal , we need a neighbor function. A simple neighbor function adds a standard normal draw to each dimension of x: function neighbor!(x_proposal::Array, x::Array) for i in eachindex(x) x_proposal[i] = x[i]+randn() end end As we see, it is not really possible to disentangle the roles of the different components of the algorithm. For example, the functional form of the acceptance function, the temperature and (indirectly) the neighbor function all determine whether the next draw of x is accepted or not. The current implementation of Simulated Annealing is very rough. It lacks quite a few features which are normally part of a proper SA implementation. A better implementation is under way, see this issue . 
Example References","title":"Simulated Annealing"},{"location":"algo/simulated_annealing/#simulated-annealing","text":"","title":"Simulated Annealing"},{"location":"algo/simulated_annealing/#constructor","text":"SimulatedAnnealing(; neighbor = default_neighbor!, T = default_temperature, p = kirkpatrick) The constructor takes three keywords: neighbor = a!(x_proposed, x_current) , a mutating function of the current x, and the proposed x T = b(iteration) , a function of the current iteration that returns a temperature p = c(f_proposal, f_current, T) , a function of the current temperature, current function value and proposed function value that returns an acceptance probability","title":"Constructor"},{"location":"algo/simulated_annealing/#description","text":"Simulated Annealing is a derivative-free method for optimization. It is based on the Metropolis-Hastings algorithm, which was originally used to generate samples from a thermodynamic system and is often used to generate draws from a posterior when doing Bayesian inference. As such, it is a probabilistic method for finding the minimum of a function, often over a quite large domain. For the historical reasons given above, the algorithm uses terms such as cooling, temperature, and acceptance probabilities. As the constructor shows, a simulated annealing implementation is characterized by a temperature, a neighbor function, and an acceptance probability. The temperature controls how volatile the changes in minimizer candidates are allowed to be, as it enters the acceptance probability. For example, the original Kirkpatrick et al. acceptance probability function can be written as follows: p(f_proposal, f_current, T) = exp(-(f_proposal - f_current)/T) A high temperature makes it more likely that a draw is accepted, by pushing the acceptance probability towards 1. As in the Metropolis-Hastings algorithm, we always accept a smaller function value, but we also sometimes accept a larger value. 
As the temperature decreases, we become more and more likely to accept only candidates x that lower the function value. To obtain a new f_proposal , we need a neighbor function. A simple neighbor function adds a standard normal draw to each dimension of x: function neighbor!(x_proposal::Array, x::Array) for i in eachindex(x) x_proposal[i] = x[i]+randn() end end As we see, it is not really possible to disentangle the roles of the different components of the algorithm. For example, the functional form of the acceptance function, the temperature and (indirectly) the neighbor function all determine whether the next draw of x is accepted or not. The current implementation of Simulated Annealing is very rough. It lacks quite a few features which are normally part of a proper SA implementation. A better implementation is under way, see this issue .","title":"Description"},{"location":"algo/simulated_annealing/#example","text":"","title":"Example"},{"location":"algo/simulated_annealing/#references","text":"","title":"References"},{"location":"dev/","text":"Using Nelder Mead","title":"Home"},{"location":"dev/#using-nelder-mead","text":"","title":"Using Nelder Mead"},{"location":"dev/contributing/","text":"Notes for contributing We are always happy to get help from people who normally do not contribute to the package. However, to make the process run smoothly, we ask you to read this page before creating your pull request. That way it is more probable that your changes will be incorporated, and in the end it will mean less work for everyone. Things to consider When proposing a change to Optim.jl , there are a few things to consider. If you're in doubt, feel free to reach out. A simple way to get in touch is to join our gitter channel . Before submitting a pull request, please consider the following points: Did you remember to provide tests for your changes? If not, please do so, or ask for help. Did your change add new functionality? Remember to add a section in the documentation. 
Did you change existing code in a breaking way? Then remember to use Julia's deprecation tools to help users migrate to the new syntax. Add a note in the NEWS.md file, so we can keep track of changes between versions. Adding a solver If you're contributing a new solver, you shouldn't need to touch any of the code in src/optimize.jl . Instead, add a file named solver.jl in src (where solver is the name of the solver), and make sure that you define an Optimizer subtype struct Solver <: Optimizer end with appropriate fields, a default constructor with a keyword for each field, a state type that holds all variables that are (re)used throughout the iterative procedure, an initial_state that initializes such a state, and an update! method that does the actual work. Say you want to contribute a solver called Minim , then your src/minim.jl file would look something like struct Minim{IF, F<:Function, T} <: Optimizer alphaguess!::IF linesearch!::F minim_parameter::T end Minim(; alphaguess = LineSearches.InitialStatic(), linesearch = LineSearches.HagerZhang(), minim_parameter = 1.0) = Minim(alphaguess, linesearch, minim_parameter) mutable struct MinimState{T,N,G} x::AbstractArray{T,N} x_previous::AbstractArray{T,N} f_x_previous::T s::AbstractArray{T,N} @add_linesearch_fields() end function initial_state(method::Minim, options, d, initial_x) # prepare cache variables etc here end function update!(d, state::MinimState{T}, method::Minim) where T # code for Minim here false # should the procedure force quit? end","title":"Contributing"},{"location":"dev/contributing/#notes-for-contributing","text":"We are always happy to get help from people who normally do not contribute to the package. However, to make the process run smoothly, we ask you to read this page before creating your pull request. 
That way it is more probable that your changes will be incorporated, and in the end it will mean less work for everyone.","title":"Notes for contributing"},{"location":"dev/contributing/#things-to-consider","text":"When proposing a change to Optim.jl , there are a few things to consider. If you're in doubt, feel free to reach out. A simple way to get in touch is to join our gitter channel . Before submitting a pull request, please consider the following points: Did you remember to provide tests for your changes? If not, please do so, or ask for help. Did your change add new functionality? Remember to add a section in the documentation. Did you change existing code in a breaking way? Then remember to use Julia's deprecation tools to help users migrate to the new syntax. Add a note in the NEWS.md file, so we can keep track of changes between versions.","title":"Things to consider"},{"location":"dev/contributing/#adding-a-solver","text":"If you're contributing a new solver, you shouldn't need to touch any of the code in src/optimize.jl . Instead, add a file named solver.jl in src (where solver is the name of the solver), and make sure that you define an Optimizer subtype struct Solver <: Optimizer end with appropriate fields, a default constructor with a keyword for each field, a state type that holds all variables that are (re)used throughout the iterative procedure, an initial_state that initializes such a state, and an update! method that does the actual work. 
Say you want to contribute a solver called Minim , then your src/minim.jl file would look something like struct Minim{IF, F<:Function, T} <: Optimizer alphaguess!::IF linesearch!::F minim_parameter::T end Minim(; alphaguess = LineSearches.InitialStatic(), linesearch = LineSearches.HagerZhang(), minim_parameter = 1.0) = Minim(alphaguess, linesearch, minim_parameter) mutable struct MinimState{T,N,G} x::AbstractArray{T,N} x_previous::AbstractArray{T,N} f_x_previous::T s::AbstractArray{T,N} @add_linesearch_fields() end function initial_state(method::Minim, options, d, initial_x) # prepare cache variables etc here end function update!(d, state::MinimState{T}, method::Minim) where T # code for Minim here false # should the procedure force quit? end","title":"Adding a solver"},{"location":"examples/generated/ipnewton_basics/","text":"Nonlinear constrained optimization Tip This example is also available as a Jupyter notebook: ipnewton_basics.ipynb The nonlinear constrained optimization interface in Optim assumes that the user can write the optimization problem in the following way. \\min_{x\\in\\mathbb{R}^n} f(x) \\quad \\text{such that}\\\\ l_x \\leq \\phantom{c(}x\\phantom{)} \\leq u_x \\\\ l_c \\leq c(x) \\leq u_c. For equality constraints on $x_j$ or $c(x)_j$ you set those particular entries of bounds to be equal, $l_j=u_j$. Likewise, setting $l_j=-\\infty$ or $u_j=\\infty$ means that the constraint is unbounded from below or above respectively. Constrained optimization with IPNewton We will go through examples on how to use the constraints interface with the interior-point Newton optimization algorithm IPNewton . Throughout these examples we work with the standard Rosenbrock function. 
The objective and its derivatives are given by fun(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2 function fun_grad!(g, x) g[1] = -2.0 * (1.0 - x[1]) - 400.0 * (x[2] - x[1]^2) * x[1] g[2] = 200.0 * (x[2] - x[1]^2) end function fun_hess!(h, x) h[1, 1] = 2.0 - 400.0 * x[2] + 1200.0 * x[1]^2 h[1, 2] = -400.0 * x[1] h[2, 1] = -400.0 * x[1] h[2, 2] = 200.0 end; Optimization interface To solve a constrained optimization problem we call the optimize method optimize(d::AbstractObjective, constraints::AbstractConstraints, initial_x::Tx, method::ConstrainedOptimizer, options::Options) We can create instances of AbstractObjective and AbstractConstraints using the types TwiceDifferentiable and TwiceDifferentiableConstraints from the package NLSolversBase.jl . Box minimzation We want to optimize the Rosenbrock function in the box $-0.5 \\leq x \\leq 0.5$, starting from the point $x_0=(0,0)$. Box constraints are defined using, for example, TwiceDifferentiableConstraints(lx, ux) . x0 = [0.0, 0.0] df = TwiceDifferentiable(fun, fun_grad!, fun_hess!, x0) lx = [-0.5, -0.5]; ux = [0.5, 0.5] dfc = TwiceDifferentiableConstraints(lx, ux) res = optimize(df, dfc, x0, IPNewton()) * Status: success * Candidate solution Minimizer: [5.00e-01, 2.50e-01] Minimum: 2.500000e-01 * Found with Algorithm: Interior Point Newton Initial Point: [0.00e+00, 0.00e+00] * Convergence measures |x - x'| = 4.39e-10 \u2270 0.0e+00 |x - x'|/|x'| = 8.79e-10 \u2270 0.0e+00 |f(x) - f(x')| = 0.00e+00 \u2264 0.0e+00 |f(x) - f(x')|/|f(x')| = 0.00e+00 \u2264 0.0e+00 |g(x)| = 1.00e+00 \u2270 1.0e-08 * Work counters Seconds run: 0 (vs limit Inf) Iterations: 43 f(x) calls: 68 \u2207f(x) calls: 68 If we only want to set lower bounds, use ux = fill(Inf, 2) ux = fill(Inf, 2) dfc = TwiceDifferentiableConstraints(lx, ux) clear!(df) res = optimize(df, dfc, x0, IPNewton()) * Status: success * Candidate solution Minimizer: [1.00e+00, 1.00e+00] Minimum: 7.987239e-20 * Found with Algorithm: Interior Point Newton Initial Point: 
[0.00e+00, 0.00e+00] * Convergence measures |x - x'| = 3.54e-10 \u2270 0.0e+00 |x - x'|/|x'| = 3.54e-10 \u2270 0.0e+00 |f(x) - f(x')| = 2.40e-19 \u2270 0.0e+00 |f(x) - f(x')|/|f(x')| = 3.00e+00 \u2270 0.0e+00 |g(x)| = 8.83e-09 \u2264 1.0e-08 * Work counters Seconds run: 0 (vs limit Inf) Iterations: 35 f(x) calls: 63 \u2207f(x) calls: 63 Defining \"unconstrained\" problems An unconstrained problem can be defined either by passing Inf bounds or empty arrays. Note that we must pass the correct type information to the empty lx and ux lx = fill(-Inf, 2); ux = fill(Inf, 2) dfc = TwiceDifferentiableConstraints(lx, ux) clear!(df) res = optimize(df, dfc, x0, IPNewton()) lx = Float64[]; ux = Float64[] dfc = TwiceDifferentiableConstraints(lx, ux) clear!(df) res = optimize(df, dfc, x0, IPNewton()) * Status: success * Candidate solution Minimizer: [1.00e+00, 1.00e+00] Minimum: 5.998937e-19 * Found with Algorithm: Interior Point Newton Initial Point: [0.00e+00, 0.00e+00] * Convergence measures |x - x'| = 1.50e-09 \u2270 0.0e+00 |x - x'|/|x'| = 1.50e-09 \u2270 0.0e+00 |f(x) - f(x')| = 1.80e-18 \u2270 0.0e+00 |f(x) - f(x')|/|f(x')| = 3.00e+00 \u2270 0.0e+00 |g(x)| = 7.92e-09 \u2264 1.0e-08 * Work counters Seconds run: 0 (vs limit Inf) Iterations: 34 f(x) calls: 63 \u2207f(x) calls: 63 Generic nonlinear constraints We now consider the Rosenbrock problem with a constraint on c(x)_1 = x_1^2 + x_2^2. We pass the information about the constraints to optimize by defining a vector function c(x) and its Jacobian J(x) . The Hessian information is treated differently, by considering the Lagrangian of the corresponding slack-variable transformed optimization problem. This is similar to how the CUTEst library works. Let $H_j(x)$ represent the Hessian of the $j$th component $c(x)_j$ of the generic constraints. and $\\lambda_j$ the corresponding dual variable in the Lagrangian. 
Then we want the constraint object to add the values of $H_j(x)$ to the Hessian of the objective, weighted by $\\lambda_j$. The Julian form for the supplied function $c(x)$ and the derivative information is then added in the following way. con_c!(c, x) = (c[1] = x[1]^2 + x[2]^2; c) function con_jacobian!(J, x) J[1,1] = 2*x[1] J[1,2] = 2*x[2] J end function con_h!(h, x, \u03bb) h[1,1] += \u03bb[1]*2 h[2,2] += \u03bb[1]*2 end; Note that con_h! adds the \u03bb -weighted Hessian value of each element of c(x) to the Hessian of fun . We can then optimize the Rosenbrock function inside the ball of radius $0.5$. lx = Float64[]; ux = Float64[] lc = [-Inf]; uc = [0.5^2] dfc = TwiceDifferentiableConstraints(con_c!, con_jacobian!, con_h!, lx, ux, lc, uc) res = optimize(df, dfc, x0, IPNewton()) * Status: success * Candidate solution Minimizer: [4.56e-01, 2.06e-01] Minimum: 2.966216e-01 * Found with Algorithm: Interior Point Newton Initial Point: [0.00e+00, 0.00e+00] * Convergence measures |x - x'| = 0.00e+00 \u2264 0.0e+00 |x - x'|/|x'| = 0.00e+00 \u2264 0.0e+00 |f(x) - f(x')| = 0.00e+00 \u2264 0.0e+00 |f(x) - f(x')|/|f(x')| = 0.00e+00 \u2264 0.0e+00 |g(x)| = 7.71e-01 \u2270 1.0e-08 * Work counters Seconds run: 0 (vs limit Inf) Iterations: 28 f(x) calls: 109 \u2207f(x) calls: 109 We can add a lower bound on the constraint, and thus optimize the objective on the annulus with inner and outer radii $0.1$ and $0.5$ respectively. 
lc = [0.1^2] dfc = TwiceDifferentiableConstraints(con_c!, con_jacobian!, con_h!, lx, ux, lc, uc) res = optimize(df, dfc, x0, IPNewton()) \u250c Warning: Initial guess is not an interior point \u2514 @ Optim ~/.julia/packages/Optim/EhyUl/src/multivariate/solvers/constrained/ipnewton/ipnewton.jl:111 Stacktrace: [1] initial_state(::IPNewton{typeof(Optim.backtrack_constrained_grad),Symbol}, ::Optim.Options{Float64,Nothing}, ::TwiceDifferentiable{Float64,Array{Float64,1},Array{Float64,2},Array{Float64,1}}, ::TwiceDifferentiableConstraints{typeof(Main.ex-ipnewton_basics.con_c!),typeof(Main.ex-ipnewton_basics.con_jacobian!),typeof(Main.ex-ipnewton_basics.con_h!),Float64}, ::Array{Float64,1}) at /home/travis/.julia/packages/Optim/EhyUl/src/multivariate/solvers/constrained/ipnewton/ipnewton.jl:112 [2] optimize(::TwiceDifferentiable{Float64,Array{Float64,1},Array{Float64,2},Array{Float64,1}}, ::TwiceDifferentiableConstraints{typeof(Main.ex-ipnewton_basics.con_c!),typeof(Main.ex-ipnewton_basics.con_jacobian!),typeof(Main.ex-ipnewton_basics.con_h!),Float64}, ::Array{Float64,1}, ::IPNewton{typeof(Optim.backtrack_constrained_grad),Symbol}, ::Optim.Options{Float64,Nothing}) at /home/travis/.julia/packages/Optim/EhyUl/src/multivariate/solvers/constrained/ipnewton/interior.jl:196 (repeats 2 times) [3] top-level scope at none:0 [4] eval at ./boot.jl:319 [inlined] [5] (::getfield(Documenter.Expanders, Symbol( ##8#10 )){Module,Expr})() at /home/travis/.julia/packages/Documenter/Qo3Yk/src/Expanders.jl:480 [6] cd(::getfield(Documenter.Expanders, Symbol( ##8#10 )){Module,Expr}, ::String) at ./file.jl:96 [7] #7 at /home/travis/.julia/packages/Documenter/Qo3Yk/src/Expanders.jl:479 [inlined] [8] (::getfield(Documenter.Utilities, Symbol( ##18#19 )){getfield(Documenter.Expanders, Symbol( ##7#9 )){Documenter.Documents.Page,Module,Expr},Base.PipeEndpoint,Base.PipeEndpoint,Pipe,Array{UInt8,1}})() at /home/travis/.julia/packages/Documenter/Qo3Yk/src/Utilities/Utilities.jl:591 [9] 
with_logstate(::getfield(Documenter.Utilities, Symbol( ##18#19 )){getfield(Documenter.Expanders, Symbol( ##7#9 )){Documenter.Documents.Page,Module,Expr},Base.PipeEndpoint,Base.PipeEndpoint,Pipe,Array{UInt8,1}}, ::Base.CoreLogging.LogState) at ./logging.jl:395 [10] with_logger(::Function, ::Logging.ConsoleLogger) at ./logging.jl:491 [11] withoutput at /home/travis/.julia/packages/Documenter/Qo3Yk/src/Utilities/Utilities.jl:589 [inlined] [12] runner(::Type{Documenter.Expanders.ExampleBlocks}, ::Markdown.Code, ::Documenter.Documents.Page, ::Documenter.Documents.Document) at /home/travis/.julia/packages/Documenter/Qo3Yk/src/Expanders.jl:478 [13] dispatch(::Type{Documenter.Expanders.ExpanderPipeline}, ::Markdown.Code, ::Vararg{Any,N} where N) at /home/travis/.julia/packages/Documenter/Qo3Yk/src/Selectors.jl:168 [14] expand(::Documenter.Documents.Document) at /home/travis/.julia/packages/Documenter/Qo3Yk/src/Expanders.jl:31 [15] runner(::Type{Documenter.Builder.ExpandTemplates}, ::Documenter.Documents.Document) at /home/travis/.julia/packages/Documenter/Qo3Yk/src/Builder.jl:178 [16] dispatch(::Type{Documenter.Builder.DocumentPipeline}, ::Documenter.Documents.Document) at /home/travis/.julia/packages/Documenter/Qo3Yk/src/Selectors.jl:168 [17] #2 at /home/travis/.julia/packages/Documenter/Qo3Yk/src/Documenter.jl:204 [inlined] [18] cd(::getfield(Documenter, Symbol( ##2#3 )){Documenter.Documents.Document}, ::String) at ./file.jl:96 [19] #makedocs#1 at /home/travis/.julia/packages/Documenter/Qo3Yk/src/Documenter.jl:203 [inlined] [20] (::getfield(Documenter, Symbol( #kw##makedocs )))(::NamedTuple{(:doctest,),Tuple{Bool}}, ::typeof(makedocs)) at ./none:0 [21] top-level scope at none:0 [22] include at ./boot.jl:317 [inlined] [23] include_relative(::Module, ::String) at ./loading.jl:1044 [24] include(::Module, ::String) at ./sysimg.jl:29 [25] exec_options(::Base.JLOptions) at ./client.jl:266 [26] _start() at ./client.jl:425 * Status: success * Candidate solution Minimizer: 
[4.56e-01, 2.06e-01] Minimum: 2.966216e-01 * Found with Algorithm: Interior Point Newton Initial Point: [0.00e+00, 0.00e+00] * Convergence measures |x - x'| = 0.00e+00 \u2264 0.0e+00 |x - x'|/|x'| = 0.00e+00 \u2264 0.0e+00 |f(x) - f(x')| = 0.00e+00 \u2264 0.0e+00 |f(x) - f(x')|/|f(x')| = 0.00e+00 \u2264 0.0e+00 |g(x)| = 7.71e-01 \u2270 1.0e-08 * Work counters Seconds run: 0 (vs limit Inf) Iterations: 34 f(x) calls: 158 \u2207f(x) calls: 158 Note that the algorithm warns that the Initial guess is not an interior point. IPNewton can often handle this, however, if the initial guess is such that c(x) = u_c , then the algorithm currently fails. We may fix this in the future. Multiple constraints The following example illustrates how to add an additional constraint. In particular, we add a constraint function c(x)_2 = x_2\\sin(x_1)-x_1 function con2_c!(c, x) c[1] = x[1]^2 + x[2]^2 ## First constraint c[2] = x[2]*sin(x[1])-x[1] ## Second constraint c end function con2_jacobian!(J, x) # First constraint J[1,1] = 2*x[1] J[1,2] = 2*x[2] # Second constraint J[2,1] = x[2]*cos(x[1])-1.0 J[2,2] = sin(x[1]) J end function con2_h!(h, x, \u03bb) # First constraint h[1,1] += \u03bb[1]*2 h[2,2] += \u03bb[1]*2 # Second constraint h[1,1] += \u03bb[2]*x[2]*-sin(x[1]) h[1,2] += \u03bb[2]*cos(x[1]) # Symmetrize h h[2,1] = h[1,2] h end; We generate the constraint objects and call IPNewton with initial guess $x_0 = (0.25,0.25)$. 
x0 = [0.25, 0.25] lc = [-Inf, 0.0]; uc = [0.5^2, 0.0] dfc = TwiceDifferentiableConstraints(con2_c!, con2_jacobian!, con2_h!, lx, ux, lc, uc) res = optimize(df, dfc, x0, IPNewton()) * Status: success * Candidate solution Minimizer: [-1.60e-19, -1.95e-18] Minimum: 1.000000e+00 * Found with Algorithm: Interior Point Newton Initial Point: [2.50e-01, 2.50e-01] * Convergence measures |x - x'| = 6.90e-10 \u2270 0.0e+00 |x - x'|/|x'| = 3.55e+08 \u2270 0.0e+00 |f(x) - f(x')| = 1.38e-09 \u2270 0.0e+00 |f(x) - f(x')|/|f(x')| = 1.38e-09 \u2270 0.0e+00 |g(x)| = 2.00e+00 \u2270 1.0e-08 * Work counters Seconds run: 0 (vs limit Inf) Iterations: 29 f(x) calls: 215 \u2207f(x) calls: 215 Plain Program Below follows a version of the program without any comments. The file is also available here: ipnewton_basics.jl using Optim, NLSolversBase #hide import NLSolversBase: clear! #hide fun(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2 function fun_grad!(g, x) g[1] = -2.0 * (1.0 - x[1]) - 400.0 * (x[2] - x[1]^2) * x[1] g[2] = 200.0 * (x[2] - x[1]^2) end function fun_hess!(h, x) h[1, 1] = 2.0 - 400.0 * x[2] + 1200.0 * x[1]^2 h[1, 2] = -400.0 * x[1] h[2, 1] = -400.0 * x[1] h[2, 2] = 200.0 end; x0 = [0.0, 0.0] df = TwiceDifferentiable(fun, fun_grad!, fun_hess!, x0) lx = [-0.5, -0.5]; ux = [0.5, 0.5] dfc = TwiceDifferentiableConstraints(lx, ux) res = optimize(df, dfc, x0, IPNewton()) ux = fill(Inf, 2) dfc = TwiceDifferentiableConstraints(lx, ux) clear!(df) res = optimize(df, dfc, x0, IPNewton()) lx = fill(-Inf, 2); ux = fill(Inf, 2) dfc = TwiceDifferentiableConstraints(lx, ux) clear!(df) res = optimize(df, dfc, x0, IPNewton()) lx = Float64[]; ux = Float64[] dfc = TwiceDifferentiableConstraints(lx, ux) clear!(df) res = optimize(df, dfc, x0, IPNewton()) con_c!(c, x) = (c[1] = x[1]^2 + x[2]^2; c) function con_jacobian!(J, x) J[1,1] = 2*x[1] J[1,2] = 2*x[2] J end function con_h!(h, x, \u03bb) h[1,1] += \u03bb[1]*2 h[2,2] += \u03bb[1]*2 end; lx = Float64[]; ux = Float64[] lc = [-Inf]; uc = [0.5^2] 
dfc = TwiceDifferentiableConstraints(con_c!, con_jacobian!, con_h!, lx, ux, lc, uc) res = optimize(df, dfc, x0, IPNewton()) lc = [0.1^2] dfc = TwiceDifferentiableConstraints(con_c!, con_jacobian!, con_h!, lx, ux, lc, uc) res = optimize(df, dfc, x0, IPNewton()) function con2_c!(c, x) c[1] = x[1]^2 + x[2]^2 ## First constraint c[2] = x[2]*sin(x[1])-x[1] ## Second constraint c end function con2_jacobian!(J, x) # First constraint J[1,1] = 2*x[1] J[1,2] = 2*x[2] # Second constraint J[2,1] = x[2]*cos(x[1])-1.0 J[2,2] = sin(x[1]) J end function con2_h!(h, x, \u03bb) # First constraint h[1,1] += \u03bb[1]*2 h[2,2] += \u03bb[1]*2 # Second constraint h[1,1] += \u03bb[2]*x[2]*-sin(x[1]) h[1,2] += \u03bb[2]*cos(x[1]) # Symmetrize h h[2,1] = h[1,2] h end; x0 = [0.25, 0.25] lc = [-Inf, 0.0]; uc = [0.5^2, 0.0] dfc = TwiceDifferentiableConstraints(con2_c!, con2_jacobian!, con2_h!, lx, ux, lc, uc) res = optimize(df, dfc, x0, IPNewton()) # This file was generated using Literate.jl, https://github.com/fredrikekre/Literate.jl This page was generated using Literate.jl .","title":"Interior point Newton"},{"location":"examples/generated/ipnewton_basics/#nonlinear-constrained-optimization","text":"Tip This example is also available as a Jupyter notebook: ipnewton_basics.ipynb The nonlinear constrained optimization interface in Optim assumes that the user can write the optimization problem in the following way. \\min_{x\\in\\mathbb{R}^n} f(x) \\quad \\text{such that}\\\\ l_x \\leq \\phantom{c(}x\\phantom{)} \\leq u_x \\\\ l_c \\leq c(x) \\leq u_c. For equality constraints on $x_j$ or $c(x)_j$ you set those particular entries of bounds to be equal, $l_j=u_j$. 
Likewise, setting $l_j=-\\infty$ or $u_j=\\infty$ means that the constraint is unbounded from below or above respectively.","title":"Nonlinear constrained optimization"},{"location":"examples/generated/ipnewton_basics/#constrained-optimization-with-ipnewton","text":"We will go through examples on how to use the constraints interface with the interior-point Newton optimization algorithm IPNewton . Throughout these examples we work with the standard Rosenbrock function. The objective and its derivatives are given by fun(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2 function fun_grad!(g, x) g[1] = -2.0 * (1.0 - x[1]) - 400.0 * (x[2] - x[1]^2) * x[1] g[2] = 200.0 * (x[2] - x[1]^2) end function fun_hess!(h, x) h[1, 1] = 2.0 - 400.0 * x[2] + 1200.0 * x[1]^2 h[1, 2] = -400.0 * x[1] h[2, 1] = -400.0 * x[1] h[2, 2] = 200.0 end;","title":"Constrained optimization with IPNewton"},{"location":"examples/generated/ipnewton_basics/#optimization-interface","text":"To solve a constrained optimization problem we call the optimize method optimize(d::AbstractObjective, constraints::AbstractConstraints, initial_x::Tx, method::ConstrainedOptimizer, options::Options) We can create instances of AbstractObjective and AbstractConstraints using the types TwiceDifferentiable and TwiceDifferentiableConstraints from the package NLSolversBase.jl .","title":"Optimization interface"},{"location":"examples/generated/ipnewton_basics/#box-minimzation","text":"We want to optimize the Rosenbrock function in the box $-0.5 \\leq x \\leq 0.5$, starting from the point $x_0=(0,0)$. Box constraints are defined using, for example, TwiceDifferentiableConstraints(lx, ux) . 
x0 = [0.0, 0.0] df = TwiceDifferentiable(fun, fun_grad!, fun_hess!, x0) lx = [-0.5, -0.5]; ux = [0.5, 0.5] dfc = TwiceDifferentiableConstraints(lx, ux) res = optimize(df, dfc, x0, IPNewton()) * Status: success * Candidate solution Minimizer: [5.00e-01, 2.50e-01] Minimum: 2.500000e-01 * Found with Algorithm: Interior Point Newton Initial Point: [0.00e+00, 0.00e+00] * Convergence measures |x - x'| = 4.39e-10 \u2270 0.0e+00 |x - x'|/|x'| = 8.79e-10 \u2270 0.0e+00 |f(x) - f(x')| = 0.00e+00 \u2264 0.0e+00 |f(x) - f(x')|/|f(x')| = 0.00e+00 \u2264 0.0e+00 |g(x)| = 1.00e+00 \u2270 1.0e-08 * Work counters Seconds run: 0 (vs limit Inf) Iterations: 43 f(x) calls: 68 \u2207f(x) calls: 68 If we only want to set lower bounds, use ux = fill(Inf, 2) ux = fill(Inf, 2) dfc = TwiceDifferentiableConstraints(lx, ux) clear!(df) res = optimize(df, dfc, x0, IPNewton()) * Status: success * Candidate solution Minimizer: [1.00e+00, 1.00e+00] Minimum: 7.987239e-20 * Found with Algorithm: Interior Point Newton Initial Point: [0.00e+00, 0.00e+00] * Convergence measures |x - x'| = 3.54e-10 \u2270 0.0e+00 |x - x'|/|x'| = 3.54e-10 \u2270 0.0e+00 |f(x) - f(x')| = 2.40e-19 \u2270 0.0e+00 |f(x) - f(x')|/|f(x')| = 3.00e+00 \u2270 0.0e+00 |g(x)| = 8.83e-09 \u2264 1.0e-08 * Work counters Seconds run: 0 (vs limit Inf) Iterations: 35 f(x) calls: 63 \u2207f(x) calls: 63","title":"Box minimzation"},{"location":"examples/generated/ipnewton_basics/#defining-unconstrained-problems","text":"An unconstrained problem can be defined either by passing Inf bounds or empty arrays. 
Note that we must pass the correct type information to the empty lx and ux: lx = fill(-Inf, 2); ux = fill(Inf, 2) dfc = TwiceDifferentiableConstraints(lx, ux) clear!(df) res = optimize(df, dfc, x0, IPNewton()) lx = Float64[]; ux = Float64[] dfc = TwiceDifferentiableConstraints(lx, ux) clear!(df) res = optimize(df, dfc, x0, IPNewton()) * Status: success * Candidate solution Minimizer: [1.00e+00, 1.00e+00] Minimum: 5.998937e-19 * Found with Algorithm: Interior Point Newton Initial Point: [0.00e+00, 0.00e+00] * Convergence measures |x - x'| = 1.50e-09 \u2270 0.0e+00 |x - x'|/|x'| = 1.50e-09 \u2270 0.0e+00 |f(x) - f(x')| = 1.80e-18 \u2270 0.0e+00 |f(x) - f(x')|/|f(x')| = 3.00e+00 \u2270 0.0e+00 |g(x)| = 7.92e-09 \u2264 1.0e-08 * Work counters Seconds run: 0 (vs limit Inf) Iterations: 34 f(x) calls: 63 \u2207f(x) calls: 63","title":"Defining \"unconstrained\" problems"},{"location":"examples/generated/ipnewton_basics/#generic-nonlinear-constraints","text":"We now consider the Rosenbrock problem with a constraint on $c(x)_1 = x_1^2 + x_2^2$. We pass the information about the constraints to optimize by defining a vector function c(x) and its Jacobian J(x) . The Hessian information is treated differently, by considering the Lagrangian of the corresponding slack-variable transformed optimization problem. This is similar to how the CUTEst library works. Let $H_j(x)$ represent the Hessian of the $j$th component $c(x)_j$ of the generic constraints, and $\\lambda_j$ the corresponding dual variable in the Lagrangian. Then we want the constraint object to add the values of $H_j(x)$ to the Hessian of the objective, weighted by $\\lambda_j$. The Julian form for the supplied function $c(x)$ and the derivative information is then added in the following way. con_c!(c, x) = (c[1] = x[1]^2 + x[2]^2; c) function con_jacobian!(J, x) J[1,1] = 2*x[1] J[1,2] = 2*x[2] J end function con_h!(h, x, \u03bb) h[1,1] += \u03bb[1]*2 h[2,2] += \u03bb[1]*2 end; Note that con_h! 
adds the \u03bb -weighted Hessian value of each element of c(x) to the Hessian of fun . We can then optimize the Rosenbrock function inside the ball of radius $0.5$. lx = Float64[]; ux = Float64[] lc = [-Inf]; uc = [0.5^2] dfc = TwiceDifferentiableConstraints(con_c!, con_jacobian!, con_h!, lx, ux, lc, uc) res = optimize(df, dfc, x0, IPNewton()) * Status: success * Candidate solution Minimizer: [4.56e-01, 2.06e-01] Minimum: 2.966216e-01 * Found with Algorithm: Interior Point Newton Initial Point: [0.00e+00, 0.00e+00] * Convergence measures |x - x'| = 0.00e+00 \u2264 0.0e+00 |x - x'|/|x'| = 0.00e+00 \u2264 0.0e+00 |f(x) - f(x')| = 0.00e+00 \u2264 0.0e+00 |f(x) - f(x')|/|f(x')| = 0.00e+00 \u2264 0.0e+00 |g(x)| = 7.71e-01 \u2270 1.0e-08 * Work counters Seconds run: 0 (vs limit Inf) Iterations: 28 f(x) calls: 109 \u2207f(x) calls: 109 We can add a lower bound on the constraint, and thus optimize the objective on the annulus with inner and outer radii $0.1$ and $0.5$ respectively. lc = [0.1^2] dfc = TwiceDifferentiableConstraints(con_c!, con_jacobian!, con_h!, lx, ux, lc, uc) res = optimize(df, dfc, x0, IPNewton()) \u250c Warning: Initial guess is not an interior point \u2514 @ Optim ~/.julia/packages/Optim/EhyUl/src/multivariate/solvers/constrained/ipnewton/ipnewton.jl:111 * Status: success * Candidate solution Minimizer: [4.56e-01, 2.06e-01] Minimum: 2.966216e-01 * Found with Algorithm: Interior Point Newton Initial Point: [0.00e+00, 0.00e+00] * Convergence measures |x - x'| = 0.00e+00 \u2264 0.0e+00 |x - x'|/|x'| = 0.00e+00 \u2264 0.0e+00 |f(x) - f(x')| = 0.00e+00 \u2264 0.0e+00 |f(x) - f(x')|/|f(x')| = 0.00e+00 \u2264 0.0e+00 |g(x)| = 7.71e-01 \u2270 1.0e-08 * Work counters Seconds run: 0 (vs limit Inf) Iterations: 34 f(x) calls: 158 \u2207f(x) calls: 158 Note that the algorithm warns that the Initial guess is not an interior point. IPNewton can often handle this, however, if the initial guess is such that c(x) = u_c , then the algorithm currently fails. We may fix this in the future.","title":"Generic nonlinear constraints"},{"location":"examples/generated/ipnewton_basics/#multiple-constraints","text":"The following example illustrates how to add an additional constraint. 
In particular, we add a constraint function c(x)_2 = x_2\\sin(x_1)-x_1 function con2_c!(c, x) c[1] = x[1]^2 + x[2]^2 ## First constraint c[2] = x[2]*sin(x[1])-x[1] ## Second constraint c end function con2_jacobian!(J, x) # First constraint J[1,1] = 2*x[1] J[1,2] = 2*x[2] # Second constraint J[2,1] = x[2]*cos(x[1])-1.0 J[2,2] = sin(x[1]) J end function con2_h!(h, x, \u03bb) # First constraint h[1,1] += \u03bb[1]*2 h[2,2] += \u03bb[1]*2 # Second constraint h[1,1] += \u03bb[2]*x[2]*-sin(x[1]) h[1,2] += \u03bb[2]*cos(x[1]) # Symmetrize h h[2,1] = h[1,2] h end; We generate the constraint objects and call IPNewton with initial guess $x_0 = (0.25,0.25)$. x0 = [0.25, 0.25] lc = [-Inf, 0.0]; uc = [0.5^2, 0.0] dfc = TwiceDifferentiableConstraints(con2_c!, con2_jacobian!, con2_h!, lx, ux, lc, uc) res = optimize(df, dfc, x0, IPNewton()) * Status: success * Candidate solution Minimizer: [-1.60e-19, -1.95e-18] Minimum: 1.000000e+00 * Found with Algorithm: Interior Point Newton Initial Point: [2.50e-01, 2.50e-01] * Convergence measures |x - x'| = 6.90e-10 \u2270 0.0e+00 |x - x'|/|x'| = 3.55e+08 \u2270 0.0e+00 |f(x) - f(x')| = 1.38e-09 \u2270 0.0e+00 |f(x) - f(x')|/|f(x')| = 1.38e-09 \u2270 0.0e+00 |g(x)| = 2.00e+00 \u2270 1.0e-08 * Work counters Seconds run: 0 (vs limit Inf) Iterations: 29 f(x) calls: 215 \u2207f(x) calls: 215","title":"Multiple constraints"},{"location":"examples/generated/ipnewton_basics/#plain-program","text":"Below follows a version of the program without any comments. The file is also available here: ipnewton_basics.jl using Optim, NLSolversBase #hide import NLSolversBase: clear! 
#hide fun(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2 function fun_grad!(g, x) g[1] = -2.0 * (1.0 - x[1]) - 400.0 * (x[2] - x[1]^2) * x[1] g[2] = 200.0 * (x[2] - x[1]^2) end function fun_hess!(h, x) h[1, 1] = 2.0 - 400.0 * x[2] + 1200.0 * x[1]^2 h[1, 2] = -400.0 * x[1] h[2, 1] = -400.0 * x[1] h[2, 2] = 200.0 end; x0 = [0.0, 0.0] df = TwiceDifferentiable(fun, fun_grad!, fun_hess!, x0) lx = [-0.5, -0.5]; ux = [0.5, 0.5] dfc = TwiceDifferentiableConstraints(lx, ux) res = optimize(df, dfc, x0, IPNewton()) ux = fill(Inf, 2) dfc = TwiceDifferentiableConstraints(lx, ux) clear!(df) res = optimize(df, dfc, x0, IPNewton()) lx = fill(-Inf, 2); ux = fill(Inf, 2) dfc = TwiceDifferentiableConstraints(lx, ux) clear!(df) res = optimize(df, dfc, x0, IPNewton()) lx = Float64[]; ux = Float64[] dfc = TwiceDifferentiableConstraints(lx, ux) clear!(df) res = optimize(df, dfc, x0, IPNewton()) con_c!(c, x) = (c[1] = x[1]^2 + x[2]^2; c) function con_jacobian!(J, x) J[1,1] = 2*x[1] J[1,2] = 2*x[2] J end function con_h!(h, x, \u03bb) h[1,1] += \u03bb[1]*2 h[2,2] += \u03bb[1]*2 end; lx = Float64[]; ux = Float64[] lc = [-Inf]; uc = [0.5^2] dfc = TwiceDifferentiableConstraints(con_c!, con_jacobian!, con_h!, lx, ux, lc, uc) res = optimize(df, dfc, x0, IPNewton()) lc = [0.1^2] dfc = TwiceDifferentiableConstraints(con_c!, con_jacobian!, con_h!, lx, ux, lc, uc) res = optimize(df, dfc, x0, IPNewton()) function con2_c!(c, x) c[1] = x[1]^2 + x[2]^2 ## First constraint c[2] = x[2]*sin(x[1])-x[1] ## Second constraint c end function con2_jacobian!(J, x) # First constraint J[1,1] = 2*x[1] J[1,2] = 2*x[2] # Second constraint J[2,1] = x[2]*cos(x[1])-1.0 J[2,2] = sin(x[1]) J end function con2_h!(h, x, \u03bb) # First constraint h[1,1] += \u03bb[1]*2 h[2,2] += \u03bb[1]*2 # Second constraint h[1,1] += \u03bb[2]*x[2]*-sin(x[1]) h[1,2] += \u03bb[2]*cos(x[1]) # Symmetrize h h[2,1] = h[1,2] h end; x0 = [0.25, 0.25] lc = [-Inf, 0.0]; uc = [0.5^2, 0.0] dfc = TwiceDifferentiableConstraints(con2_c!, 
con2_jacobian!, con2_h!, lx, ux, lc, uc) res = optimize(df, dfc, x0, IPNewton()) # This file was generated using Literate.jl, https://github.com/fredrikekre/Literate.jl This page was generated using Literate.jl .","title":"Plain Program"},{"location":"examples/generated/maxlikenlm/","text":"Maximum Likelihood Estimation: The Normal Linear Model Tip This example is also available as a Jupyter notebook: maxlikenlm.ipynb The following tutorial will introduce maximum likelihood estimation in Julia for the normal linear model. The normal linear model (sometimes referred to as the OLS model) is the workhorse of regression modeling and is utilized across a number of diverse fields. In this tutorial, we will utilize simulated data to demonstrate how Julia can be used to recover the parameters of interest. The first step is to load the Optim package along with NLSolversBase: using Optim, NLSolversBase, Random using LinearAlgebra: diag Random.seed!(0); # Fix random seed generator for reproducibility Tip Add Optim with the following command at the Julia command prompt: Pkg.add(\"Optim\") The first item that needs to be addressed is the data generating process or DGP. The following code will produce data from a normal linear model: n = 500 # Number of observations nvar = 2 # Number of variables \u03b2 = ones(nvar) * 3.0 # True coefficients x = [ones(n) randn(n, nvar - 1)] # X matrix of explanatory variables plus constant \u03b5 = randn(n) * 0.5 # Errors with standard deviation 0.5 y = x * \u03b2 + \u03b5; # Generate Data In the above example, we have 500 observations, one explanatory variable plus an intercept (so nvar = 2 coefficients), an error standard deviation equal to 0.5, and coefficients equal to 3.0. All of these can be changed by the user. Since we know the true values of these parameters, we should recover values close to them when we maximize the likelihood function. The next step in our tutorial is to define a Julia function for the likelihood function. 
The following function defines the negative log likelihood for the normal linear model (negated because Optim minimizes): function Log_Likelihood(X, Y, \u03b2, log_\u03c3) \u03c3 = exp(log_\u03c3) llike = -n/2*log(2\u03c0) - n/2* log(\u03c3^2) - (sum((Y - X * \u03b2).^2) / (2\u03c3^2)) llike = -llike end Log_Likelihood (generic function with 1 method) The log likelihood function accepts 4 inputs: the matrix of explanatory variables (X), the dependent variable (Y), the \u03b2's, and the log of the error standard deviation (log_\u03c3). Note that we exponentiate log_\u03c3 in the first line of the function body because the standard deviation cannot be negative and we want to avoid this situation when maximizing the likelihood. The next step in our tutorial is to optimize our function. We first use the TwiceDifferentiable command in order to obtain the Hessian matrix later on, which will be used to help form t-statistics: func = TwiceDifferentiable(vars -> Log_Likelihood(x, y, vars[1:nvar], vars[nvar + 1]), ones(nvar+1); autodiff=:forward); The above statement wraps an anonymous function of a single parameter vector vars, which calls Log_Likelihood with 4 inputs: the x matrix, the dependent variable y, the vector of \u03b2's, and the log of the error standard deviation. The vars[1:nvar] is how we pass the vector of \u03b2's and the vars[nvar + 1] is how we pass the log of the error standard deviation. You can think of this as a vector of parameters with the first 2 being \u03b2's and the last one being the log of the error standard deviation. The ones(nvar+1) are the starting values for the parameters and the autodiff=:forward keyword requests forward mode automatic differentiation. 
The actual optimization of the likelihood function is accomplished with the following command: opt = optimize(func, ones(nvar+1)) * Status: success * Candidate solution Minimizer: [3.00e+00, 2.96e+00, -6.49e-01] Minimum: 3.851229e+02 * Found with Algorithm: Newton's Method Initial Point: [1.00e+00, 1.00e+00, 1.00e+00] * Convergence measures |x - x'| = 2.62e-08 \u2270 0.0e+00 |x - x'|/|x'| = 8.72e-09 \u2270 0.0e+00 |f(x) - f(x')| = 5.12e-13 \u2270 0.0e+00 |f(x) - f(x')|/|f(x')| = 1.33e-15 \u2270 0.0e+00 |g(x)| = 3.41e-13 \u2264 1.0e-08 * Work counters Seconds run: 0 (vs limit Inf) Iterations: 7 f(x) calls: 31 \u2207f(x) calls: 31 \u2207\u00b2f(x) calls: 7 The first input to the command is the function we wish to optimize and the second input is the vector of starting values. After a brief period of time, you should see the output of the optimization routine, with the parameter estimates being very close to our simulated values. The optimization routine stores several quantities and we can obtain the maximum likelihood estimates with the following command: parameters = Optim.minimizer(opt) 3-element Array{Float64,1}: 3.002788633849947 2.964549617572727 -0.648692780562844 !!! 
Note Fieldnames for all of the quantities can be obtained with the following command: fieldnames(typeof(opt)) Since we parameterized our likelihood in terms of the log of the standard deviation, we need to exponentiate the estimate to get back to the original scale: parameters[nvar+1] = exp(parameters[nvar+1]) 0.5227286513837306 In order to obtain the correct Hessian matrix, we have to \"push\" the actual parameter values that maximize the likelihood function since the TwiceDifferentiable command uses the next to last values to calculate the Hessian: numerical_hessian = hessian!(func,parameters) 3\u00d73 Array{Float64,2}: 175.766 -12.0877 5.67464e-14 -12.0877 182.437 5.88344e-15 5.67464e-14 5.88344e-15 96.0542 We can now invert our Hessian matrix to obtain the variance-covariance matrix: var_cov_matrix = inv(numerical_hessian) 3\u00d73 Array{Float64,2}: 0.00571544 0.000378687 -3.39973e-18 0.000378687 0.00550643 -5.60994e-19 -3.39973e-18 -5.60994e-19 0.0104108 In this example, we are only interested in the statistical significance of the coefficient estimates so we obtain those with the following command: \u03b2 = parameters[1:nvar] 2-element Array{Float64,1}: 3.002788633849947 2.964549617572727 We now need to obtain those elements of the variance-covariance matrix needed to obtain our t-statistics, and we can do this with the following commands: temp = diag(var_cov_matrix) temp1 = temp[1:nvar] 2-element Array{Float64,1}: 0.005715441191953887 0.005506430533329349 The t-statistics are formed by dividing element-by-element the coefficients by their standard errors, or the square root of the diagonal elements of the variance-covariance matrix: t_stats = \u03b2./sqrt.(temp1) 2-element Array{Float64,1}: 39.719144251161474 39.95063081460274 From here, one may examine other statistics of interest using the output from the optimization routine. Plain Program Below follows a version of the program without any comments. 
The file is also available here: maxlikenlm.jl using Optim, NLSolversBase, Random using LinearAlgebra: diag Random.seed!(0); # Fix random seed generator for reproducibility n = 500 # Number of observations nvar = 2 # Number of variables \u03b2 = ones(nvar) * 3.0 # True coefficients x = [ones(n) randn(n, nvar - 1)] # X matrix of explanatory variables plus constant \u03b5 = randn(n) * 0.5 # Errors with standard deviation 0.5 y = x * \u03b2 + \u03b5; # Generate Data function Log_Likelihood(X, Y, \u03b2, log_\u03c3) \u03c3 = exp(log_\u03c3) llike = -n/2*log(2\u03c0) - n/2* log(\u03c3^2) - (sum((Y - X * \u03b2).^2) / (2\u03c3^2)) llike = -llike end func = TwiceDifferentiable(vars -> Log_Likelihood(x, y, vars[1:nvar], vars[nvar + 1]), ones(nvar+1); autodiff=:forward); opt = optimize(func, ones(nvar+1)) parameters = Optim.minimizer(opt) parameters[nvar+1] = exp(parameters[nvar+1]) numerical_hessian = hessian!(func,parameters) var_cov_matrix = inv(numerical_hessian) \u03b2 = parameters[1:nvar] temp = diag(var_cov_matrix) temp1 = temp[1:nvar] t_stats = \u03b2./sqrt.(temp1) # This file was generated using Literate.jl, https://github.com/fredrikekre/Literate.jl This page was generated using Literate.jl .","title":"Maximum likelihood estimation"},{"location":"examples/generated/maxlikenlm/#maximum-likelihood-estimation-the-normal-linear-model","text":"Tip This example is also available as a Jupyter notebook: maxlikenlm.ipynb The following tutorial will introduce maximum likelihood estimation in Julia for the normal linear model. The normal linear model (sometimes referred to as the OLS model) is the workhorse of regression modeling and is utilized across a number of diverse fields. In this tutorial, we will utilize simulated data to demonstrate how Julia can be used to recover the parameters of interest. 
The first step is to load the Optim package along with NLSolversBase: using Optim, NLSolversBase, Random using LinearAlgebra: diag Random.seed!(0); # Fix random seed generator for reproducibility Tip Add Optim with the following command at the Julia command prompt: Pkg.add(\"Optim\") The first item that needs to be addressed is the data generating process or DGP. The following code will produce data from a normal linear model: n = 500 # Number of observations nvar = 2 # Number of variables \u03b2 = ones(nvar) * 3.0 # True coefficients x = [ones(n) randn(n, nvar - 1)] # X matrix of explanatory variables plus constant \u03b5 = randn(n) * 0.5 # Errors with standard deviation 0.5 y = x * \u03b2 + \u03b5; # Generate Data In the above example, we have 500 observations, one explanatory variable plus an intercept (so nvar = 2 coefficients), an error standard deviation equal to 0.5, and coefficients equal to 3.0. All of these can be changed by the user. Since we know the true values of these parameters, we should recover values close to them when we maximize the likelihood function. The next step in our tutorial is to define a Julia function for the likelihood function. The following function defines the negative log likelihood for the normal linear model (negated because Optim minimizes): function Log_Likelihood(X, Y, \u03b2, log_\u03c3) \u03c3 = exp(log_\u03c3) llike = -n/2*log(2\u03c0) - n/2* log(\u03c3^2) - (sum((Y - X * \u03b2).^2) / (2\u03c3^2)) llike = -llike end Log_Likelihood (generic function with 1 method) The log likelihood function accepts 4 inputs: the matrix of explanatory variables (X), the dependent variable (Y), the \u03b2's, and the log of the error standard deviation (log_\u03c3). Note that we exponentiate log_\u03c3 in the first line of the function body because the standard deviation cannot be negative and we want to avoid this situation when maximizing the likelihood. The next step in our tutorial is to optimize our function. 
We first use the TwiceDifferentiable command in order to obtain the Hessian matrix later on, which will be used to help form t-statistics: func = TwiceDifferentiable(vars -> Log_Likelihood(x, y, vars[1:nvar], vars[nvar + 1]), ones(nvar+1); autodiff=:forward); The above statement wraps an anonymous function of a single parameter vector vars, which calls Log_Likelihood with 4 inputs: the x matrix, the dependent variable y, the vector of \u03b2's, and the log of the error standard deviation. The vars[1:nvar] is how we pass the vector of \u03b2's and the vars[nvar + 1] is how we pass the log of the error standard deviation. You can think of this as a vector of parameters with the first 2 being \u03b2's and the last one being the log of the error standard deviation. The ones(nvar+1) are the starting values for the parameters and the autodiff=:forward keyword requests forward mode automatic differentiation. The actual optimization of the likelihood function is accomplished with the following command: opt = optimize(func, ones(nvar+1)) * Status: success * Candidate solution Minimizer: [3.00e+00, 2.96e+00, -6.49e-01] Minimum: 3.851229e+02 * Found with Algorithm: Newton's Method Initial Point: [1.00e+00, 1.00e+00, 1.00e+00] * Convergence measures |x - x'| = 2.62e-08 \u2270 0.0e+00 |x - x'|/|x'| = 8.72e-09 \u2270 0.0e+00 |f(x) - f(x')| = 5.12e-13 \u2270 0.0e+00 |f(x) - f(x')|/|f(x')| = 1.33e-15 \u2270 0.0e+00 |g(x)| = 3.41e-13 \u2264 1.0e-08 * Work counters Seconds run: 0 (vs limit Inf) Iterations: 7 f(x) calls: 31 \u2207f(x) calls: 31 \u2207\u00b2f(x) calls: 7 The first input to the command is the function we wish to optimize and the second input is the vector of starting values. After a brief period of time, you should see the output of the optimization routine, with the parameter estimates being very close to our simulated values. The optimization routine stores several quantities and we can obtain the maximum likelihood estimates with the following command: parameters = Optim.minimizer(opt) 3-element Array{Float64,1}: 3.002788633849947 2.964549617572727 -0.648692780562844 !!! 
Note Fieldnames for all of the quantities can be obtained with the following command: fieldnames(typeof(opt)) Since we parameterized our likelihood in terms of the log of the standard deviation, we need to exponentiate the estimate to get back to the original scale: parameters[nvar+1] = exp(parameters[nvar+1]) 0.5227286513837306 In order to obtain the correct Hessian matrix, we have to \"push\" the actual parameter values that maximize the likelihood function since the TwiceDifferentiable command uses the next to last values to calculate the Hessian: numerical_hessian = hessian!(func,parameters) 3\u00d73 Array{Float64,2}: 175.766 -12.0877 5.67464e-14 -12.0877 182.437 5.88344e-15 5.67464e-14 5.88344e-15 96.0542 We can now invert our Hessian matrix to obtain the variance-covariance matrix: var_cov_matrix = inv(numerical_hessian) 3\u00d73 Array{Float64,2}: 0.00571544 0.000378687 -3.39973e-18 0.000378687 0.00550643 -5.60994e-19 -3.39973e-18 -5.60994e-19 0.0104108 In this example, we are only interested in the statistical significance of the coefficient estimates so we obtain those with the following command: \u03b2 = parameters[1:nvar] 2-element Array{Float64,1}: 3.002788633849947 2.964549617572727 We now need to obtain those elements of the variance-covariance matrix needed to obtain our t-statistics, and we can do this with the following commands: temp = diag(var_cov_matrix) temp1 = temp[1:nvar] 2-element Array{Float64,1}: 0.005715441191953887 0.005506430533329349 The t-statistics are formed by dividing element-by-element the coefficients by their standard errors, or the square root of the diagonal elements of the variance-covariance matrix: t_stats = \u03b2./sqrt.(temp1) 2-element Array{Float64,1}: 39.719144251161474 39.95063081460274 From here, one may examine other statistics of interest using the output from the optimization routine.","title":"Maximum Likelihood Estimation: The Normal Linear Model"},{"location":"examples/generated/maxlikenlm/#plain-program","text":"Below follows a version 
of the program without any comments. The file is also available here: maxlikenlm.jl using Optim, NLSolversBase, Random using LinearAlgebra: diag Random.seed!(0); # Fix random seed generator for reproducibility n = 500 # Number of observations nvar = 2 # Number of variables \u03b2 = ones(nvar) * 3.0 # True coefficients x = [ones(n) randn(n, nvar - 1)] # X matrix of explanatory variables plus constant \u03b5 = randn(n) * 0.5 # Errors with standard deviation 0.5 y = x * \u03b2 + \u03b5; # Generate Data function Log_Likelihood(X, Y, \u03b2, log_\u03c3) \u03c3 = exp(log_\u03c3) llike = -n/2*log(2\u03c0) - n/2* log(\u03c3^2) - (sum((Y - X * \u03b2).^2) / (2\u03c3^2)) llike = -llike end func = TwiceDifferentiable(vars -> Log_Likelihood(x, y, vars[1:nvar], vars[nvar + 1]), ones(nvar+1); autodiff=:forward); opt = optimize(func, ones(nvar+1)) parameters = Optim.minimizer(opt) parameters[nvar+1] = exp(parameters[nvar+1]) numerical_hessian = hessian!(func,parameters) var_cov_matrix = inv(numerical_hessian) \u03b2 = parameters[1:nvar] temp = diag(var_cov_matrix) temp1 = temp[1:nvar] t_stats = \u03b2./sqrt.(temp1) # This file was generated using Literate.jl, https://github.com/fredrikekre/Literate.jl This page was generated using Literate.jl .","title":"Plain Program"},{"location":"examples/generated/rasch/","text":"Conditional Maximum Likelihood for the Rasch Model Tip This example is also available as a Jupyter notebook: $rasch.ipynb$ The Rasch model is used in psychometrics as a model for assessment data such as student responses to a standardized test. Let $X_{pi}$ be the response accuracy of student $p$ to item $i$ where $X_{pi}=1$ if the item was answered correctly and $X_{pi}=0$ otherwise for $p=1,\\ldots,n$ and $i=1,\\ldots,m$. 
The model for this accuracy is P(\\mathbf{X}_{p}=\\mathbf{x}_{p}|\\xi_p, \\mathbf\\epsilon) = \\prod_{i=1}^m \\dfrac{(\\xi_p \\epsilon_i)^{x_{pi}}}{1 + \\xi_p\\epsilon_i} where $\\xi_p > 0$ is the latent ability of person $p$ and $\\epsilon_i > 0$ is the difficulty of item $i$. We simulate data from this model: Random.seed!(123) n = 1000 m = 5 theta = randn(n) delta = randn(m) r = zeros(n) s = zeros(m) for i in 1:n p = exp.(theta[i] .- delta) ./ (1.0 .+ exp.(theta[i] .- delta)) for j in 1:m if rand() < p[j] ##correct r[i] += 1 s[j] += 1 end end end f = [sum(r.==j) for j in 1:m]; Since the number of parameters increases with sample size, standard maximum likelihood will not provide us with consistent estimates. Instead we consider the conditional likelihood. It can be shown that the Rasch model is an exponential family model and that the sum score $r_p = \\sum_{i} x_{pi}$ is the sufficient statistic for $\\xi_p$. If we condition on the sum score we should be able to eliminate $\\xi_p$. Indeed, with a bit of algebra we can show P(\\mathbf{X}_p = \\mathbf{x}_p | r_p, \\mathbf\\epsilon) = \\dfrac{\\prod_{i=1}^m \\epsilon_i^{x_{pi}}}{\\gamma_{r_p}(\\mathbf\\epsilon)} where $\\gamma_r(\\mathbf\\epsilon)$ is the elementary symmetric function of order $r$ \\gamma_r(\\mathbf\\epsilon) = \\sum_{\\mathbf{y} : \\mathbf{1}^\\intercal \\mathbf{y} = r} \\prod_{j=1}^m \\epsilon_j^{y_j} where the sum is over all possible answer configurations that give a sum score of $r$. 
Algorithms to efficiently compute $\\gamma$ and its derivatives are available in the literature (see e.g. Baker (1996) for a review and Biscarri (2018) for a more modern approach). function esf_sum!(S::AbstractArray{T,1}, x::AbstractArray{T,1}) where T <: Real n = length(x) fill!(S,zero(T)) S[1] = one(T) @inbounds for col in 1:n for r in 1:col row = col - r + 1 S[row+1] = S[row+1] + x[col] * S[row] end end end function esf_ext!(S::AbstractArray{T,1}, H::AbstractArray{T,3}, x::AbstractArray{T,1}) where T <: Real n = length(x) esf_sum!(S, x) H[:,:,1] .= zero(T) H[:,:,2] .= one(T) @inbounds for i in 3:n+1 for j in 1:n H[j,j,i] = S[i-1] - x[j] * H[j,j,i-1] for k in j+1:n H[k,j,i] = S[i-1] - ((x[j]+x[k])*H[k,j,i-1] + x[j]*x[k]*H[k,j,i-2]) H[j,k,i] = H[k,j,i] end end end end esf_ext! (generic function with 1 method) The objective function we want to minimize is the negative log conditional likelihood \\begin{aligned} \\log{L_C(\\mathbf\\epsilon|\\mathbf{r})} &= \\sum_{p=1}^n \\sum_{i=1}^m x_{pi} \\log{\\epsilon_i} - \\log{\\gamma_{r_p}(\\mathbf\\epsilon)}\\\\ &= \\sum_{i=1}^m s_i \\log{\\epsilon_i} - \\sum_{r=1}^m f_r \\log{\\gamma_r(\\mathbf\\epsilon)} \\end{aligned} \u03f5 = ones(Float64, m) \u03b20 = zeros(Float64, m) last_\u03b2 = fill(NaN, m) S = zeros(Float64, m+1) H = zeros(Float64, m, m, m+1) function calculate_common!(x, last_x) if x != last_x copyto!(last_x, x) \u03f5 .= exp.(-x) esf_ext!(S, H, \u03f5) end end function neglogLC(\u03b2) calculate_common!(\u03b2, last_\u03b2) return -s'log.(\u03f5) + f'log.(S[2:end]) end neglogLC (generic function with 1 method) Parameter estimation is usually performed with respect to the unconstrained parameter $\\beta_i = -\\log{\\epsilon_i}$. 
Taking the derivative with respect to $\\beta_i$ (and applying the chain rule) one obtains \\dfrac{\\partial\\log L_C(\\mathbf\\epsilon|\\mathbf{r})}{\\partial \\beta_i} = -s_i + \\epsilon_i\\sum_{r=1}^m \\dfrac{f_r \\gamma_{r-1}^{(i)}}{\\gamma_r} where $\\gamma_{r-1}^{(i)} = \\partial \\gamma_{r}(\\mathbf\\epsilon)/\\partial\\epsilon_i$. function g!(storage, \u03b2) calculate_common!(\u03b2, last_\u03b2) for j in 1:m storage[j] = s[j] for l in 1:m storage[j] -= \u03f5[j] * f[l] * (H[j,j,l+1] / S[l+1]) end end end g! (generic function with 1 method) Similarly the Hessian matrix can be computed $$ \\dfrac{\\partial^2 \\log L_C(\\mathbf\\epsilon|\\mathbf{r})}{\\partial \\beta_i\\partial\\beta_j} = \\begin{cases} \\displaystyle -\\epsilon_i \\sum_{r=1}^m \\dfrac{f_r\\gamma_{r-1}^{(i)}}{\\gamma_r}\\left(1 - \\dfrac{\\epsilon_i\\gamma_{r-1}^{(i)}}{\\gamma_r}\\right) & \\text{if $i=j$}\\\\ \\displaystyle -\\epsilon_i\\epsilon_j\\sum_{r=1}^m \\left(\\dfrac{f_r \\gamma_{r-2}^{(i,j)}}{\\gamma_r} - \\dfrac{f_r\\gamma_{r-1}^{(i)}\\gamma_{r-1}^{(j)}}{\\gamma_r^2}\\right) &\\text{if $i\\neq j$} \\end{cases} $$ where $\\gamma_{r-2}^{(i,j)} = \\partial^2 \\gamma_{r}(\\mathbf\\epsilon)/\\partial\\epsilon_i\\partial\\epsilon_j$. function h!(storage, \u03b2) calculate_common!(\u03b2, last_\u03b2) for j in 1:m for k in 1:m storage[k,j] = 0.0 for l in 1:m if j == k storage[j,j] += f[l] * (\u03f5[j]*H[j,j,l+1] / S[l+1]) * (1 - \u03f5[j]*H[j,j,l+1] / S[l+1]) elseif k > j storage[k,j] += \u03f5[j] * \u03f5[k] * f[l] * ((H[k,j,l] / S[l+1]) - (H[j,j,l+1] * H[k,k,l+1]) / S[l+1] ^ 2) else #k < j storage[k,j] += \u03f5[j] * \u03f5[k] * f[l] * ((H[j,k,l] / S[l+1]) - (H[j,j,l+1] * H[k,k,l+1]) / S[l+1] ^ 2) end end end end end h! (generic function with 1 method) The estimates of the item parameters are then obtained via standard optimization algorithms (either Newton-Raphson or L-BFGS). 
One last issue is that the model is not identifiable (multiplying the $\\xi_p$ by a constant and dividing the $\\epsilon_i$ by the same constant results in the same likelihood). Therefore some kind of constraint must be imposed when estimating the parameters. Typically one imposes either $\\beta_1 = 0$ (that is, $\\epsilon_1 = 1$) or $\\prod_{i=1}^m \\epsilon_i = 1$ (which is equivalent to $\\sum_{i=1}^m \\beta_i = 0$). The code below uses the latter, sum-to-zero constraint. con_c!(c, x) = (c[1] = sum(x); c) function con_jacobian!(J, x) J[1,:] .= ones(length(x)) end function con_h!(h, x, \u03bb) for i in 1:size(h)[1] for j in 1:size(h)[2] h[i,j] += (i == j) ? \u03bb[1] : 0.0 end end end lx = Float64[]; ux = Float64[] lc = [0.0]; uc = [0.0] df = TwiceDifferentiable(neglogLC, g!, h!, \u03b20) dfc = TwiceDifferentiableConstraints(con_c!, con_jacobian!, con_h!, lx, ux, lc, uc) res = optimize(df, dfc, \u03b20, IPNewton()) * Status: success * Candidate solution Minimizer: [1.48e+00, 8.80e-01, -9.81e-01, ...] Minimum: 1.302751e+03 * Found with Algorithm: Interior Point Newton Initial Point: [0.00e+00, 0.00e+00, 0.00e+00, ...] * Convergence measures |x - x'| = 3.39e-08 \u2270 0.0e+00 |x - x'|/|x'| = 2.29e-08 \u2270 0.0e+00 |f(x) - f(x')| = 2.27e-13 \u2270 0.0e+00 |f(x) - f(x')|/|f(x')| = 1.75e-16 \u2270 0.0e+00 |g(x)| = 1.42e-12 \u2264 1.0e-08 * Work counters Seconds run: 0 (vs limit Inf) Iterations: 27 f(x) calls: 62 \u2207f(x) calls: 62 Compare the estimate to the truth (the estimates are constrained to sum to zero while the simulated delta is not, so the two columns differ by a roughly constant shift): delta_hat = res.minimizer [delta delta_hat] 5\u00d72 Array{Float64,2}: 1.14112 1.48015 0.597106 0.880231 -1.30405 -0.981096 -1.2566 -0.955468 -0.706518 -0.423819 This page was generated using Literate.jl .","title":"Conditional maximum likelihood estimation"},{"location":"examples/generated/rasch/#conditional-maximum-likelihood-for-the-rasch-model","text":"Tip This example is also available as a Jupyter notebook: $rasch.ipynb$ The Rasch model is used in psychometrics as a model for assessment data such as student responses to a standardized test. 
Let $X_{pi}$ be the response accuracy of student $p$ to item $i$ where $X_{pi}=1$ if the item was answered correctly and $X_{pi}=0$ otherwise for $p=1,\\ldots,n$ and $i=1,\\ldots,m$. The model for this accuracy is P(\\mathbf{X}_{p}=\\mathbf{x}_{p}|\\xi_p, \\mathbf\\epsilon) = \\prod_{i=1}^m \\dfrac{(\\xi_p \\epsilon_i)^{x_{pi}}}{1 + \\xi_p\\epsilon_i} where $\\xi_p > 0$ is the latent ability of person $p$ and $\\epsilon_i > 0$ is the difficulty of item $i$. We simulate data from this model: Random.seed!(123) n = 1000 m = 5 theta = randn(n) delta = randn(m) r = zeros(n) s = zeros(m) for i in 1:n p = exp.(theta[i] .- delta) ./ (1.0 .+ exp.(theta[i] .- delta)) for j in 1:m if rand() < p[j] ##correct r[i] += 1 s[j] += 1 end end end f = [sum(r.==j) for j in 1:m]; Since the number of parameters increases with sample size, standard maximum likelihood will not provide us with consistent estimates. Instead we consider the conditional likelihood. It can be shown that the Rasch model is an exponential family model and that the sum score $r_p = \\sum_{i} x_{pi}$ is the sufficient statistic for $\\xi_p$. If we condition on the sum score we should be able to eliminate $\\xi_p$. Indeed, with a bit of algebra we can show P(\\mathbf{X}_p = \\mathbf{x}_p | r_p, \\mathbf\\epsilon) = \\dfrac{\\prod_{i=1}^m \\epsilon_i^{x_{pi}}}{\\gamma_{r_p}(\\mathbf\\epsilon)} where $\\gamma_r(\\mathbf\\epsilon)$ is the elementary symmetric function of order $r$ \\gamma_r(\\mathbf\\epsilon) = \\sum_{\\mathbf{y} : \\mathbf{1}^\\intercal \\mathbf{y} = r} \\prod_{j=1}^m \\epsilon_j^{y_j} where the sum is over all possible answer configurations that give a sum score of $r$. 
Algorithms to efficiently compute $\\gamma$ and its derivatives are available in the literature (see e.g. Baker (1996) for a review and Biscarri (2018) for a more modern approach): function esf_sum!(S::AbstractArray{T,1}, x::AbstractArray{T,1}) where T <: Real n = length(x) fill!(S,zero(T)) S[1] = one(T) @inbounds for col in 1:n for r in 1:col row = col - r + 1 S[row+1] = S[row+1] + x[col] * S[row] end end end function esf_ext!(S::AbstractArray{T,1}, H::AbstractArray{T,3}, x::AbstractArray{T,1}) where T <: Real n = length(x) esf_sum!(S, x) H[:,:,1] .= zero(T) H[:,:,2] .= one(T) @inbounds for i in 3:n+1 for j in 1:n H[j,j,i] = S[i-1] - x[j] * H[j,j,i-1] for k in j+1:n H[k,j,i] = S[i-1] - ((x[j]+x[k])*H[k,j,i-1] + x[j]*x[k]*H[k,j,i-2]) H[j,k,i] = H[k,j,i] end end end end esf_ext! (generic function with 1 method) The objective function we want to minimize is the negative log conditional likelihood \\begin{aligned} \\log{L_C(\\mathbf\\epsilon|\\mathbf{r})} &= \\sum_{p=1}^n \\sum_{i=1}^m x_{pi} \\log{\\epsilon_i} - \\log{\\gamma_{r_p}(\\mathbf\\epsilon)}\\\\ &= \\sum_{i=1}^m s_i \\log{\\epsilon_i} - \\sum_{r=1}^m f_r \\log{\\gamma_r(\\mathbf\\epsilon)} \\end{aligned} \u03f5 = ones(Float64, m) \u03b20 = zeros(Float64, m) last_\u03b2 = fill(NaN, m) S = zeros(Float64, m+1) H = zeros(Float64, m, m, m+1) function calculate_common!(x, last_x) if x != last_x copyto!(last_x, x) \u03f5 .= exp.(-x) esf_ext!(S, H, \u03f5) end end function neglogLC(\u03b2) calculate_common!(\u03b2, last_\u03b2) return -s'log.(\u03f5) + f'log.(S[2:end]) end neglogLC (generic function with 1 method) Parameter estimation is usually performed with respect to the unconstrained parameter $\\beta_i = -\\log{\\epsilon_i}$. 
Taking the derivative with respect to $\\beta_i$ (and applying the chain rule) one obtains \\dfrac{\\partial\\log L_C(\\mathbf\\epsilon|\\mathbf{r})}{\\partial \\beta_i} = -s_i + \\epsilon_i\\sum_{r=1}^m \\dfrac{f_r \\gamma_{r-1}^{(i)}}{\\gamma_r} where $\\gamma_{r-1}^{(i)} = \\partial \\gamma_{r}(\\mathbf\\epsilon)/\\partial\\epsilon_i$. function g!(storage, \u03b2) calculate_common!(\u03b2, last_\u03b2) for j in 1:m storage[j] = s[j] for l in 1:m storage[j] -= \u03f5[j] * f[l] * (H[j,j,l+1] / S[l+1]) end end end g! (generic function with 1 method) Similarly the Hessian matrix can be computed $$ \\dfrac{\\partial^2 \\log L_C(\\mathbf\\epsilon|\\mathbf{r})}{\\partial \\beta_i\\partial\\beta_j} = \\begin{cases} \\displaystyle -\\epsilon_i \\sum_{r=1}^m \\dfrac{f_r\\gamma_{r-1}^{(i)}}{\\gamma_r}\\left(1 - \\dfrac{\\gamma_{r-1}^{(i)}}{\\gamma_r}\\right) & \\text{if $i=j$}\\\\ \\displaystyle -\\epsilon_i\\epsilon_j\\sum_{r=1}^m \\dfrac{f_r \\gamma_{r-2}^{(i,j)}}{\\gamma_r} - \\dfrac{f_r\\gamma_{r-1}^{(i)}\\gamma_{r-1}^{(j)}}{\\gamma_r^2} &\\text{if $i\\neq j$} \\end{cases} $$ where $\\gamma_{r-2}^{(i,j)} = \\partial^2 \\gamma_{r}(\\mathbf\\epsilon)/\\partial\\epsilon_i\\partial\\epsilon_j$. function h!(storage, \u03b2) calculate_common!(\u03b2, last_\u03b2) for j in 1:m for k in 1:m storage[k,j] = 0.0 for l in 1:m if j == k storage[j,j] += f[l] * (\u03f5[j]*H[j,j,l+1] / S[l+1]) * (1 - \u03f5[j]*H[j,j,l+1] / S[l+1]) elseif k > j storage[k,j] += \u03f5[j] * \u03f5[k] * f[l] * ((H[k,j,l] / S[l+1]) - (H[j,j,l+1] * H[k,k,l+1]) / S[l+1] ^ 2) else # k < j storage[k,j] += \u03f5[j] * \u03f5[k] * f[l] * ((H[j,k,l] / S[l+1]) - (H[j,j,l+1] * H[k,k,l+1]) / S[l+1] ^ 2) end end end end end h! (generic function with 1 method) The estimates of the item parameters are then obtained via standard optimization algorithms (either Newton-Raphson or L-BFGS). 
One last issue is that the model is not identifiable (multiplying the $\\xi_p$ by a constant and dividing the $\\epsilon_i$ by the same constant results in the same likelihood). Therefore some kind of constraint must be imposed when estimating the parameters. Typically this is either $\\epsilon_1 = 1$ (that is, $\\beta_1 = 0$) or $\\prod_{i=1}^m \\epsilon_i = 1$ (which is equivalent to $\\sum_{i=1}^m \\beta_i = 0$). con_c!(c, x) = (c[1] = sum(x); c) function con_jacobian!(J, x) J[1,:] .= ones(length(x)) end function con_h!(h, x, \u03bb) for i in 1:size(h)[1] for j in 1:size(h)[2] h[i,j] += (i == j) ? \u03bb[1] : 0.0 end end end lx = Float64[]; ux = Float64[] lc = [0.0]; uc = [0.0] df = TwiceDifferentiable(neglogLC, g!, h!, \u03b20) dfc = TwiceDifferentiableConstraints(con_c!, con_jacobian!, con_h!, lx, ux, lc, uc) res = optimize(df, dfc, \u03b20, IPNewton()) * Status: success * Candidate solution Minimizer: [1.48e+00, 8.80e-01, -9.81e-01, ...] Minimum: 1.302751e+03 * Found with Algorithm: Interior Point Newton Initial Point: [0.00e+00, 0.00e+00, 0.00e+00, ...] * Convergence measures |x - x'| = 3.39e-08 \u2270 0.0e+00 |x - x'|/|x'| = 2.29e-08 \u2270 0.0e+00 |f(x) - f(x')| = 2.27e-13 \u2270 0.0e+00 |f(x) - f(x')|/|f(x')| = 1.75e-16 \u2270 0.0e+00 |g(x)| = 1.42e-12 \u2264 1.0e-08 * Work counters Seconds run: 0 (vs limit Inf) Iterations: 27 f(x) calls: 62 \u2207f(x) calls: 62 Compare the estimate to the truth delta_hat = res.minimizer [delta delta_hat] 5\u00d72 Array{Float64,2}: 1.14112 1.48015 0.597106 0.880231 -1.30405 -0.981096 -1.2566 -0.955468 -0.706518 -0.423819 This page was generated using Literate.jl .","title":"Conditional Maximum Likelihood for the Rasch Model"},{"location":"user/algochoice/","text":"Algorithm choice There are two main settings you must choose in Optim: the algorithm and the linesearch. Algorithms The first choice to be made is that of the order of the method. Zeroth-order methods do not have gradient information, and are very slow to converge, especially in high dimension. 
First-order methods do not have access to curvature information and can take a large number of iterations to converge for badly conditioned problems. Second-order methods can converge very quickly once in the vicinity of a minimizer. Of course, this enhanced performance comes at a cost: the objective function has to be differentiable, you have to supply gradients and Hessians, and, for second order methods, a linear system has to be solved at each step. If you can provide analytic gradients and Hessians, and the dimension of the problem is not too large, then second order methods are very efficient. The Newton method with trust region is the method of choice. When you do not have an explicit Hessian or when the dimension becomes large enough that the linear solve in the Newton method becomes the bottleneck, first order methods should be preferred. BFGS is a very efficient method, but also requires a linear system solve. LBFGS usually has a performance very close to that of BFGS, and avoids linear system solves (the parameter m can be tweaked: increasing it can improve the convergence, at the expense of memory and time spent in linear algebra operations). The conjugate gradient method usually converges less quickly than LBFGS, but requires less memory. Gradient descent should only be used for testing. Acceleration methods are experimental. When the objective function is non-differentiable or you do not want to use gradients, use zeroth-order methods. Nelder-Mead is currently the most robust. Linesearches Linesearches are used in every first- and second-order method except for the trust-region Newton method. Linesearch routines attempt to locate quickly an approximate minimizer of the univariate function $\\alpha \\to f(x+ \\alpha d)$, where $d$ is the descent direction computed by the algorithm. They vary in how accurate this minimization is. Two good linesearches are BackTracking and HagerZhang, the former being less stringent than the latter. 
For well-conditioned objective functions and methods where the step is usually well-scaled (such as LBFGS or Newton), a rough linesearch such as BackTracking is usually the most performant. For badly behaved problems or when extreme accuracy is needed (gradients below the square root of the machine epsilon, about $10^{-8}$ with Float64 ), the HagerZhang method proves more robust. An exception is the conjugate gradient method which requires an accurate linesearch to be efficient, and should be used with the HagerZhang linesearch. Summary As a very crude heuristic: For a low-dimensional problem with analytic gradients and Hessians, use the Newton method with trust region. For larger problems or when there is no analytic Hessian, use LBFGS, and tweak the parameter m if needed. If the function is non-differentiable, use Nelder-Mead. Use the HagerZhang linesearch for robustness and BackTracking for speed.","title":"Algorithm choice"},{"location":"user/algochoice/#algorithm-choice","text":"There are two main settings you must choose in Optim: the algorithm and the linesearch.","title":"Algorithm choice"},{"location":"user/algochoice/#algorithms","text":"The first choice to be made is that of the order of the method. Zeroth-order methods do not have gradient information, and are very slow to converge, especially in high dimension. First-order methods do not have access to curvature information and can take a large number of iterations to converge for badly conditioned problems. Second-order methods can converge very quickly once in the vicinity of a minimizer. Of course, this enhanced performance comes at a cost: the objective function has to be differentiable, you have to supply gradients and Hessians, and, for second order methods, a linear system has to be solved at each step. If you can provide analytic gradients and Hessians, and the dimension of the problem is not too large, then second order methods are very efficient. 
The Newton method with trust region is the method of choice. When you do not have an explicit Hessian or when the dimension becomes large enough that the linear solve in the Newton method becomes the bottleneck, first order methods should be preferred. BFGS is a very efficient method, but also requires a linear system solve. LBFGS usually has a performance very close to that of BFGS, and avoids linear system solves (the parameter m can be tweaked: increasing it can improve the convergence, at the expense of memory and time spent in linear algebra operations). The conjugate gradient method usually converges less quickly than LBFGS, but requires less memory. Gradient descent should only be used for testing. Acceleration methods are experimental. When the objective function is non-differentiable or you do not want to use gradients, use zeroth-order methods. Nelder-Mead is currently the most robust.","title":"Algorithms"},{"location":"user/algochoice/#linesearches","text":"Linesearches are used in every first- and second-order method except for the trust-region Newton method. Linesearch routines attempt to locate quickly an approximate minimizer of the univariate function $\\alpha \\to f(x+ \\alpha d)$, where $d$ is the descent direction computed by the algorithm. They vary in how accurate this minimization is. Two good linesearches are BackTracking and HagerZhang, the former being less stringent than the latter. For well-conditioned objective functions and methods where the step is usually well-scaled (such as LBFGS or Newton), a rough linesearch such as BackTracking is usually the most performant. For badly behaved problems or when extreme accuracy is needed (gradients below the square root of the machine epsilon, about $10^{-8}$ with Float64 ), the HagerZhang method proves more robust. 
An exception is the conjugate gradient method which requires an accurate linesearch to be efficient, and should be used with the HagerZhang linesearch.","title":"Linesearches"},{"location":"user/algochoice/#summary","text":"As a very crude heuristic: For a low-dimensional problem with analytic gradients and Hessians, use the Newton method with trust region. For larger problems or when there is no analytic Hessian, use LBFGS, and tweak the parameter m if needed. If the function is non-differentiable, use Nelder-Mead. Use the HagerZhang linesearch for robustness and BackTracking for speed.","title":"Summary"},{"location":"user/config/","text":"Configurable options There are several options that simply take on some default values if the user doesn't supply anything other than a function (and gradient) and a starting point. Solver options There are quite a few different solvers available in Optim, and they are all listed below. Notice that the constructors are written without input here, but they generally take keywords to tweak the way they work. See the pages describing each solver for more detail. Requires only a function handle: NelderMead() SimulatedAnnealing() Requires a function and gradient (will be approximated if omitted): BFGS() LBFGS() ConjugateGradient() GradientDescent() MomentumGradientDescent() AcceleratedGradientDescent() Requires a function, a gradient, and a Hessian (cannot be omitted): Newton() NewtonTrustRegion() Box constrained minimization: Fminbox() Special methods for bounded univariate optimization: Brent() GoldenSection() General Options In addition to the solver, you can alter the behavior of the Optim package by using the following keywords: x_tol : Absolute tolerance in changes of the input vector x , in infinity norm. Defaults to 0.0 . f_tol : Relative tolerance in changes of the objective value. Defaults to 0.0 . g_tol : Absolute tolerance in the gradient, in infinity norm. Defaults to 1e-8 . 
For gradient free methods, this will control the main convergence tolerance, which is solver specific. f_calls_limit : A soft upper limit on the number of objective calls. Defaults to 0 (unlimited). g_calls_limit : A soft upper limit on the number of gradient calls. Defaults to 0 (unlimited). h_calls_limit : A soft upper limit on the number of Hessian calls. Defaults to 0 (unlimited). allow_f_increases : Allow steps that increase the objective value. Defaults to false . Note that, when setting this to true , the last iterate will be returned as the minimizer even if the objective increased. iterations : How many iterations will run before the algorithm gives up? Defaults to 1_000 . store_trace : Should a trace of the optimization algorithm's state be stored? Defaults to false . show_trace : Should a trace of the optimization algorithm's state be shown on stdout ? Defaults to false . extended_trace : Save additional information. Solver dependent. Defaults to false . trace_simplex : Include the full simplex in the trace for NelderMead . Defaults to false . show_every : Trace output is printed every show_every th iteration. callback : A function to be called during tracing. A return value of true stops the optimize call. time_limit : A soft upper limit on the total run time. Defaults to NaN (unlimited). We currently recommend the statically dispatched interface by using the Optim.Options constructor: res = optimize(f, g!, [0.0, 0.0], GradientDescent(), Optim.Options(g_tol = 1e-12, iterations = 10, store_trace = true, show_trace = false)) Another interface is also available, based directly on keywords: res = optimize(f, g!, [0.0, 0.0], method = GradientDescent(), g_tol = 1e-12, iterations = 10, store_trace = true, show_trace = false) Notice the need to specify the method using a keyword if this syntax is used. 
This approach might be deprecated in the future, and as a result we recommend writing code that has to be maintained using the Optim.Options approach.","title":"Configurable Options"},{"location":"user/config/#configurable-options","text":"There are several options that simply take on some default values if the user doesn't supply anything other than a function (and gradient) and a starting point.","title":"Configurable options"},{"location":"user/config/#solver-options","text":"There are quite a few different solvers available in Optim, and they are all listed below. Notice that the constructors are written without input here, but they generally take keywords to tweak the way they work. See the pages describing each solver for more detail. Requires only a function handle: NelderMead() SimulatedAnnealing() Requires a function and gradient (will be approximated if omitted): BFGS() LBFGS() ConjugateGradient() GradientDescent() MomentumGradientDescent() AcceleratedGradientDescent() Requires a function, a gradient, and a Hessian (cannot be omitted): Newton() NewtonTrustRegion() Box constrained minimization: Fminbox() Special methods for bounded univariate optimization: Brent() GoldenSection()","title":"Solver options"},{"location":"user/config/#general-options","text":"In addition to the solver, you can alter the behavior of the Optim package by using the following keywords: x_tol : Absolute tolerance in changes of the input vector x , in infinity norm. Defaults to 0.0 . f_tol : Relative tolerance in changes of the objective value. Defaults to 0.0 . g_tol : Absolute tolerance in the gradient, in infinity norm. Defaults to 1e-8 . For gradient free methods, this will control the main convergence tolerance, which is solver specific. f_calls_limit : A soft upper limit on the number of objective calls. Defaults to 0 (unlimited). g_calls_limit : A soft upper limit on the number of gradient calls. Defaults to 0 (unlimited). 
h_calls_limit : A soft upper limit on the number of Hessian calls. Defaults to 0 (unlimited). allow_f_increases : Allow steps that increase the objective value. Defaults to false . Note that, when setting this to true , the last iterate will be returned as the minimizer even if the objective increased. iterations : How many iterations will run before the algorithm gives up? Defaults to 1_000 . store_trace : Should a trace of the optimization algorithm's state be stored? Defaults to false . show_trace : Should a trace of the optimization algorithm's state be shown on stdout ? Defaults to false . extended_trace : Save additional information. Solver dependent. Defaults to false . trace_simplex : Include the full simplex in the trace for NelderMead . Defaults to false . show_every : Trace output is printed every show_every th iteration. callback : A function to be called during tracing. A return value of true stops the optimize call. time_limit : A soft upper limit on the total run time. Defaults to NaN (unlimited). We currently recommend the statically dispatched interface by using the Optim.Options constructor: res = optimize(f, g!, [0.0, 0.0], GradientDescent(), Optim.Options(g_tol = 1e-12, iterations = 10, store_trace = true, show_trace = false)) Another interface is also available, based directly on keywords: res = optimize(f, g!, [0.0, 0.0], method = GradientDescent(), g_tol = 1e-12, iterations = 10, store_trace = true, show_trace = false) Notice the need to specify the method using a keyword if this syntax is used. This approach might be deprecated in the future, and as a result we recommend writing code that has to be maintained using the Optim.Options approach.","title":"General Options"},{"location":"user/gradientsandhessians/","text":"Gradients and Hessians To use first- and second-order methods, you need to provide gradients and Hessians, either in-place or out-of-place. 
There are three main ways of specifying derivatives: analytic, finite-difference and automatic differentiation. Analytic This results in the fastest run times, but requires the user to perform the often tedious task of computing the derivatives by hand. The gradient of complicated objective functions (e.g. involving the solution of algebraic equations, differential equations, eigendecompositions, etc.) can be computed efficiently using the adjoint method (see e.g. these lecture notes ). In particular, assuming infinite memory, the gradient of a $\\mathbb{R}^N \\to \\mathbb{R}$ function $f$ can always be computed with a runtime comparable with only one evaluation of $f$, no matter how large $N$. To use analytic derivatives, simply pass g! and h! functions to optimize . Finite differences This uses the functionality in DiffEqDiffTools.jl to compute gradients and Hessians through central finite differences: $f'(x) \\approx \\frac{f(x+h)-f(x-h)}{2h}$. For a $\\mathbb{R}^N \\to \\mathbb{R}$ objective function $f$, this requires $2N$ evaluations of $f$. It is therefore efficient in low dimensions but slow when $N$ is large. It is also inaccurate: $h$ is chosen equal to $\\epsilon^{1/3}$ where $\\epsilon$ is the machine epsilon (about $10^{-16}$ for Float64 ) to balance the truncation and rounding errors, resulting in an error of $\\epsilon^{2/3}$ (about $10^{-11}$ for Float64 ) for the derivative. Finite differences are on by default if gradients and Hessians are not supplied to the optimize call. Automatic differentiation Automatic differentiation techniques are a middle ground between finite differences and analytic computations. They are exact up to machine precision, and do not require intervention from the user. They come in two main flavors: forward and reverse mode . Forward-mode automatic differentiation is relatively straightforward to implement by propagating the sensitivities of the input variables, and is often faster than finite differences. 
The disadvantage is that the objective function has to be written using only Julia code. Forward-mode automatic differentiation still requires a runtime comparable to $N$ evaluations of $f$, and is therefore costly in large dimensions, like finite differences. Reverse-mode automatic differentiation can be seen as an automatic implementation of the adjoint method mentioned above, and requires a runtime comparable to only one evaluation of $f$. It is however considerably more complex to implement, since the execution of the program must be recorded and then run backwards, and it incurs a larger overhead. Forward-mode automatic differentiation is supported through the ForwardDiff.jl package by providing the autodiff=:forward keyword to optimize . Reverse-mode automatic differentiation is not supported explicitly yet (although you can use it by writing your own g! function). There are a number of implementations in Julia, such as ReverseDiff.jl . Example Let us consider the Rosenbrock example again. function f(x) return (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2 end function g!(G, x) G[1] = -2.0 * (1.0 - x[1]) - 400.0 * (x[2] - x[1]^2) * x[1] G[2] = 200.0 * (x[2] - x[1]^2) end function h!(H, x) H[1, 1] = 2.0 - 400.0 * x[2] + 1200.0 * x[1]^2 H[1, 2] = -400.0 * x[1] H[2, 1] = -400.0 * x[1] H[2, 2] = 200.0 end initial_x = zeros(2) Let us see if BFGS and Newton's Method can solve this problem with the functions provided. julia> Optim.minimizer(optimize(f, g!, h!, initial_x, BFGS())) 2-element Array{Float64,1}: 1.0 1.0 julia> Optim.minimizer(optimize(f, g!, h!, initial_x, Newton())) 2-element Array{Float64,1}: 1.0 1.0 This is indeed the case. Now let us use finite differences for BFGS. julia> Optim.minimizer(optimize(f, initial_x, BFGS())) 2-element Array{Float64,1}: 1.0 1.0 Still looks good. Returning to automatic differentiation, let us try both solvers using this method. We enable forward mode automatic differentiation by using the autodiff = :forward keyword. 
julia> Optim.minimizer(optimize(f, initial_x, BFGS(); autodiff = :forward)) 2-element Array{Float64,1}: 1.0 1.0 julia> Optim.minimizer(optimize(f, initial_x, Newton(); autodiff = :forward)) 2-element Array{Float64,1}: 1.0 1.0 Indeed, the minimizer was found, without providing any gradients or Hessians.","title":"Gradients and Hessians"},{"location":"user/gradientsandhessians/#gradients-and-hessians","text":"To use first- and second-order methods, you need to provide gradients and Hessians, either in-place or out-of-place. There are three main ways of specifying derivatives: analytic, finite-difference and automatic differentiation.","title":"Gradients and Hessians"},{"location":"user/gradientsandhessians/#analytic","text":"This results in the fastest run times, but requires the user to perform the often tedious task of computing the derivatives by hand. The gradient of complicated objective functions (e.g. involving the solution of algebraic equations, differential equations, eigendecompositions, etc.) can be computed efficiently using the adjoint method (see e.g. these lecture notes ). In particular, assuming infinite memory, the gradient of a $\\mathbb{R}^N \\to \\mathbb{R}$ function $f$ can always be computed with a runtime comparable with only one evaluation of $f$, no matter how large $N$. To use analytic derivatives, simply pass g! and h! functions to optimize .","title":"Analytic"},{"location":"user/gradientsandhessians/#finite-differences","text":"This uses the functionality in DiffEqDiffTools.jl to compute gradients and Hessians through central finite differences: $f'(x) \\approx \\frac{f(x+h)-f(x-h)}{2h}$. For a $\\mathbb{R}^N \\to \\mathbb{R}$ objective function $f$, this requires $2N$ evaluations of $f$. It is therefore efficient in low dimensions but slow when $N$ is large. 
It is also inaccurate: $h$ is chosen equal to $\\epsilon^{1/3}$ where $\\epsilon$ is the machine epsilon (about $10^{-16}$ for Float64 ) to balance the truncation and rounding errors, resulting in an error of $\\epsilon^{2/3}$ (about $10^{-11}$ for Float64 ) for the derivative. Finite differences are on by default if gradients and Hessians are not supplied to the optimize call.","title":"Finite differences"},{"location":"user/gradientsandhessians/#automatic-differentiation","text":"Automatic differentiation techniques are a middle ground between finite differences and analytic computations. They are exact up to machine precision, and do not require intervention from the user. They come in two main flavors: forward and reverse mode . Forward-mode automatic differentiation is relatively straightforward to implement by propagating the sensitivities of the input variables, and is often faster than finite differences. The disadvantage is that the objective function has to be written using only Julia code. Forward-mode automatic differentiation still requires a runtime comparable to $N$ evaluations of $f$, and is therefore costly in large dimensions, like finite differences. Reverse-mode automatic differentiation can be seen as an automatic implementation of the adjoint method mentioned above, and requires a runtime comparable to only one evaluation of $f$. It is however considerably more complex to implement, requiring to record the execution of the program to then run it backwards, and incurs a larger overhead. Forward-mode automatic differentiation is supported through the ForwardDiff.jl package by providing the autodiff=:forward keyword to optimize . Reverse-mode automatic differentiation is not supported explicitly yet (although you can use it by writing your own g! function). 
There are a number of implementations in Julia, such as ReverseDiff.jl .","title":"Automatic differentiation"},{"location":"user/gradientsandhessians/#example","text":"Let us consider the Rosenbrock example again. function f(x) return (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2 end function g!(G, x) G[1] = -2.0 * (1.0 - x[1]) - 400.0 * (x[2] - x[1]^2) * x[1] G[2] = 200.0 * (x[2] - x[1]^2) end function h!(H, x) H[1, 1] = 2.0 - 400.0 * x[2] + 1200.0 * x[1]^2 H[1, 2] = -400.0 * x[1] H[2, 1] = -400.0 * x[1] H[2, 2] = 200.0 end initial_x = zeros(2) Let us see if BFGS and Newton's Method can solve this problem with the functions provided. julia> Optim.minimizer(optimize(f, g!, h!, initial_x, BFGS())) 2-element Array{Float64,1}: 1.0 1.0 julia> Optim.minimizer(optimize(f, g!, h!, initial_x, Newton())) 2-element Array{Float64,1}: 1.0 1.0 This is indeed the case. Now let us use finite differences for BFGS. julia> Optim.minimizer(optimize(f, initial_x, BFGS())) 2-element Array{Float64,1}: 1.0 1.0 Still looks good. Returning to automatic differentiation, let us try both solvers using this method. We enable forward mode automatic differentiation by using the autodiff = :forward keyword. julia> Optim.minimizer(optimize(f, initial_x, BFGS(); autodiff = :forward)) 2-element Array{Float64,1}: 1.0 1.0 julia> Optim.minimizer(optimize(f, initial_x, Newton(); autodiff = :forward)) 2-element Array{Float64,1}: 1.0 1.0 Indeed, the minimizer was found, without providing any gradients or Hessians.","title":"Example"},{"location":"user/minimization/","text":"Unconstrained Optimization To show how the Optim package can be used, we minimize the Rosenbrock function , a classical test problem for numerical optimization. We'll assume that you've already installed the Optim package using Julia's package manager. 
First, we load Optim and define the Rosenbrock function: using Optim f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2 Once we've defined this function, we can find the minimizer (the input that minimizes the objective) and the minimum (the value of the objective at the minimizer) using any of our favorite optimization algorithms. With a function defined, we just specify a starting point x0 and call optimize : x0 = [0.0, 0.0] optimize(f, x0) Note : it is important to pass initial_x as an array. If your problem is one-dimensional, you have to wrap it in an array. An easy way to do so is to write optimize(x -> f(first(x)), [initial_x]) which makes sure the input is an array, while the anonymous function passes the first (and only) element on to your given f . Optim will default to using the Nelder-Mead method in the multivariate case, as we did not provide a gradient. This can also be explicitly specified using: optimize(f, x0, NelderMead()) Other solvers are available. Below, we use L-BFGS, a quasi-Newton method that requires a gradient. If we pass f alone, Optim will construct an approximate gradient for us using central finite differencing: optimize(f, x0, LBFGS()) For better performance and greater precision, you can pass your own gradient function. If your objective is written in all Julia code with no special calls to external (that is non-Julia) libraries, you can also use automatic differentiation, by using the autodiff keyword and setting it to :forward : optimize(f, x0, LBFGS(); autodiff = :forward) For the Rosenbrock example, the analytical gradient can be shown to be: function g!(G, x) G[1] = -2.0 * (1.0 - x[1]) - 400.0 * (x[2] - x[1]^2) * x[1] G[2] = 200.0 * (x[2] - x[1]^2) end Note that the functions we're using to calculate the gradient (and later the Hessian h! ) of the Rosenbrock function mutate a fixed-sized storage array, which is passed as an additional argument called G (or H for the Hessian) in these examples. 
By mutating a single array over many iterations, this style of function definition removes the sometimes considerable costs associated with allocating a new array during each call to the g! or h! functions. If you prefer to have your gradients simply accept an x , you can still use optimize by setting the inplace keyword to false : optimize(f, g, x0; inplace = false) where g is a function of x only. Returning to our in-place version, you simply pass g! together with f from before to use the gradient: optimize(f, g!, x0, LBFGS()) For some methods, like simulated annealing, the gradient will be ignored: optimize(f, g!, x0, SimulatedAnnealing()) In addition to providing gradients, you can provide a Hessian function h! as well. In our current case this is: function h!(H, x) H[1, 1] = 2.0 - 400.0 * x[2] + 1200.0 * x[1]^2 H[1, 2] = -400.0 * x[1] H[2, 1] = -400.0 * x[1] H[2, 2] = 200.0 end Now we can use Newton's method for optimization by running: optimize(f, g!, h!, x0) Which defaults to Newton() since a Hessian function was provided. Like gradients, the Hessian function will be ignored if you use a method that does not require it: optimize(f, g!, h!, x0, LBFGS()) Note that Optim will not generate approximate Hessians using finite differencing because of the potentially low accuracy of approximations to the Hessians. Other than Newton's method, none of the algorithms provided by the Optim package employ exact Hessians. Box Constrained Optimization A primal interior-point algorithm for simple \"box\" constraints (lower and upper bounds) is available. 
Reusing our Rosenbrock example from above, boxed minimization is performed as follows: lower = [1.25, -2.1] upper = [Inf, Inf] initial_x = [2.0, 2.0] inner_optimizer = GradientDescent() results = optimize(f, g!, lower, upper, initial_x, Fminbox(inner_optimizer)) This performs optimization with a barrier penalty, successively scaling down the barrier coefficient and using the chosen inner_optimizer ( GradientDescent() above) for convergence at each step. To change algorithm specific options, such as the line search algorithm, specify it directly in the inner_optimizer constructor: lower = [1.25, -2.1] upper = [Inf, Inf] initial_x = [2.0, 2.0] # requires using LineSearches inner_optimizer = GradientDescent(linesearch=LineSearches.BackTracking(order=3)) results = optimize(f, g!, lower, upper, initial_x, Fminbox(inner_optimizer)) This algorithm uses diagonal preconditioning to improve the accuracy, and hence is a good example of how to use ConjugateGradient or LBFGS with preconditioning. Other methods will currently not use preconditioning. Only the box constraints are used. If you can analytically compute the diagonal of the Hessian of your objective function, you may want to consider writing your own preconditioner. There are two iterations parameters: an outer iterations parameter used to control Fminbox and an inner iterations parameter used to control the inner optimizer. 
For example, the following restricts the optimization to 2 major iterations results = optimize(f, g!, lower, upper, initial_x, Fminbox(GradientDescent()), Optim.Options(outer_iterations = 2)) In contrast, the following sets the maximum number of iterations for each GradientDescent() optimization to 2 results = optimize(f, g!, lower, upper, initial_x, Fminbox(GradientDescent()), Optim.Options(iterations = 2)) Minimizing a univariate function on a bounded interval Minimization of univariate functions without derivatives is available through the optimize interface: optimize(f, lower, upper, method; kwargs...) Notice the lack of initial x . A specific example is the following quadratic function. julia> f_univariate(x) = 2x^2+3x+1 f_univariate (generic function with 1 method) julia> optimize(f_univariate, -2.0, 1.0) Results of Optimization Algorithm * Algorithm: Brent's Method * Search Interval: [-2.000000, 1.000000] * Minimizer: -7.500000e-01 * Minimum: -1.250000e-01 * Iterations: 7 * Convergence: max(|x - x_upper|, |x - x_lower|) <= 2*(1.5e-08*|x|+2.2e-16): true * Objective Function Calls: 8 The output shows that we provided an initial lower and upper bound, that there is a final minimizer and minimum, and that it used seven major iterations. Importantly, we also see that convergence was declared. The default method is Brent's method, which is one of two available methods: Brent's method, the default (can be explicitly selected with Brent() ). Golden section search, available with GoldenSection() . If we want to manually specify this method, we use the usual syntax as for multivariate optimization. optimize(f, lower, upper, Brent(); kwargs...) optimize(f, lower, upper, GoldenSection(); kwargs...) Keywords are used to set options for this special type of optimization. 
In addition to the iterations , store_trace , show_trace and extended_trace options, the following options are also available: rel_tol : The relative tolerance used for determining convergence. 
Defaults to sqrt(eps(T)) . abs_tol : The absolute tolerance used for determining convergence. Defaults to eps(T) . Obtaining results After we have our results in res , we can use the API for getting optimization results. This consists of a collection of functions. They are not exported, so they have to be prefixed by Optim. . Say we do the following optimization: res = optimize(x -> dot(x,[1 0. 0; 0 3 0; 0 0 1]*x), zeros(3)) If we can't remember what method we used, we simply use summary(res) which will return \"Nelder Mead\" . A bit more useful information is the minimizer and minimum of the objective functions, which can be found using julia> Optim.minimizer(res) 3-element Array{Float64,1}: -0.499921 -0.3333 -1.49994 julia> Optim.minimum(res) -2.8333333205768865 Complete list of functions A complete list of functions can be found below. Defined for all methods: summary(res) minimizer(res) minimum(res) iterations(res) iteration_limit_reached(res) trace(res) x_trace(res) f_trace(res) f_calls(res) converged(res) Defined for univariate optimization: lower_bound(res) upper_bound(res) x_lower_trace(res) x_upper_trace(res) rel_tol(res) abs_tol(res) Defined for multivariate optimization: g_norm_trace(res) g_calls(res) x_converged(res) f_converged(res) g_converged(res) initial_state(res) Input types Most users will input Vector 's as their initial_x 's, and get an Optim.minimizer(res) out that is also a vector. For zeroth and first order methods, it is also possible to pass in matrices, or even higher dimensional arrays. The only restriction imposed by leaving the Vector case is, that it is no longer possible to use finite difference approximations or automatic differentiation. Second order methods (variants of Newton's method) do not support this more general input type. Notes on convergence flags and checks Currently, it is possible to access a minimizer using Optim.minimizer(result) even if all convergence flags are false . 
This means that the user has to be a bit careful when using the output from the solvers. It is advised to include checks for convergence if the minimizer or minimum is used to carry out further calculations. A related note is that first and second order methods make a convergence check on the gradient before entering the optimization loop. This is done to prevent line search errors if initial_x is a stationary point. Note that this is only a first order check. If initial_x is any type of stationary point, g_converged will be true. This includes local minima, saddle points, and local maxima. If iterations is 0 and g_converged is true , the user needs to keep this point in mind.","title":"Minimizing a function"},{"location":"user/minimization/#unconstrained-optimization","text":"To show how the Optim package can be used, we minimize the Rosenbrock function , a classical test problem for numerical optimization. We'll assume that you've already installed the Optim package using Julia's package manager. First, we load Optim and define the Rosenbrock function: using Optim f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2 Once we've defined this function, we can find the minimizer (the input that minimizes the objective) and the minimum (the value of the objective at the minimizer) using any of our favorite optimization algorithms. With a function defined, we just specify a starting point x0 and call optimize : x0 = [0.0, 0.0] optimize(f, x0) Note : it is important to pass initial_x as an array. If your problem is one-dimensional, you have to wrap it in an array. An easy way to do so is to write optimize(x -> f(first(x)), [initial_x]) which makes sure the input is an array, while the anonymous function passes the first (and only) element on to your given f . Optim will default to using the Nelder-Mead method in the multivariate case, as we did not provide a gradient. 
This can also be explicitly specified using: optimize(f, x0, NelderMead()) Other solvers are available. Below, we use L-BFGS, a quasi-Newton method that requires a gradient. If we pass f alone, Optim will construct an approximate gradient for us using central finite differencing: optimize(f, x0, LBFGS()) For better performance and greater precision, you can pass your own gradient function. If your objective is written in all Julia code with no special calls to external (that is non-Julia) libraries, you can also use automatic differentiation, by using the autodiff keyword and setting it to :forward : optimize(f, x0, LBFGS(); autodiff = :forward) For the Rosenbrock example, the analytical gradient can be shown to be: function g!(G, x) G[1] = -2.0 * (1.0 - x[1]) - 400.0 * (x[2] - x[1]^2) * x[1] G[2] = 200.0 * (x[2] - x[1]^2) end Note, that the functions we're using to calculate the gradient (and later the Hessian h! ) of the Rosenbrock function mutate a fixed-sized storage array, which is passed as an additional argument called G (or H for the Hessian) in these examples. By mutating a single array over many iterations, this style of function definition removes the sometimes considerable costs associated with allocating a new array during each call to the g! or h! functions. If you prefer to have your gradients simply accept an x , you can still use optimize by setting the inplace keyword to false : optimize(f, g, x0; inplace = false) where g is a function of x only. Returning to our in-place version, you simply pass g! together with f from before to use the gradient: optimize(f, g!, x0, LBFGS()) For some methods, like simulated annealing, the gradient will be ignored: optimize(f, g!, x0, SimulatedAnnealing()) In addition to providing gradients, you can provide a Hessian function h! as well. 
In our current case this is: function h!(H, x) H[1, 1] = 2.0 - 400.0 * x[2] + 1200.0 * x[1]^2 H[1, 2] = -400.0 * x[1] H[2, 1] = -400.0 * x[1] H[2, 2] = 200.0 end Now we can use Newton's method for optimization by running: optimize(f, g!, h!, x0) Which defaults to Newton() since a Hessian function was provided. Like gradients, the Hessian function will be ignored if you use a method that does not require it: optimize(f, g!, h!, x0, LBFGS()) Note that Optim will not generate approximate Hessians using finite differencing because of the potentially low accuracy of approximations to the Hessians. Other than Newton's method, none of the algorithms provided by the Optim package employ exact Hessians.","title":"Unconstrained Optimization"},{"location":"user/minimization/#box-constrained-optimization","text":"A primal interior-point algorithm for simple \"box\" constraints (lower and upper bounds) is available. Reusing our Rosenbrock example from above, boxed minimization is performed as follows: lower = [1.25, -2.1] upper = [Inf, Inf] initial_x = [2.0, 2.0] inner_optimizer = GradientDescent() results = optimize(f, g!, lower, upper, initial_x, Fminbox(inner_optimizer)) This performs optimization with a barrier penalty, successively scaling down the barrier coefficient and using the chosen inner_optimizer ( GradientDescent() above) for convergence at each step. To change algorithm specific options, such as the line search algorithm, specify it directly in the inner_optimizer constructor: lower = [1.25, -2.1] upper = [Inf, Inf] initial_x = [2.0, 2.0] # requires using LineSearches inner_optimizer = GradientDescent(linesearch=LineSearches.BackTracking(order=3)) results = optimize(f, g!, lower, upper, initial_x, Fminbox(inner_optimizer)) This algorithm uses diagonal preconditioning to improve the accuracy, and hence is a good example of how to use ConjugateGradient or LBFGS with preconditioning. Other methods will currently not use preconditioning. 
Only the box constraints are used. If you can analytically compute the diagonal of the Hessian of your objective function, you may want to consider writing your own preconditioner. There are two iterations parameters: an outer iterations parameter used to control Fminbox and an inner iterations parameter used to control the inner optimizer. For example, the following restricts the optimization to 2 major iterations results = optimize(f, g!, lower, upper, initial_x, Fminbox(GradientDescent()), Optim.Options(outer_iterations = 2)) In contrast, the following sets the maximum number of iterations for each GradientDescent() optimization to 2 results = optimize(f, g!, lower, upper, initial_x, Fminbox(GradientDescent()), Optim.Options(iterations = 2))","title":"Box Constrained Optimization"},{"location":"user/minimization/#minimizing-a-univariate-function-on-a-bounded-interval","text":"Minimization of univariate functions without derivatives is available through the optimize interface: optimize(f, lower, upper, method; kwargs...) Notice the lack of initial x . A specific example is the following quadratic function. julia> f_univariate(x) = 2x^2+3x+1 f_univariate (generic function with 1 method) julia> optimize(f_univariate, -2.0, 1.0) Results of Optimization Algorithm * Algorithm: Brent's Method * Search Interval: [-2.000000, 1.000000] * Minimizer: -7.500000e-01 * Minimum: -1.250000e-01 * Iterations: 7 * Convergence: max(|x - x_upper|, |x - x_lower|) <= 2*(1.5e-08*|x|+2.2e-16): true * Objective Function Calls: 8 The output shows that we provided an initial lower and upper bound, that there is a final minimizer and minimum, and that it used seven major iterations. Importantly, we also see that convergence was declared. The default method is Brent's method, which is one of two available methods: Brent's method, the default (can be explicitly selected with Brent() ). Golden section search, available with GoldenSection() . 
If we want to manually specify this method, we use the usual syntax as for multivariate optimization. optimize(f, lower, upper, Brent(); kwargs...) optimize(f, lower, upper, GoldenSection(); kwargs...) Keywords are used to set options for this special type of optimization. In addition to the iterations , store_trace , show_trace and extended_trace options, the following options are also available: rel_tol : The relative tolerance used for determining convergence. Defaults to sqrt(eps(T)) . abs_tol : The absolute tolerance used for determining convergence. Defaults to eps(T) .","title":"Minimizing a univariate function on a bounded interval"},{"location":"user/minimization/#obtaining-results","text":"After we have our results in res , we can use the API for getting optimization results. This consists of a collection of functions. They are not exported, so they have to be prefixed by Optim. . Say we do the following optimization: res = optimize(x -> dot(x,[1 0. 0; 0 3 0; 0 0 1]*x), zeros(3)) If we can't remember what method we used, we simply use summary(res) which will return \"Nelder Mead\" . A bit more useful information is the minimizer and minimum of the objective functions, which can be found using julia> Optim.minimizer(res) 3-element Array{Float64,1}: -0.499921 -0.3333 -1.49994 julia> Optim.minimum(res) -2.8333333205768865","title":"Obtaining results"},{"location":"user/minimization/#complete-list-of-functions","text":"A complete list of functions can be found below. 
Defined for all methods: summary(res) minimizer(res) minimum(res) iterations(res) iteration_limit_reached(res) trace(res) x_trace(res) f_trace(res) f_calls(res) converged(res) Defined for univariate optimization: lower_bound(res) upper_bound(res) x_lower_trace(res) x_upper_trace(res) rel_tol(res) abs_tol(res) Defined for multivariate optimization: g_norm_trace(res) g_calls(res) x_converged(res) f_converged(res) g_converged(res) initial_state(res)","title":"Complete list of functions"},{"location":"user/minimization/#input-types","text":"Most users will input Vector 's as their initial_x 's, and get an Optim.minimizer(res) out that is also a vector. For zeroth and first order methods, it is also possible to pass in matrices, or even higher dimensional arrays. The only restriction imposed by leaving the Vector case is, that it is no longer possible to use finite difference approximations or automatic differentiation. Second order methods (variants of Newton's method) do not support this more general input type.","title":"Input types"},{"location":"user/minimization/#notes-on-convergence-flags-and-checks","text":"Currently, it is possible to access a minimizer using Optim.minimizer(result) even if all convergence flags are false . This means that the user has to be a bit careful when using the output from the solvers. It is advised to include checks for convergence if the minimizer or minimum is used to carry out further calculations. A related note is that first and second order methods makes a convergence check on the gradient before entering the optimization loop. This is done to prevent line search errors if initial_x is a stationary point. Notice, that this is only a first order check. If initial_x is any type of stationary point, g_converged will be true. This includes local minima, saddle points, and local maxima. 
If iterations is 0 and g_converged is true , the user needs to keep this point in mind.","title":"Notes on convergence flags and checks"},{"location":"user/tipsandtricks/","text":"Dealing with constant parameters In many applications, there may be factors that are relevant to the function evaluations, but are fixed throughout the optimization. An obvious example is using data in a likelihood function, but it could also be parameters we wish to hold constant. Consider a squared error loss function that depends on some data x and y , and parameters betas . As far as the solver is concerned, there should only be one input argument to the function we want to minimize, call it sqerror . The problem is that we want to optimize a function sqerror that really depends on three inputs, and two of them are constant throughout the optimization procedure. To do this, we need to define the variables x and y x = [1.0, 2.0, 3.0] y = 1.0 .+ 2.0 * x + [-0.3, 0.3, -0.1] We then simply define a function in three variables function sqerror(betas, X, Y) err = 0.0 for i in 1:length(X) pred_i = betas[1] + betas[2] * X[i] err += (Y[i] - pred_i)^2 end return err end and then optimize the following anonymous function res = optimize(b -> sqerror(b, x, y), [0.0, 0.0]) Alternatively, we can define a closure sqerror(betas) that is aware of the variables we just defined function sqerror(betas) err = 0.0 for i in 1:length(x) pred_i = betas[1] + betas[2] * x[i] err += (y[i] - pred_i)^2 end return err end We can then optimize the sqerror function just like any other function res = optimize(sqerror, [0.0, 0.0]) Avoid repeating computations Say you are optimizing a function f(x) = x[1]^2+x[2]^2 g!(storage, x) = copyto!(storage, [2x[1], 2x[2]]) In this situation, no calculations from f could be reused in g! . However, sometimes there is a substantial similarity between the objective function, and gradient, and some calculations can be reused. To avoid repeating calculations, define functions fg! or fgh! 
that compute the objective function, the gradient and the Hessian (if needed) simultaneously. These functions internally can be written to avoid repeating common calculations. For example, here we define a function fg! to compute the objective function and the gradient, as required: function fg!(F,G,x) # do common computations here # ... if G != nothing # code to compute gradient here # writing the result to the vector G end if F != nothing # value = ... code to compute objective function return value end end Optim will only call this function with an argument G that is nothing (if the gradient is not required) or a Vector that should be filled (in-place) with the gradient. This flexibility is convenient for algorithms that only use the gradient in some iterations but not in others. Now we call optimize with the following syntax: Optim.optimize(Optim.only_fg!(fg!), [0., 0.], Optim.LBFGS()) Similarly, for a computation that requires the Hessian, we can write: function fgh!(F,G,H,x) G == nothing || # compute gradient and store in G H == nothing || # compute Hessian and store in H F == nothing || return f(x) nothing end Optim.optimize(Optim.only_fgh!(fgh!), [0., 0.], Optim.Newton()) Provide gradients As mentioned in the general introduction, passing analytical gradients can have an impact on performance. To show an example of this, consider the separable extension of the Rosenbrock function in dimension 5000, see SROSENBR in CUTEst. Below, we use the gradients and objective functions from mastsif through CUTEst.jl . We only show the first five iterations of an attempt to minimize the function using Gradient Descent. 
julia> @time optimize(f, initial_x, GradientDescent(), Optim.Options(show_trace=true, iterations = 5)) Iter Function value Gradient norm 0 4.850000e+04 2.116000e+02 1 1.018734e+03 2.704951e+01 2 3.468449e+00 5.721261e-01 3 2.966899e+00 2.638790e-02 4 2.511859e+00 5.237768e-01 5 2.107853e+00 1.020287e-01 21.731129 seconds (1.61 M allocations: 63.434 MB, 0.03% gc time) Results of Optimization Algorithm * Algorithm: Gradient Descent * Starting Point: [1.2,1.0, ...] * Minimizer: [1.0287767703731154,1.058769439356144, ...] * Minimum: 2.107853e+00 * Iterations: 5 * Convergence: false * |x - x'| < 0.0: false * |f(x) - f(x')| / |f(x)| < 0.0: false * |g(x)| < 1.0e-08: false * Reached Maximum Number of Iterations: true * Objective Function Calls: 23 * Gradient Calls: 23 julia> @time optimize(f, g!, initial_x, GradientDescent(), Optim.Options(show_trace=true, iterations = 5)) Iter Function value Gradient norm 0 4.850000e+04 2.116000e+02 1 1.018769e+03 2.704998e+01 2 3.468488e+00 5.721481e-01 3 2.966900e+00 2.638792e-02 4 2.511828e+00 5.237919e-01 5 2.107802e+00 1.020415e-01 0.009889 seconds (915 allocations: 270.266 KB) Results of Optimization Algorithm * Algorithm: Gradient Descent * Starting Point: [1.2,1.0, ...] * Minimizer: [1.0287763814102757,1.05876866832087, ...] * Minimum: 2.107802e+00 * Iterations: 5 * Convergence: false * |x - x'| < 0.0: false * |f(x) - f(x')| / |f(x)| < 0.0: false * |g(x)| < 1.0e-08: false * Reached Maximum Number of Iterations: true * Objective Function Calls: 23 * Gradient Calls: 23 The objective reaches nearly the same value in both runs, but the run with the analytical gradient is much faster. It is possible that the finite differences code can be improved, but generally the optimization will be slowed down by all the function evaluations required to do the central finite differences calculations. Separating time spent in Optim's code and user provided functions Consider the Rosenbrock problem. 
using Optim prob = Optim.UnconstrainedProblems.examples[\"Rosenbrock\"]; Say we optimize this function, and look at the total run time of optimize using the Newton Trust Region method, and we are surprised that it takes a long time to run. We then wonder if time is spent in Optim's own code (solving the sub-problem for example) or in evaluating the objective, gradient or Hessian that we provided. Then it can be very useful to use the TimerOutputs.jl package. This package allows us to run an overall timer for optimize , and add individual timers for f , g! , and h! . Consider the example below, that is due to the author of the package (Kristoffer Carlsson). using TimerOutputs const to = TimerOutput() f(x) = @timeit to \"f\" prob.f(x) g!(x, g) = @timeit to \"g!\" prob.g!(x, g) h!(x, h) = @timeit to \"h!\" prob.h!(x, h) begin reset_timer!(to) @timeit to \"Trust Region\" begin res = Optim.optimize(f, g!, h!, prob.initial_x, NewtonTrustRegion()) end show(to; allocations = false) end We see that the time is actually not spent in our provided functions, but most of the time is spent in the code for the trust region method. Early stopping Sometimes it might be of interest to stop the optimizer early. The simplest way to do this is to set the iterations keyword in Optim.Options to some number. This will prevent the iteration counter exceeding some limit, with the standard value being 1000. Alternatively, it is possible to put a soft limit on the run time of the optimization procedure by setting the time_limit keyword in the Optim.Options constructor. using Optim problem = Optim.UnconstrainedProblems.examples[\"Rosenbrock\"] f = problem.f initial_x = problem.initial_x function slow(x) sleep(0.1) f(x) end start_time = time() optimize(slow, zeros(2), NelderMead(), Optim.Options(time_limit = 3.0)) This will stop after about three seconds. 
If it is more important that we stop before the limit is reached, it is possible to use a callback with a simple model for predicting how much time will have passed when the next iteration is over. Consider the following code using Optim problem = Optim.UnconstrainedProblems.examples[\"Rosenbrock\"] f = problem.f initial_x = problem.initial_x function very_slow(x) sleep(.5) f(x) end start_time = time() time_to_setup = zeros(1) function advanced_time_control(x) println(\" * Iteration: \", x.iteration) so_far = time()-start_time println(\" * Time so far: \", so_far) if x.iteration == 0 time_to_setup[1] = time()-start_time else expected_next_time = so_far + (time()-start_time-time_to_setup[1])/(x.iteration) println(\" * Next iteration \u2248 \", expected_next_time) println() return expected_next_time < 13 ? false : true end println() false end optimize(very_slow, zeros(2), NelderMead(), Optim.Options(callback = advanced_time_control)) It will try to predict the elapsed time after the next iteration is over, and stop now if it is expected to exceed the limit of 13 seconds. 
Running it, we get something like the following output julia> optimize(very_slow, zeros(2), NelderMead(), Optim.Options(callback = advanced_time_control)) * Iteration: 0 * Time so far: 2.219298839569092 * Iteration: 1 * Time so far: 3.4006409645080566 * Next iteration \u2248 4.5429909229278564 * Iteration: 2 * Time so far: 4.403923988342285 * Next iteration \u2248 5.476739525794983 * Iteration: 3 * Time so far: 5.407265901565552 * Next iteration \u2248 6.4569235642751055 * Iteration: 4 * Time so far: 5.909044027328491 * Next iteration \u2248 6.821732044219971 * Iteration: 5 * Time so far: 6.912338972091675 * Next iteration \u2248 7.843148183822632 * Iteration: 6 * Time so far: 7.9156060218811035 * Next iteration \u2248 8.85849153995514 * Iteration: 7 * Time so far: 8.918903827667236 * Next iteration \u2248 9.870419979095459 * Iteration: 8 * Time so far: 9.922197818756104 * Next iteration \u2248 10.880185931921005 * Iteration: 9 * Time so far: 10.925468921661377 * Next iteration \u2248 11.888488478130764 * Iteration: 10 * Time so far: 11.92870283126831 * Next iteration \u2248 12.895747828483582 * Iteration: 11 * Time so far: 12.932114839553833 * Next iteration \u2248 13.902462200684981 Results of Optimization Algorithm * Algorithm: Nelder-Mead * Starting Point: [0.0,0.0] * Minimizer: [0.23359374999999996,0.042187499999999996, ...] * Minimum: 6.291677e-01 * Iterations: 11 * Convergence: false * \u221a(\u03a3(y\u1d62-y\u0304)\u00b2)/n < 1.0e-08: false * Reached Maximum Number of Iterations: false * Objective Function Calls: 24","title":"Tips and tricks"},{"location":"user/tipsandtricks/#dealing-with-constant-parameters","text":"In many applications, there may be factors that are relevant to the function evaluations, but are fixed throughout the optimization. An obvious example is using data in a likelihood function, but it could also be parameters we wish to hold constant. Consider a squared error loss function that depends on some data x and y , and parameters betas . 
As far as the solver is concerned, there should only be one input argument to the function we want to minimize, call it sqerror . The problem is that we want to optimize a function sqerror that really depends on three inputs, and two of them are constant throughout the optimization procedure. To do this, we need to define the variables x and y x = [1.0, 2.0, 3.0] y = 1.0 .+ 2.0 * x + [-0.3, 0.3, -0.1] We then simply define a function in three variables function sqerror(betas, X, Y) err = 0.0 for i in 1:length(X) pred_i = betas[1] + betas[2] * X[i] err += (Y[i] - pred_i)^2 end return err end and then optimize the following anonymous function res = optimize(b -> sqerror(b, x, y), [0.0, 0.0]) Alternatively, we can define a closure sqerror(betas) that is aware of the variables we just defined function sqerror(betas) err = 0.0 for i in 1:length(x) pred_i = betas[1] + betas[2] * x[i] err += (y[i] - pred_i)^2 end return err end We can then optimize the sqerror function just like any other function res = optimize(sqerror, [0.0, 0.0])","title":"Dealing with constant parameters"},{"location":"user/tipsandtricks/#avoid-repeating-computations","text":"Say you are optimizing a function f(x) = x[1]^2+x[2]^2 g!(storage, x) = copyto!(storage, [2x[1], 2x[2]]) In this situation, no calculations from f could be reused in g! . However, sometimes there is a substantial similarity between the objective function, and gradient, and some calculations can be reused. To avoid repeating calculations, define functions fg! or fgh! that compute the objective function, the gradient and the Hessian (if needed) simultaneously. These functions internally can be written to avoid repeating common calculations. For example, here we define a function fg! to compute the objective function and the gradient, as required: function fg!(F,G,x) # do common computations here # ... if G != nothing # code to compute gradient here # writing the result to the vector G end if F != nothing # value = ... 
code to compute objective function return value end end Optim will only call this function with an argument G that is nothing (if the gradient is not required) or a Vector that should be filled (in-place) with the gradient. This flexibility is convenient for algorithms that only use the gradient in some iterations but not in others. Now we call optimize with the following syntax: Optim.optimize(Optim.only_fg!(fg!), [0., 0.], Optim.LBFGS()) Similarly, for a computation that requires the Hessian, we can write: function fgh!(F,G,H,x) G == nothing || # compute gradient and store in G H == nothing || # compute Hessian and store in H F == nothing || return f(x) nothing end Optim.optimize(Optim.only_fgh!(fgh!), [0., 0.], Optim.Newton())","title":"Avoid repeating computations"},{"location":"user/tipsandtricks/#provide-gradients","text":"As mentioned in the general introduction, passing analytical gradients can have an impact on performance. To show an example of this, consider the separable extension of the Rosenbrock function in dimension 5000, see SROSENBR in CUTEst. Below, we use the gradients and objective functions from mastsif through CUTEst.jl . We only show the first five iterations of an attempt to minimize the function using Gradient Descent. julia @time optimize(f, initial_x, GradientDescent(), Optim.Options(show_trace=true, iterations = 5)) Iter Function value Gradient norm 0 4.850000e+04 2.116000e+02 1 1.018734e+03 2.704951e+01 2 3.468449e+00 5.721261e-01 3 2.966899e+00 2.638790e-02 4 2.511859e+00 5.237768e-01 5 2.107853e+00 1.020287e-01 21.731129 seconds (1.61 M allocations: 63.434 MB, 0.03% gc time) Results of Optimization Algorithm * Algorithm: Gradient Descent * Starting Point: [1.2,1.0, ...] * Minimizer: [1.0287767703731154,1.058769439356144, ...] 
* Minimum: 2.107853e+00 * Iterations: 5 * Convergence: false * |x - x'| < 0.0: false * |f(x) - f(x')| / |f(x)| < 0.0: false * |g(x)| < 1.0e-08: false * Reached Maximum Number of Iterations: true * Objective Function Calls: 23 * Gradient Calls: 23 julia> @time optimize(f, g!, initial_x, GradientDescent(), Optim.Options(show_trace=true, iterations = 5)) Iter Function value Gradient norm 0 4.850000e+04 2.116000e+02 1 1.018769e+03 2.704998e+01 2 3.468488e+00 5.721481e-01 3 2.966900e+00 2.638792e-02 4 2.511828e+00 5.237919e-01 5 2.107802e+00 1.020415e-01 0.009889 seconds (915 allocations: 270.266 KB) Results of Optimization Algorithm * Algorithm: Gradient Descent * Starting Point: [1.2,1.0, ...] * Minimizer: [1.0287763814102757,1.05876866832087, ...] * Minimum: 2.107802e+00 * Iterations: 5 * Convergence: false * |x - x'| < 0.0: false * |f(x) - f(x')| / |f(x)| < 0.0: false * |g(x)| < 1.0e-08: false * Reached Maximum Number of Iterations: true * Objective Function Calls: 23 * Gradient Calls: 23 The objective reaches a very similar value in both runs, but the run with the analytical gradient is far faster. It is possible that the finite differences code can be improved, but generally the optimization will be slowed down by all the function evaluations required to do the central finite differences calculations.","title":"Provide gradients"},{"location":"user/tipsandtricks/#separating-time-spent-in-optims-code-and-user-provided-functions","text":"Consider the Rosenbrock problem. using Optim prob = Optim.UnconstrainedProblems.examples[\"Rosenbrock\"]; Say we optimize this function using the Newton Trust Region method, look at the total run time of optimize , and are surprised that it takes a long time to run. We then wonder whether the time is spent in Optim's own code (solving the sub-problem, for example) or in evaluating the objective, gradient or Hessian that we provided. In that case it can be very useful to use the TimerOutputs.jl package.
This package allows us to run an overall timer for optimize , and add individual timers for f , g! , and h! . Consider the example below, which is due to the author of the package (Kristoffer Carlsson). using TimerOutputs const to = TimerOutput() f(x ) = @timeit to \"f\" prob.f(x) g!(x, g) = @timeit to \"g!\" prob.g!(x, g) h!(x, h) = @timeit to \"h!\" prob.h!(x, h) begin reset_timer!(to) @timeit to \"Trust Region\" begin res = Optim.optimize(f, g!, h!, prob.initial_x, NewtonTrustRegion()) end show(to; allocations = false) end We see that the time is actually not spent in our provided functions, but most of the time is spent in the code for the trust region method.","title":"Separating time spent in Optim's code and user provided functions"},{"location":"user/tipsandtricks/#early-stopping","text":"Sometimes it might be of interest to stop the optimizer early. The simplest way to do this is to set the iterations keyword in Optim.Options to some number. This prevents the iteration counter from exceeding that limit; the default value is 1000. Alternatively, it is possible to put a soft limit on the run time of the optimization procedure by setting the time_limit keyword in the Optim.Options constructor. using Optim problem = Optim.UnconstrainedProblems.examples[\"Rosenbrock\"] f = problem.f initial_x = problem.initial_x function slow(x) sleep(0.1) f(x) end start_time = time() optimize(slow, zeros(2), NelderMead(), Optim.Options(time_limit = 3.0)) This will stop after about three seconds. If it is more important that we stop before the limit is reached, it is possible to use a callback with a simple model for predicting how much time will have passed when the next iteration is over.
Consider the following code using Optim problem = Optim.UnconstrainedProblems.examples[\"Rosenbrock\"] f = problem.f initial_x = problem.initial_x function very_slow(x) sleep(.5) f(x) end start_time = time() time_to_setup = zeros(1) function advanced_time_control(x) println(\" * Iteration: \", x.iteration) so_far = time()-start_time println(\" * Time so far: \", so_far) if x.iteration == 0 time_to_setup .= time()-start_time else expected_next_time = so_far + (time()-start_time-time_to_setup[1])/(x.iteration) println(\" * Next iteration \u2248 \", expected_next_time) println() return expected_next_time < 13 ? false : true end println() false end optimize(very_slow, zeros(2), NelderMead(), Optim.Options(callback = advanced_time_control)) The callback tries to predict the elapsed time after the next iteration is over, and stops immediately if that is expected to exceed the limit of 13 seconds. Running it, we get something like the following output julia> optimize(very_slow, zeros(2), NelderMead(), Optim.Options(callback = advanced_time_control)) * Iteration: 0 * Time so far: 2.219298839569092 * Iteration: 1 * Time so far: 3.4006409645080566 * Next iteration \u2248 4.5429909229278564 * Iteration: 2 * Time so far: 4.403923988342285 * Next iteration \u2248 5.476739525794983 * Iteration: 3 * Time so far: 5.407265901565552 * Next iteration \u2248 6.4569235642751055 * Iteration: 4 * Time so far: 5.909044027328491 * Next iteration \u2248 6.821732044219971 * Iteration: 5 * Time so far: 6.912338972091675 * Next iteration \u2248 7.843148183822632 * Iteration: 6 * Time so far: 7.9156060218811035 * Next iteration \u2248 8.85849153995514 * Iteration: 7 * Time so far: 8.918903827667236 * Next iteration \u2248 9.870419979095459 * Iteration: 8 * Time so far: 9.922197818756104 * Next iteration \u2248 10.880185931921005 * Iteration: 9 * Time so far: 10.925468921661377 * Next iteration \u2248 11.888488478130764 * Iteration: 10 * Time so far: 11.92870283126831 * Next iteration \u2248 12.895747828483582 * Iteration: 11 *
Time so far: 12.932114839553833 * Next iteration \u2248 13.902462200684981 Results of Optimization Algorithm * Algorithm: Nelder-Mead * Starting Point: [0.0,0.0] * Minimizer: [0.23359374999999996,0.042187499999999996, ...] * Minimum: 6.291677e-01 * Iterations: 11 * Convergence: false * \u221a(\u03a3(y\u1d62-y\u0304)\u00b2)/n < 1.0e-08: false * Reached Maximum Number of Iterations: false * Objective Function Calls: 24","title":"Early stopping"}]}