{
"docs": [
{
"location": "/",
"text": "Optim.jl\n\n\n\n\nWhat\n\n\nOptim is a Julia package for optimizing functions of various kinds. While there is some support for box constrained and Riemannian optimization, most of the solvers try to find an $x$ that minimizes a function $f(x)$ without any constraints. Thus, the main focus is on unconstrained optimization. The provided solvers, under certain conditions, will converge to a local minimum. In the case where a global minimum is desired, global optimization techniques should be employed instead (see e.g. \nBlackBoxOptim\n).\n\n\n\n\nWhy\n\n\nThere are many solvers available from both free and commercial sources, and many of them are accessible from Julia. Few of them are written in Julia. Performance-wise this is rarely a problem, as they are often written in either Fortran or C. However, solvers written directly in Julia do come with some advantages.\n\n\nWhen writing Julia software (packages) that requires something to be optimized, the programmer can either choose to write their own optimization routine, or use one of the many available solvers. For example, this could be something from the \nNLOpt\n suite. This means adding a dependency which is not written in Julia, and more assumptions have to be made as to the environment the user is in. Does the user have the proper compilers? Is it possible to use GPL'ed code in the project? Optim is released under the MIT license, and installation is a simple \nPkg.add\n, so it really doesn't get much freer, easier, or more lightweight than that.\n\n\nIt is also true that using a solver written in C or Fortran makes it impossible to leverage one of the main benefits of Julia: multiple dispatch. Since Optim is entirely written in Julia, we can currently use the dispatch system to ease the use of custom preconditioners. 
A planned feature along these lines is to allow for user-controlled choice of solvers for various steps in the algorithm, entirely based on dispatch, and not predefined possibilities chosen by the developers of Optim.\n\n\nBeing a Julia package also means that Optim has access to the automatic differentiation features through the packages in \nJuliaDiff\n.\n\n\n\n\nHow\n\n\nOptim is registered in \nMETADATA.jl\n. This means that all you need to do to install Optim is to run\n\n\nPkg\n.\nadd\n(\n\"Optim\"\n)\n\n\n\n\n\n\n\n\nBut...\n\n\nOptim is a work in progress. There are still some rough edges to be sanded down, and features we want to implement. There are also planned breaking changes that are good to be aware of. Please see the section on Planned Changes.",
"title": "Home"
},
{
"location": "/#optimjl",
"text": "",
"title": "Optim.jl"
},
{
"location": "/#what",
"text": "Optim is a Julia package for optimizing functions of various kinds. While there is some support for box constrained and Riemannian optimization, most of the solvers try to find an $x$ that minimizes a function $f(x)$ without any constraints. Thus, the main focus is on unconstrained optimization. The provided solvers, under certain conditions, will converge to a local minimum. In the case where a global minimum is desired, global optimization techniques should be employed instead (see e.g. BlackBoxOptim ).",
"title": "What"
},
{
"location": "/#why",
"text": "There are many solvers available from both free and commercial sources, and many of them are accessible from Julia. Few of them are written in Julia. Performance-wise this is rarely a problem, as they are often written in either Fortran or C. However, solvers written directly in Julia do come with some advantages. When writing Julia software (packages) that requires something to be optimized, the programmer can either choose to write their own optimization routine, or use one of the many available solvers. For example, this could be something from the NLOpt suite. This means adding a dependency which is not written in Julia, and more assumptions have to be made as to the environment the user is in. Does the user have the proper compilers? Is it possible to use GPL'ed code in the project? Optim is released under the MIT license, and installation is a simple Pkg.add , so it really doesn't get much freer, easier, or more lightweight than that. It is also true that using a solver written in C or Fortran makes it impossible to leverage one of the main benefits of Julia: multiple dispatch. Since Optim is entirely written in Julia, we can currently use the dispatch system to ease the use of custom preconditioners. A planned feature along these lines is to allow for user-controlled choice of solvers for various steps in the algorithm, entirely based on dispatch, and not predefined possibilities chosen by the developers of Optim. Being a Julia package also means that Optim has access to the automatic differentiation features through the packages in JuliaDiff .",
"title": "Why"
},
{
"location": "/#how",
"text": "Optim is registered in METADATA.jl . This means that all you need to do to install Optim is to run Pkg . add ( \"Optim\" )",
"title": "How"
},
{
"location": "/#but",
"text": "Optim is a work in progress. There are still some rough edges to be sanded down, and features we want to implement. There are also planned breaking changes that are good to be aware of. Please see the section on Planned Changes.",
"title": "But..."
},
{
"location": "/user/minimization/",
"text": "Minimizing a multivariate function\n\n\nTo show how the Optim package can be used, we implement the \nRosenbrock function\n, a classic problem in numerical optimization. We'll assume that you've already installed the Optim package using Julia's package manager. First, we load Optim and define the Rosenbrock function:\n\n\nusing\n \nOptim\n\n\nf\n(\nx\n)\n \n=\n \n(\n1.0\n \n-\n \nx\n[\n1\n])\n^\n2\n \n+\n \n100.0\n \n*\n \n(\nx\n[\n2\n]\n \n-\n \nx\n[\n1\n]\n^\n2\n)\n^\n2\n\n\n\n\n\n\nOnce we've defined this function, we can find the minimum of the Rosenbrock function using any of our favorite optimization algorithms. With a function defined, we just specify an initial point \nx\n and run:\n\n\noptimize\n(\nf\n,\n \n[\n0.0\n,\n \n0.0\n])\n\n\n\n\n\n\n!!! note\n It is important to pass \ninitial_x\n as an array. If your problem is one-dimensional, you have to wrap it in an array. An easy way to do so is to write \noptimize(x -> f(first(x)), [initial_x], ...)\n\n\nOptim will default to using the Nelder-Mead method in this case, as we did not provide a gradient. This can also be explicitly specified using:\n\n\noptimize\n(\nf\n,\n \n[\n0.0\n,\n \n0.0\n],\n \nNelderMead\n())\n\n\n\n\n\n\nOther solvers are available. Below, we use L-BFGS, a quasi-Newton method that requires a gradient. If we pass \nf\n alone, Optim will construct an approximate gradient for us using central finite differencing:\n\n\noptimize\n(\nf\n,\n \n[\n0.0\n,\n \n0.0\n],\n \nLBFGS\n())\n\n\n\n\n\n\nFor better performance and greater precision, you can pass your own gradient function. 
For the Rosenbrock example, the analytical gradient can be shown to be:\n\n\nfunction\n \ng!\n(\nstorage\n,\n \nx\n)\n\n\nstorage\n[\n1\n]\n \n=\n \n-\n2.0\n \n*\n \n(\n1.0\n \n-\n \nx\n[\n1\n])\n \n-\n \n400.0\n \n*\n \n(\nx\n[\n2\n]\n \n-\n \nx\n[\n1\n]\n^\n2\n)\n \n*\n \nx\n[\n1\n]\n\n\nstorage\n[\n2\n]\n \n=\n \n200.0\n \n*\n \n(\nx\n[\n2\n]\n \n-\n \nx\n[\n1\n]\n^\n2\n)\n\n\nend\n\n\n\n\n\n\nNote that the functions we're using to calculate the gradient (and later the Hessian \nh!\n) of the Rosenbrock function mutate a fixed-sized storage array, which is passed as an additional argument called \nstorage\n. By mutating a single array over many iterations, this style of function definition removes the sometimes considerable costs associated with allocating a new array during each call to the \ng!\n or \nh!\n functions. You can use \nOptim\n without manually defining a gradient or Hessian function, but if you do define these functions, they must take these two arguments in this order. Returning to our optimization problem, you simply pass \ng!\n together with \nf\n from before to use the gradient:\n\n\noptimize\n(\nf\n,\n \ng!\n,\n \n[\n0.0\n,\n \n0.0\n],\n \nLBFGS\n())\n\n\n\n\n\n\nFor some methods, like simulated annealing, the gradient will be ignored:\n\n\noptimize\n(\nf\n,\n \ng!\n,\n \n[\n0.0\n,\n \n0.0\n],\n \nSimulatedAnnealing\n())\n\n\n\n\n\n\nIn addition to providing gradients, you can provide a Hessian function \nh!\n as well. 
In our current case this is:\n\n\nfunction\n \nh!\n(\nstorage\n,\n \nx\n)\n\n \nstorage\n[\n1\n,\n \n1\n]\n \n=\n \n2.0\n \n-\n \n400.0\n \n*\n \nx\n[\n2\n]\n \n+\n \n1200.0\n \n*\n \nx\n[\n1\n]\n^\n2\n\n \nstorage\n[\n1\n,\n \n2\n]\n \n=\n \n-\n400.0\n \n*\n \nx\n[\n1\n]\n\n \nstorage\n[\n2\n,\n \n1\n]\n \n=\n \n-\n400.0\n \n*\n \nx\n[\n1\n]\n\n \nstorage\n[\n2\n,\n \n2\n]\n \n=\n \n200.0\n\n\nend\n\n\n\n\n\n\nNow we can use Newton's method for optimization by running:\n\n\noptimize\n(\nf\n,\n \ng!\n,\n \nh!\n,\n \n[\n0.0\n,\n \n0.0\n])\n\n\n\n\n\n\nWhich defaults to \nNewton()\n since a Hessian was provided. Like gradients, the Hessian function will be ignored if you use a method that does not require it:\n\n\noptimize\n(\nf\n,\n \ng!\n,\n \nh!\n,\n \n[\n0.0\n,\n \n0.0\n],\n \nLBFGS\n())\n\n\n\n\n\n\nNote that Optim will not generate approximate Hessians using finite differencing because of the potentially low accuracy of approximations to the Hessians. Other than Newton's method, none of the algorithms provided by the Optim package employ exact Hessians.\n\n\n\n\nBox minimization\n\n\nA primal interior-point algorithm for simple \"box\" constraints (lower and upper bounds) is also available. Reusing our Rosenbrock example from above, boxed minimization is performed as follows:\n\n\nlower\n \n=\n \n[\n1.25\n,\n \n-\n2.1\n]\n\n\nupper\n \n=\n \n[\nInf\n,\n \nInf\n]\n\n\ninitial_x\n \n=\n \n[\n2.0\n,\n \n2.0\n]\n\n\nod\n \n=\n \nOnceDifferentiable\n(\nf\n,\n \ng!\n,\n \ninitial_x\n)\n\n\nresults\n \n=\n \noptimize\n(\nod\n,\n \ninitial_x\n,\n \nlower\n,\n \nupper\n,\n \nFminbox\n{\nGradientDescent\n}())\n\n\n\n\n\n\nThis performs optimization with a barrier penalty, successively scaling down the barrier coefficient and using the chosen \noptimizer\n (\nGradientDescent\n above) for convergence at each step. 
Notice that the \nOptimizer\n type, not an instance, should be passed (\nGradientDescent\n, not \nGradientDescent()\n).\n\n\nThis algorithm uses diagonal preconditioning to improve the accuracy, and hence is a good example of how to use \nConjugateGradient\n or \nLBFGS\n with preconditioning. Other methods will currently not use preconditioning. Only the box constraints are used. If you can analytically compute the diagonal of the Hessian of your objective function, you may want to consider writing your own preconditioner.\n\n\nThere are two iteration parameters: an outer iterations parameter used to control \nFminbox\n and an inner iterations parameter used to control the inner optimizer. For this reason, the options syntax is a bit different from the rest of the package. All parameters regarding the outer iterations are passed as keyword arguments, and options for the interior optimizer are passed as an \nOptim.Options\n type using the keyword \noptimizer_o\n.\n\n\nFor example, the following restricts the optimization to 2 major iterations\n\n\nod\n \n=\n \nOnceDifferentiable\n(\nf\n,\n \ng!\n,\n \ninitial_x\n)\n\n\nresults\n \n=\n \noptimize\n(\nod\n,\n \ninitial_x\n,\n \nlower\n,\n \nupper\n,\n \nFminbox\n{\nGradientDescent\n}();\n \niterations\n \n=\n \n2\n)\n\n\n\n\n\n\nIn contrast, the following sets the maximum number of iterations for each \nGradientDescent\n optimization to 2\n\n\nod\n \n=\n \nOnceDifferentiable\n(\nf\n,\n \ng!\n,\n \ninitial_x\n)\n\n\nresults\n \n=\n \nOptim\n.\noptimize\n(\nod\n,\n \ninitial_x\n,\n \nlower\n,\n \nupper\n,\n \nFminbox\n{\nGradientDescent\n}();\n \noptimizer_o\n \n=\n \nOptim\n.\nOptions\n(\niterations\n \n=\n \n2\n))\n\n\n\n\n\n\n\n\nMinimizing a univariate function on a bounded interval\n\n\nMinimization of univariate functions without derivatives is available through the \noptimize\n interface:\n\n\n \noptimize\n(\nf\n,\n \nlower\n,\n \nupper\n,\n \nmethod\n;\n \nkwargs\n...\n)\n\n\n\n\n\n\nNotice the lack of initial 
\nx\n. A specific example is the following quadratic function.\n\n\njulia>\n \nf_univariate\n(\nx\n)\n \n=\n \n2\nx\n^\n2\n+\n3\nx\n+\n1\n\n\nf_univariate\n \n(\ngeneric\n \nfunction\n \nwith\n \n1\n \nmethod\n)\n\n\n\njulia>\n \noptimize\n(\nf_univariate\n,\n \n-\n2.0\n,\n \n1.0\n)\n\n\nResults\n \nof\n \nOptimization\n \nAlgorithm\n\n \n*\n \nAlgorithm\n:\n \nBrent's\n \nMethod\n\n \n*\n \nSearch\n \nInterval\n:\n \n[\n-\n2.000000\n,\n \n1.000000\n]\n\n \n*\n \nMinimizer\n:\n \n-\n7.500000e-01\n\n \n*\n \nMinimum\n:\n \n-\n1.250000e-01\n\n \n*\n \nIterations\n:\n \n7\n\n \n*\n \nConvergence\n:\n \nmax\n(\n|\nx\n \n-\n \nx_upper\n|\n,\n \n|\nx\n \n-\n \nx_lower\n|\n)\n \n<=\n \n2\n*\n(\n1.5e-08\n*|\nx\n|+\n2.2e-16\n)\n:\n \ntrue\n\n \n*\n \nObjective\n \nFunction\n \nCalls\n:\n \n8\n\n\n\n\n\n\nThe output shows that we provided an initial lower and upper bound, that there is a final minimizer and minimum, and that it used seven major iterations. Importantly, we also see that convergence was declared. The default method is Brent's method, which is one of two available methods:\n\n\n\n\nBrent's method, the default (can be explicitly selected with \nBrent()\n).\n\n\nGolden section search, available with \nGoldenSection()\n.\n\n\n\n\nIf we want to manually specify this method, we use the usual syntax as for multivariate optimization.\n\n\n \noptimize\n(\nf\n,\n \nlower\n,\n \nupper\n,\n \nBrent\n();\n \nkwargs\n...\n)\n\n \noptimize\n(\nf\n,\n \nlower\n,\n \nupper\n,\n \nGoldenSection\n();\n \nkwargs\n...\n)\n\n\n\n\n\n\nKeywords are used to set options for this special type of optimization. In addition to the \niterations\n, \nstore_trace\n, \nshow_trace\n and \nextended_trace\n options, the following options are also available:\n\n\n\n\nrel_tol\n: The relative tolerance used for determining convergence. Defaults to \nsqrt(eps(T))\n.\n\n\nabs_tol\n: The absolute tolerance used for determining convergence. 
Defaults to \neps(T)\n.\n\n\n\n\n\n\nObtaining results\n\n\nAfter we have our results in \nres\n, we can use the API for getting optimization results. This consists of a collection of functions. They are not exported, so they have to be prefixed by \nOptim.\n. Say we do the following optimization:\n\n\nres\n \n=\n \noptimize\n(\nx\n->\ndot\n(\nx\n,[\n1\n \n0.\n \n0\n;\n \n0\n \n3\n \n0\n;\n \n0\n \n0\n \n1\n]\n*\nx\n),\n \nzeros\n(\n3\n))\n\n\n\n\n\n\nIf we can't remember what method we used, we simply use\n\n\nOptim\n.\nsummary\n(\nres\n)\n\n\n\n\n\n\nwhich will return \n\"Nelder Mead\"\n. More useful information is the minimizer and minimum of the objective function, which can be found using\n\n\njulia>\n \nOptim\n.\nminimizer\n(\nres\n)\n\n\n3-element Array{Float64,1}:\n\n\n -0.499921\n\n\n -0.3333\n\n\n -1.49994\n\n\n\njulia>\n \nOptim\n.\nminimum\n(\nres\n)\n\n\n -2.8333333205768865\n\n\n\n\n\n\n\n\nComplete list of functions\n\n\nA complete list of functions can be found below.\n\n\nDefined for all methods:\n\n\n\n\nsummary(res)\n\n\nminimizer(res)\n\n\nminimum(res)\n\n\niterations(res)\n\n\niteration_limit_reached(res)\n\n\ntrace(res)\n\n\nx_trace(res)\n\n\nf_trace(res)\n\n\nf_calls(res)\n\n\nconverged(res)\n\n\n\n\nDefined for univariate optimization:\n\n\n\n\nlower_bound(res)\n\n\nupper_bound(res)\n\n\nx_lower_trace(res)\n\n\nx_upper_trace(res)\n\n\nrel_tol(res)\n\n\nabs_tol(res)\n\n\n\n\nDefined for multivariate optimization:\n\n\n\n\ng_norm_trace(res)\n\n\ng_calls(res)\n\n\nx_converged(res)\n\n\nf_converged(res)\n\n\ng_converged(res)\n\n\ninitial_state(res)\n\n\n\n\n\n\nInput types\n\n\nMost users will input \nVector\n's as their \ninitial_x\n's, and get an \nOptim.minimizer(res)\n out that is also a vector. For zeroth and first order methods, it is also possible to pass in matrices, or even higher dimensional arrays. 
The only restriction imposed by leaving the \nVector\n case is that it is no longer possible to use finite difference approximations or automatic differentiation. Second order methods (variants of Newton's method) do not support this more general input type.\n\n\n\n\nNotes on convergence flags and checks\n\n\nCurrently, it is possible to access a minimizer using \nOptim.minimizer(result)\n even if all convergence flags are \nfalse\n. This means that the user has to be a bit careful when using the output from the solvers. It is advised to include checks for convergence if the minimizer or minimum is used to carry out further calculations.\n\n\nA related note is that first and second order methods make a convergence check on the gradient before entering the optimization loop. This is done to prevent line search errors if \ninitial_x\n is a stationary point. Note that this is only a first order check. If \ninitial_x\n is any type of stationary point, \ng_converged\n will be true. This includes local minima, saddle points, and local maxima. If \niterations\n is \n0\n and \ng_converged\n is \ntrue\n, the user needs to keep this point in mind.",
"title": "Minimizing a function"
},
{
"location": "/user/minimization/#minimizing-a-multivariate-function",
"text": "To show how the Optim package can be used, we implement the Rosenbrock function , a classic problem in numerical optimization. We'll assume that you've already installed the Optim package using Julia's package manager. First, we load Optim and define the Rosenbrock function: using Optim f ( x ) = ( 1.0 - x [ 1 ]) ^ 2 + 100.0 * ( x [ 2 ] - x [ 1 ] ^ 2 ) ^ 2 Once we've defined this function, we can find the minimum of the Rosenbrock function using any of our favorite optimization algorithms. With a function defined, we just specify an initial point x and run: optimize ( f , [ 0.0 , 0.0 ]) !!! note\n It is important to pass initial_x as an array. If your problem is one-dimensional, you have to wrap it in an array. An easy way to do so is to write optimize(x -> f(first(x)), [initial_x], ...) Optim will default to using the Nelder-Mead method in this case, as we did not provide a gradient. This can also be explicitly specified using: optimize ( f , [ 0.0 , 0.0 ], NelderMead ()) Other solvers are available. Below, we use L-BFGS, a quasi-Newton method that requires a gradient. If we pass f alone, Optim will construct an approximate gradient for us using central finite differencing: optimize ( f , [ 0.0 , 0.0 ], LBFGS ()) For better performance and greater precision, you can pass your own gradient function. For the Rosenbrock example, the analytical gradient can be shown to be: function g! ( storage , x ) storage [ 1 ] = - 2.0 * ( 1.0 - x [ 1 ]) - 400.0 * ( x [ 2 ] - x [ 1 ] ^ 2 ) * x [ 1 ] storage [ 2 ] = 200.0 * ( x [ 2 ] - x [ 1 ] ^ 2 ) end Note that the functions we're using to calculate the gradient (and later the Hessian h! ) of the Rosenbrock function mutate a fixed-sized storage array, which is passed as an additional argument called storage . By mutating a single array over many iterations, this style of function definition removes the sometimes considerable costs associated with allocating a new array during each call to the g! or h! functions. 
You can use Optim without manually defining a gradient or Hessian function, but if you do define these functions, they must take these two arguments in this order. Returning to our optimization problem, you simply pass g! together with f from before to use the gradient: optimize ( f , g! , [ 0.0 , 0.0 ], LBFGS ()) For some methods, like simulated annealing, the gradient will be ignored: optimize ( f , g! , [ 0.0 , 0.0 ], SimulatedAnnealing ()) In addition to providing gradients, you can provide a Hessian function h! as well. In our current case this is: function h! ( storage , x ) \n storage [ 1 , 1 ] = 2.0 - 400.0 * x [ 2 ] + 1200.0 * x [ 1 ] ^ 2 \n storage [ 1 , 2 ] = - 400.0 * x [ 1 ] \n storage [ 2 , 1 ] = - 400.0 * x [ 1 ] \n storage [ 2 , 2 ] = 200.0 end Now we can use Newton's method for optimization by running: optimize ( f , g! , h! , [ 0.0 , 0.0 ]) Which defaults to Newton() since a Hessian was provided. Like gradients, the Hessian function will be ignored if you use a method that does not require it: optimize ( f , g! , h! , [ 0.0 , 0.0 ], LBFGS ()) Note that Optim will not generate approximate Hessians using finite differencing because of the potentially low accuracy of approximations to the Hessians. Other than Newton's method, none of the algorithms provided by the Optim package employ exact Hessians.",
"title": "Minimizing a multivariate function"
},
{
"location": "/user/minimization/#box-minimization",
"text": "A primal interior-point algorithm for simple \"box\" constraints (lower and upper bounds) is also available. Reusing our Rosenbrock example from above, boxed minimization is performed as follows: lower = [ 1.25 , - 2.1 ] upper = [ Inf , Inf ] initial_x = [ 2.0 , 2.0 ] od = OnceDifferentiable ( f , g! , initial_x ) results = optimize ( od , initial_x , lower , upper , Fminbox { GradientDescent }()) This performs optimization with a barrier penalty, successively scaling down the barrier coefficient and using the chosen optimizer ( GradientDescent above) for convergence at each step. Notice that the Optimizer type, not an instance, should be passed ( GradientDescent , not GradientDescent() ). This algorithm uses diagonal preconditioning to improve the accuracy, and hence is a good example of how to use ConjugateGradient or LBFGS with preconditioning. Other methods will currently not use preconditioning. Only the box constraints are used. If you can analytically compute the diagonal of the Hessian of your objective function, you may want to consider writing your own preconditioner. There are two iteration parameters: an outer iterations parameter used to control Fminbox and an inner iterations parameter used to control the inner optimizer. For this reason, the options syntax is a bit different from the rest of the package. All parameters regarding the outer iterations are passed as keyword arguments, and options for the interior optimizer are passed as an Optim.Options type using the keyword optimizer_o . For example, the following restricts the optimization to 2 major iterations od = OnceDifferentiable ( f , g! , initial_x ) results = optimize ( od , initial_x , lower , upper , Fminbox { GradientDescent }(); iterations = 2 ) In contrast, the following sets the maximum number of iterations for each GradientDescent optimization to 2 od = OnceDifferentiable ( f , g! , initial_x ) results = Optim . 
optimize ( od , initial_x , lower , upper , Fminbox { GradientDescent }(); optimizer_o = Optim . Options ( iterations = 2 ))",
"title": "Box minimization"
},
{
"location": "/user/minimization/#minimizing-a-univariate-function-on-a-bounded-interval",
"text": "Minimization of univariate functions without derivatives is available through the optimize interface: optimize ( f , lower , upper , method ; kwargs ... ) Notice the lack of initial x . A specific example is the following quadratic function. julia> f_univariate ( x ) = 2 x ^ 2 + 3 x + 1 f_univariate ( generic function with 1 method ) julia> optimize ( f_univariate , - 2.0 , 1.0 ) Results of Optimization Algorithm \n * Algorithm : Brent's Method \n * Search Interval : [ - 2.000000 , 1.000000 ] \n * Minimizer : - 7.500000e-01 \n * Minimum : - 1.250000e-01 \n * Iterations : 7 \n * Convergence : max ( | x - x_upper | , | x - x_lower | ) <= 2 * ( 1.5e-08 *| x |+ 2.2e-16 ) : true \n * Objective Function Calls : 8 The output shows that we provided an initial lower and upper bound, that there is a final minimizer and minimum, and that it used seven major iterations. Importantly, we also see that convergence was declared. The default method is Brent's method, which is one of two available methods: Brent's method, the default (can be explicitly selected with Brent() ). Golden section search, available with GoldenSection() . If we want to manually specify this method, we use the usual syntax as for multivariate optimization. optimize ( f , lower , upper , Brent (); kwargs ... ) \n optimize ( f , lower , upper , GoldenSection (); kwargs ... ) Keywords are used to set options for this special type of optimization. In addition to the iterations , store_trace , show_trace and extended_trace options, the following options are also available: rel_tol : The relative tolerance used for determining convergence. Defaults to sqrt(eps(T)) . abs_tol : The absolute tolerance used for determining convergence. Defaults to eps(T) .",
"title": "Minimizing a univariate function on a bounded interval"
},
{
"location": "/user/minimization/#obtaining-results",
"text": "After we have our results in res , we can use the API for getting optimization results. This consists of a collection of functions. They are not exported, so they have to be prefixed by Optim. . Say we do the following optimization: res = optimize ( x -> dot ( x ,[ 1 0. 0 ; 0 3 0 ; 0 0 1 ] * x ), zeros ( 3 )) If we can't remember what method we used, we simply use Optim . summary ( res ) which will return \"Nelder Mead\" . More useful information is the minimizer and minimum of the objective function, which can be found using julia> Optim . minimizer ( res ) 3-element Array{Float64,1}: -0.499921 -0.3333 -1.49994 julia> Optim . minimum ( res ) -2.8333333205768865",
"title": "Obtaining results"
},
{
"location": "/user/minimization/#complete-list-of-functions",
"text": "A complete list of functions can be found below. Defined for all methods: summary(res) minimizer(res) minimum(res) iterations(res) iteration_limit_reached(res) trace(res) x_trace(res) f_trace(res) f_calls(res) converged(res) Defined for univariate optimization: lower_bound(res) upper_bound(res) x_lower_trace(res) x_upper_trace(res) rel_tol(res) abs_tol(res) Defined for multivariate optimization: g_norm_trace(res) g_calls(res) x_converged(res) f_converged(res) g_converged(res) initial_state(res)",
"title": "Complete list of functions"
},
{
"location": "/user/minimization/#input-types",
"text": "Most users will input Vector 's as their initial_x 's, and get an Optim.minimizer(res) out that is also a vector. For zeroth and first order methods, it is also possible to pass in matrices, or even higher dimensional arrays. The only restriction imposed by leaving the Vector case is that it is no longer possible to use finite difference approximations or automatic differentiation. Second order methods (variants of Newton's method) do not support this more general input type.",
"title": "Input types"
},
{
"location": "/user/minimization/#notes-on-convergence-flags-and-checks",
"text": "Currently, it is possible to access a minimizer using Optim.minimizer(result) even if all convergence flags are false . This means that the user has to be a bit careful when using the output from the solvers. It is advised to include checks for convergence if the minimizer or minimum is used to carry out further calculations. A related note is that first and second order methods make a convergence check on the gradient before entering the optimization loop. This is done to prevent line search errors if initial_x is a stationary point. Note that this is only a first order check. If initial_x is any type of stationary point, g_converged will be true. This includes local minima, saddle points, and local maxima. If iterations is 0 and g_converged is true , the user needs to keep this point in mind.",
"title": "Notes on convergence flags and checks"
},
{
"location": "/user/config/",
"text": "Configurable options\n\n\nThere are several options that simply take on some default values if the user doesn't supply anything other than a function (and gradient) and a starting point.\n\n\n\n\nSolver options\n\n\nThere are quite a few different solvers available in Optim, and they are all listed below. Notice that the constructors are written without input here, but they generally take keywords to tweak the way they work. See the pages describing each solver for more detail.\n\n\nRequires only a function handle:\n\n\n\n\nNelderMead()\n\n\nSimulatedAnnealing()\n\n\n\n\nRequires a function and gradient (will be approximated if omitted):\n\n\n\n\nBFGS()\n\n\nLBFGS()\n\n\nConjugateGradient()\n\n\nGradientDescent()\n\n\nMomentumGradientDescent()\n\n\nAcceleratedGradientDescent()\n\n\n\n\nRequires a function, a gradient, and a Hessian (cannot be omitted):\n\n\n\n\nNewton()\n\n\nNewtonTrustRegion()\n\n\n\n\nBox constrained minimization:\n\n\n\n\nFminbox()\n\n\n\n\nSpecial methods for bounded univariate optimization:\n\n\n\n\nBrent()\n\n\nGoldenSection()\n\n\n\n\n\n\nGeneral Options\n\n\nIn addition to the solver, you can alter the behavior of the Optim package by using the following keywords:\n\n\n\n\nx_tol\n: What is the threshold for determining convergence in the input vector? Defaults to \n1e-32\n.\n\n\nf_tol\n: What is the threshold for determining convergence in the objective value? Defaults to \n1e-32\n.\n\n\ng_tol\n: What is the threshold for determining convergence in the gradient? Defaults to \n1e-8\n. For gradient free methods, this will control the main convergence tolerance, which is solver specific.\n\n\nf_calls_limit\n: A soft upper limit on the number of objective calls. Defaults to \n0\n (unlimited).\n\n\ng_calls_limit\n: A soft upper limit on the number of gradient calls. Defaults to \n0\n (unlimited).\n\n\nh_calls_limit\n: A soft upper limit on the number of Hessian calls. 
Defaults to \n0\n (unlimited).\n\n\nallow_f_increases\n: Allow steps that increase the objective value. Defaults to \nfalse\n. Note that, when setting this to \ntrue\n, the last iterate will be returned as the minimizer even if the objective increased.\n\n\niterations\n: How many iterations will run before the algorithm gives up? Defaults to \n1_000\n.\n\n\nstore_trace\n: Should a trace of the optimization algorithm's state be stored? Defaults to \nfalse\n.\n\n\nshow_trace\n: Should a trace of the optimization algorithm's state be shown on \nSTDOUT\n? Defaults to \nfalse\n.\n\n\nextended_trace\n: Save additional information. Solver dependent. Defaults to \nfalse\n.\n\n\nshow_every\n: Trace output is printed every \nshow_every\nth iteration.\n\n\ncallback\n: A function to be called during tracing. A return value of \ntrue\n stops the \noptimize\n call.\n\n\ntime_limit\n: A soft upper limit on the total run time. Defaults to \nNaN\n (unlimited).\n\n\n\n\nWe currently recommend the statically dispatched interface by using the \nOptim.Options\n constructor:\n\n\nres\n \n=\n \noptimize\n(\nf\n,\n \ng!\n,\n\n \n[\n0.0\n,\n \n0.0\n],\n\n \nGradientDescent\n(),\n\n \nOptim\n.\nOptions\n(\ng_tol\n \n=\n \n1e-12\n,\n\n \niterations\n \n=\n \n10\n,\n\n \nstore_trace\n \n=\n \ntrue\n,\n\n \nshow_trace\n \n=\n \nfalse\n))\n\n\n\n\n\n\nAnother interface is also available, based directly on keywords:\n\n\nres\n \n=\n \noptimize\n(\nf\n,\n \ng!\n,\n\n \n[\n0.0\n,\n \n0.0\n],\n\n \nmethod\n \n=\n \nGradientDescent\n(),\n\n \ng_tol\n \n=\n \n1e-12\n,\n\n \niterations\n \n=\n \n10\n,\n\n \nstore_trace\n \n=\n \ntrue\n,\n\n \nshow_trace\n \n=\n \nfalse\n)\n\n\n\n\n\n\nNotice the need to specify the method using a keyword if this syntax is used. This approach might be deprecated in the future, and as a result we recommend writing code that has to be maintained using the \nOptim.Options\n approach.",
"title": "Configurable Options"
},
{
"location": "/user/config/#configurable-options",
"text": "There are several options that simply take on some default values if the user doensn't supply anything else than a function (and gradient) and a starting point.",
"title": "Configurable options"
},
{
"location": "/user/config/#solver-options",
"text": "There quite a few different solvers available in Optim, and they are all listed below. Notice that the constructors are written without input here, but they generally take keywords to tweak the way they work. See the pages describing each solver for more detail. Requires only a function handle: NelderMead() SimulatedAnnealing() Requires a function and gradient (will be approximated if omitted): BFGS() LBFGS() ConjugateGradient() GradientDescent() MomentumGradientDescent() AcceleratedGradientDescent() Requires a function, a gradient, and a Hessian (cannot be omitted): Newton() NewtonTrustRegion() Box constrained minimization: Fminbox() Special methods for bounded univariate optimization: Brent() GoldenSection()",
"title": "Solver options"
},
{
"location": "/user/config/#general-options",
"text": "In addition to the solver, you can alter the behavior of the Optim package by using the following keywords: x_tol : What is the threshold for determining convergence in the input vector? Defaults to 1e-32 . f_tol : What is the threshold for determining convergence in the objective value? Defaults to 1e-32 . g_tol : What is the threshold for determining convergence in the gradient? Defaults to 1e-8 . For gradient free methods, this will control the main convergence tolerance, which is solver specific. f_calls_limit : A soft upper limit on the number of objective calls. Defaults to 0 (unlimited). g_calls_limit : A soft upper limit on the number of gradient calls. Defaults to 0 (unlimited). h_calls_limit : A soft upper limit on the number of Hessian calls. Defaults to 0 (unlimited). allow_f_increases : Allow steps that increase the objective value. Defaults to false . Note that, when setting this to true , the last iterate will be returned as the minimizer even if the objective increased. iterations : How many iterations will run before the algorithm gives up? Defaults to 1_000 . store_trace : Should a trace of the optimization algorithm's state be stored? Defaults to false . show_trace : Should a trace of the optimization algorithm's state be shown on STDOUT ? Defaults to false . extended_trace : Save additional information. Solver dependent. Defaults to false . show_every : Trace output is printed every show_every th iteration. callback : A function to be called during tracing. A return value of true stops the optimize call. time_limit : A soft upper limit on the total run time. Defaults to NaN (unlimited). We currently recommend the statically dispatched interface by using the Optim.Options constructor: res = optimize ( f , g! , \n [ 0.0 , 0.0 ], \n GradientDescent (), \n Optim . 
Options ( g_tol = 1e-12 , \n iterations = 10 , \n store_trace = true , \n show_trace = false )) Another interface is also available, based directly on keywords: res = optimize ( f , g! , \n [ 0.0 , 0.0 ], \n method = GradientDescent (), \n g_tol = 1e-12 , \n iterations = 10 , \n store_trace = true , \n show_trace = false ) Notice the need to specify the method using a keyword if this syntax is used. This approach might be deprecated in the future, and as a result we recommend writing code that has to be maintained using the Optim.Options approach.",
"title": "General Options"
},
{
"location": "/user/tipsandtricks/",
"text": "Dealing with constant parameters\n\n\nIn many applications, there may be factors that are relevant to the function evaluations, but are fixed throughout the optimization. An obvious example is using data in a likelihood function, but it could also be parameters we wish to hold constant.\n\n\nConsider a squared error loss function that depends on some data \nx\n and \ny\n, and parameters \nbetas\n. As far as the solver is concerned, there should only be one input argument to the function we want to minimize, call it \nsqerror\n.\n\n\nThe problem is that we want to optimize a function \nsqerror\n that really depends on three inputs, and two of them are constant throught the optimization procedure. To do this, we need to define the variables \nx\n and \ny\n\n\nx\n \n=\n \n[\n1.0\n,\n \n2.0\n,\n \n3.0\n]\n\n\ny\n \n=\n \n1.0\n \n+\n \n2.0\n \n*\n \nx\n \n+\n \n[\n-\n0.3\n,\n \n0.3\n,\n \n-\n0.1\n]\n\n\n\n\n\n\nWe then simply define a function in three variables\n\n\nfunction\n \nsqerror\n(\nbetas\n,\n \nX\n,\n \nY\n)\n\n \nerr\n \n=\n \n0.0\n\n \nfor\n \ni\n \nin\n \n1\n:\nlength\n(\nX\n)\n\n \npred_i\n \n=\n \nbetas\n[\n1\n]\n \n+\n \nbetas\n[\n2\n]\n \n*\n \nX\n[\ni\n]\n\n \nerr\n \n+=\n \n(\nY\n[\ni\n]\n \n-\n \npred_i\n)\n^\n2\n\n \nend\n\n \nreturn\n \nerr\n\n\nend\n\n\n\n\n\n\nand then optimize the following anonymous function\n\n\nres\n \n=\n \noptimize\n(\nb\n \n-\n \nsqerror\n(\nb\n,\n \nx\n,\n \ny\n),\n \n[\n0.0\n,\n \n0.0\n])\n\n\n\n\n\n\nAlternatively, we can define a closure \nsqerror(betas)\n that is aware of the variables we just defined\n\n\nfunction\n \nsqerror\n(\nbetas\n)\n\n \nerr\n \n=\n \n0.0\n\n \nfor\n \ni\n \nin\n \n1\n:\nlength\n(\nx\n)\n\n \npred_i\n \n=\n \nbetas\n[\n1\n]\n \n+\n \nbetas\n[\n2\n]\n \n*\n \nx\n[\ni\n]\n\n \nerr\n \n+=\n \n(\ny\n[\ni\n]\n \n-\n \npred_i\n)\n^\n2\n\n \nend\n\n \nreturn\n \nerr\n\n\nend\n\n\n\n\n\n\nWe can then optimize the \nsqerror\n function just like any other function\n\n\nres\n \n=\n 
\noptimize\n(\nsqerror\n,\n \n[\n0.0\n,\n \n0.0\n])\n\n\n\n\n\n\n\n\nAvoid repeating computations\n\n\nSay you are optimizing a function\n\n\nf\n(\nx\n)\n \n=\n \nx\n[\n1\n]\n^\n2\n+\nx\n[\n2\n]\n^\n2\n\n\ng!\n(\nstorage\n,\n \nx\n)\n \n=\n \ncopy!\n(\nstorage\n,\n \n[\n2\nx\n[\n1\n],\n \n2\nx\n[\n2\n]])\n\n\n\n\n\n\nIn this situation, no calculations from \nf\n could be reused in \ng!\n. However, sometimes there is a substantial similarity between the objective function and gradient, and some calculations can be reused. The trick here is essentially the same as above. We use a closure or an anonymous function. Basically, we define\n\n\nfunction\n \ncalculate_common!\n(\nx\n,\n \nlast_x\n,\n \nbuffer\n)\n\n \nif\n \nx\n \n!=\n \nlast_x\n\n \ncopy!\n(\nlast_x\n,\n \nx\n)\n\n \n#do whatever common calculations and save to buffer\n\n \nend\n\n\nend\n\n\n\nfunction\n \nf\n(\nx\n,\n \nbuffer\n,\n \nlast_x\n)\n\n \ncalculate_common!\n(\nx\n,\n \nlast_x\n,\n \nbuffer\n)\n\n \nf_body\n \n# depends on buffer\n\n\nend\n\n\n\nfunction\n \ng!\n(\nx\n,\n \nstor\n,\n \nbuffer\n,\n \nlast_x\n)\n\n \ncalculate_common!\n(\nx\n,\n \nlast_x\n,\n \nbuffer\n)\n\n \ng_body!\n \n# depends on buffer\n\n\nend\n\n\n\n\n\n\nand then the following\n\n\nusing\n \nOptim\n\n\ninitial_x\n \n=\n \n...\n\n\nbuffer\n \n=\n \nArray\n{\neltype\n(\ninitial_x\n)}(\n...\n)\n \n# Preallocate an appropriate buffer\n\n\nlast_x\n \n=\n \nsimilar\n(\ninitial_x\n)\n\n\ndf\n \n=\n \nTwiceDifferentiable\n(\nx\n \n->\n \nf\n(\nx\n,\n \nbuffer\n,\n \nlast_x\n),\n\n \n(\nstor\n,\n \nx\n)\n \n->\n \ng!\n(\nx\n,\n \nstor\n,\n \nbuffer\n,\n \nlast_x\n))\n\n\noptimize\n(\ndf\n,\n \ninitial_x\n)\n\n\n\n\n\n\n\n\nProvide gradients\n\n\nAs mentioned in the general introduction, passing analytical gradients can have an impact on performance. 
To show an example of this, consider the separable extension of the Rosenbrock function in dimension 5000, see \nSROSENBR\n in CUTEst.\n\n\nBelow, we use the gradients and objective functions from \nmastsif\n through \nCUTEst.jl\n. We only show the first five iterations of an attempt to minimize the function using Gradient Descent.\n\n\njulia>\n \n@time\n \noptimize\n(\nf\n,\n \ninitial_x\n,\n \nGradientDescent\n(),\n\n \nOptim\n.\nOptions\n(\nshow_trace\n=\ntrue\n,\n \niterations\n \n=\n \n5\n))\n\n\nIter Function value Gradient norm\n\n\n 0 4.850000e+04 2.116000e+02\n\n\n 1 1.018734e+03 2.704951e+01\n\n\n 2 3.468449e+00 5.721261e-01\n\n\n 3 2.966899e+00 2.638790e-02\n\n\n 4 2.511859e+00 5.237768e-01\n\n\n 5 2.107853e+00 1.020287e-01\n\n\n 21.731129 seconds (1.61 M allocations: 63.434 MB, 0.03% gc time)\n\n\nResults of Optimization Algorithm\n\n\n * Algorithm: Gradient Descent\n\n\n * Starting Point: [1.2,1.0, ...]\n\n\n * Minimizer: [1.0287767703731154,1.058769439356144, ...]\n\n\n * Minimum: 2.107853e+00\n\n\n * Iterations: 5\n\n\n * Convergence: false\n\n\n * |x - x'| < 1.0e-32: false\n\n\n * |f(x) - f(x')| / |f(x)| < 1.0e-32: false\n\n\n * |g(x)| < 1.0e-08: false\n\n\n * Reached Maximum Number of Iterations: true\n\n\n * Objective Function Calls: 23\n\n\n * Gradient Calls: 23\n\n\n\njulia>\n \n@time\n \noptimize\n(\nf\n,\n \ng!\n,\n \ninitial_x\n,\n \nGradientDescent\n(),\n\n \nOptim\n.\nOptions\n(\nshow_trace\n=\ntrue\n,\n \niterations\n \n=\n \n5\n))\n\n\nIter Function value Gradient norm\n\n\n 0 4.850000e+04 2.116000e+02\n\n\n 1 1.018769e+03 2.704998e+01\n\n\n 2 3.468488e+00 5.721481e-01\n\n\n 3 2.966900e+00 2.638792e-02\n\n\n 4 2.511828e+00 5.237919e-01\n\n\n 5 2.107802e+00 1.020415e-01\n\n\n 0.009889 seconds (915 allocations: 270.266 KB)\n\n\nResults of Optimization Algorithm\n\n\n * Algorithm: Gradient Descent\n\n\n * Starting Point: [1.2,1.0, ...]\n\n\n * Minimizer: [1.0287763814102757,1.05876866832087, ...]\n\n\n * Minimum: 2.107802e+00\n\n\n * 
Iterations: 5\n\n\n * Convergence: false\n\n\n * |x - x'| < 1.0e-32: false\n\n\n * |f(x) - f(x')| / |f(x)| < 1.0e-32: false\n\n\n * |g(x)| < 1.0e-08: false\n\n\n * Reached Maximum Number of Iterations: true\n\n\n * Objective Function Calls: 23\n\n\n * Gradient Calls: 23\n\n\n\n\n\n\nThe objective reaches very similar values in the two runs, but the run with the analytical gradient is much faster. It is possible that the finite differences code can be improved, but generally the optimization will be slowed down by all the function evaluations required to do the central finite differences calculations.\n\n\n\n\nSeparating time spent in Optim's code and user provided functions\n\n\nConsider the Rosenbrock problem.\n\n\nusing\n \nOptim\n\n\nprob\n \n=\n \nOptim\n.\nUnconstrainedProblems\n.\nexamples\n[\n\"Rosenbrock\"\n];\n\n\n\n\n\n\nSay we optimize this function using the Newton Trust Region method, look at the total run time of \noptimize\n, and are surprised that it takes a long time to run. We then wonder if time is spent in Optim's own code (solving the sub-problem for example) or in evaluating the objective, gradient or Hessian that we provided. Then it can be very useful to use the \nTimerOutputs.jl\n package. This package allows us to run an overall timer for \noptimize\n, and add individual timers for \nf\n, \ng!\n, and \nh!\n. 
Consider the example below, which is due to the author of the package (Kristoffer Carlsson).\n\n\nusing\n \nTimerOutputs\n\n\nconst\n \nto\n \n=\n \nTimerOutput\n()\n\n\n\nf\n(\nx\n \n)\n \n=\n \n@timeit\n \nto\n \n\"f\"\n \nprob\n.\nf\n(\nx\n)\n\n\ng!\n(\nx\n,\n \ng\n)\n \n=\n \n@timeit\n \nto\n \n\"g!\"\n \nprob\n.\ng!\n(\nx\n,\n \ng\n)\n\n\nh!\n(\nx\n,\n \nh\n)\n \n=\n \n@timeit\n \nto\n \n\"h!\"\n \nprob\n.\nh!\n(\nx\n,\n \nh\n)\n\n\n\nbegin\n\n\nreset_timer!\n(\nto\n)\n\n\n@timeit\n \nto\n \n\"Trust Region\"\n \nbegin\n\n \nres\n \n=\n \nOptim\n.\noptimize\n(\nf\n,\n \ng!\n,\n \nh!\n,\n \nprob\n.\ninitial_x\n,\n \nNewtonTrustRegion\n())\n\n\nend\n\n\nshow\n(\nto\n;\n \nallocations\n \n=\n \nfalse\n)\n\n\nend\n\n\n\n\n\n\nWe see that the time is actually \nnot\n spent in our provided functions, but most of the time is spent in the code for the trust region method.\n\n\n\n\nEarly stopping\n\n\nSometimes it might be of interest to stop the optimizer early. The simplest way to do this is to set the \niterations\n keyword in \nOptim.Options\n to some number. This caps the iteration counter at that number; the default is 1000. Alternatively, it is possible to put a soft limit on the run time of the optimization procedure by setting the \ntime_limit\n keyword in the \nOptim.Options\n constructor.\n\n\nusing\n \nOptim\n\n\nproblem\n \n=\n \nOptim\n.\nUnconstrainedProblems\n.\nexamples\n[\n\"Rosenbrock\"\n]\n\n\n\nf\n \n=\n \nproblem\n.\nf\n\n\ninitial_x\n \n=\n \nproblem\n.\ninitial_x\n\n\n\nfunction\n \nslow\n(\nx\n)\n\n \nsleep\n(\n0.1\n)\n\n \nf\n(\nx\n)\n\n\nend\n\n\n\nstart_time\n \n=\n \ntime\n()\n\n\n\noptimize\n(\nslow\n,\n \nzeros\n(\n2\n),\n \nNelderMead\n(),\n \nOptim\n.\nOptions\n(\ntime_limit\n \n=\n \n3.0\n))\n\n\n\n\n\n\nThis will stop after about three seconds. 
If it is more important that we stop before the limit is reached, it is possible to use a callback with a simple model for predicting how much time will have passed when the next iteration is over. Consider the following code\n\n\nusing\n \nOptim\n\n\nproblem\n \n=\n \nOptim\n.\nUnconstrainedProblems\n.\nexamples\n[\n\"Rosenbrock\"\n]\n\n\n\nf\n \n=\n \nproblem\n.\nf\n\n\ninitial_x\n \n=\n \nproblem\n.\ninitial_x\n\n\n\nfunction\n \nvery_slow\n(\nx\n)\n\n \nsleep\n(\n.\n5\n)\n\n \nf\n(\nx\n)\n\n\nend\n\n\n\nstart_time\n \n=\n \ntime\n()\n\n\ntime_to_setup\n \n=\n \nzeros\n(\n1\n)\n\n\nfunction\n \nadvanced_time_control\n(\nx\n)\n\n \nprintln\n(\n\" * Iteration: \"\n,\n \nx\n.\niteration\n)\n\n \nso_far\n \n=\n \ntime\n()\n-\nstart_time\n\n \nprintln\n(\n\" * Time so far: \"\n,\n \nso_far\n)\n\n \nif\n \nx\n.\niteration\n \n==\n \n0\n\n \ntime_to_setup\n[\n1\n]\n \n=\n \ntime\n()\n-\nstart_time\n\n \nelse\n\n \nexpected_next_time\n \n=\n \nso_far\n \n+\n \n(\ntime\n()\n-\nstart_time\n-\ntime_to_setup\n[\n1\n])\n/\n(\nx\n.\niteration\n)\n\n \nprintln\n(\n\" * Next iteration \u2248 \"\n,\n \nexpected_next_time\n)\n\n \nprintln\n()\n\n \nreturn\n \nexpected_next_time\n \n<\n \n13\n \n?\n \nfalse\n \n:\n \ntrue\n\n \nend\n\n \nprintln\n()\n\n \nfalse\n\n\nend\n\n\noptimize\n(\nvery_slow\n,\n \nzeros\n(\n2\n),\n \nNelderMead\n(),\n \nOptim\n.\nOptions\n(\ncallback\n \n=\n \nadvanced_time_control\n))\n\n\n\n\n\n\nIt will try to predict the elapsed time after the next iteration is over, and stop now if it is expected to exceed the limit of 13 seconds. 
Running it, we get something like the following output\n\n\njulia>\n \noptimize\n(\nvery_slow\n,\n \nzeros\n(\n2\n),\n \nNelderMead\n(),\n \nOptim\n.\nOptions\n(\ncallback\n \n=\n \nadvanced_time_control\n))\n\n\n * Iteration: 0\n\n\n * Time so far: 2.219298839569092\n\n\n\n * Iteration: 1\n\n\n * Time so far: 3.4006409645080566\n\n\n * Next iteration \u2248 4.5429909229278564\n\n\n\n * Iteration: 2\n\n\n * Time so far: 4.403923988342285\n\n\n * Next iteration \u2248 5.476739525794983\n\n\n\n * Iteration: 3\n\n\n * Time so far: 5.407265901565552\n\n\n * Next iteration \u2248 6.4569235642751055\n\n\n\n * Iteration: 4\n\n\n * Time so far: 5.909044027328491\n\n\n * Next iteration \u2248 6.821732044219971\n\n\n\n * Iteration: 5\n\n\n * Time so far: 6.912338972091675\n\n\n * Next iteration \u2248 7.843148183822632\n\n\n\n * Iteration: 6\n\n\n * Time so far: 7.9156060218811035\n\n\n * Next iteration \u2248 8.85849153995514\n\n\n\n * Iteration: 7\n\n\n * Time so far: 8.918903827667236\n\n\n * Next iteration \u2248 9.870419979095459\n\n\n\n * Iteration: 8\n\n\n * Time so far: 9.922197818756104\n\n\n * Next iteration \u2248 10.880185931921005\n\n\n\n * Iteration: 9\n\n\n * Time so far: 10.925468921661377\n\n\n * Next iteration \u2248 11.888488478130764\n\n\n\n * Iteration: 10\n\n\n * Time so far: 11.92870283126831\n\n\n * Next iteration \u2248 12.895747828483582\n\n\n\n * Iteration: 11\n\n\n * Time so far: 12.932114839553833\n\n\n * Next iteration \u2248 13.902462200684981\n\n\n\nResults of Optimization Algorithm\n\n\n * Algorithm: Nelder-Mead\n\n\n * Starting Point: [0.0,0.0]\n\n\n * Minimizer: [0.23359374999999996,0.042187499999999996, ...]\n\n\n * Minimum: 6.291677e-01\n\n\n * Iterations: 11\n\n\n * Convergence: false\n\n\n * \u221a(\u03a3(y\u1d62-y\u0304)\u00b2)/n < 1.0e-08: false\n\n\n * Reached Maximum Number of Iterations: false\n\n\n * Objective Function Calls: 24",
"title": "Tips and tricks"
},
{
"location": "/user/tipsandtricks/#dealing-with-constant-parameters",
"text": "In many applications, there may be factors that are relevant to the function evaluations, but are fixed throughout the optimization. An obvious example is using data in a likelihood function, but it could also be parameters we wish to hold constant. Consider a squared error loss function that depends on some data x and y , and parameters betas . As far as the solver is concerned, there should only be one input argument to the function we want to minimize, call it sqerror . The problem is that we want to optimize a function sqerror that really depends on three inputs, and two of them are constant throught the optimization procedure. To do this, we need to define the variables x and y x = [ 1.0 , 2.0 , 3.0 ] y = 1.0 + 2.0 * x + [ - 0.3 , 0.3 , - 0.1 ] We then simply define a function in three variables function sqerror ( betas , X , Y ) \n err = 0.0 \n for i in 1 : length ( X ) \n pred_i = betas [ 1 ] + betas [ 2 ] * X [ i ] \n err += ( Y [ i ] - pred_i ) ^ 2 \n end \n return err end and then optimize the following anonymous function res = optimize ( b - sqerror ( b , x , y ), [ 0.0 , 0.0 ]) Alternatively, we can define a closure sqerror(betas) that is aware of the variables we just defined function sqerror ( betas ) \n err = 0.0 \n for i in 1 : length ( x ) \n pred_i = betas [ 1 ] + betas [ 2 ] * x [ i ] \n err += ( y [ i ] - pred_i ) ^ 2 \n end \n return err end We can then optimize the sqerror function just like any other function res = optimize ( sqerror , [ 0.0 , 0.0 ])",
"title": "Dealing with constant parameters"
},
{
"location": "/user/tipsandtricks/#avoid-repeating-computations",
"text": "Say you are optimizing a function f ( x ) = x [ 1 ] ^ 2 + x [ 2 ] ^ 2 g! ( storage , x ) = copy! ( storage , [ 2 x [ 1 ], 2 x [ 2 ]]) In this situation, no calculations from f could be reused in g! . However, sometimes there is a substantial similarity between the objective function, and gradient, and some calculations can be reused. The trick here is essentially the same as above. We use a closure or an anonymous function. Basically, we define function calculate_common! ( x , last_x , buffer ) \n if x != last_x \n copy! ( last_x , x ) \n #do whatever common calculations and save to buffer \n end end function f ( x , buffer , last_x ) \n calculate_common! ( x , last_x , buffer ) \n f_body # depends on buffer end function g! ( x , stor , buffer , last_x ) \n calculate_common! ( x , last_x , buffer ) \n g_body! # depends on buffer end and then the following using Optim initial_x = ... buffer = Array { eltype ( initial_x )}( ... ) # Preallocate an appropriate buffer last_x = similar ( initial_x ) df = TwiceDifferentiable ( x - f ( x , buffer , initial_x ), \n ( stor , x ) - g! ( x , stor , buffer , last_x )) optimize ( df , initial_x )",
"title": "Avoid repeating computations"
},
{
"location": "/user/tipsandtricks/#provide-gradients",
"text": "As mentioned in the general introduction, passing analytical gradients can have an impact on performance. To show an example of this, consider the separable extension of the Rosenbrock function in dimension 5000, see SROSENBR in CUTEst. Below, we use the gradients and objective functions from mastsif through CUTEst.jl . We only show the first five iterations of an attempt to minimize the function using Gradient Descent. julia @time optimize ( f , initial_x , GradientDescent (), \n Optim . Options ( show_trace = true , iterations = 5 )) Iter Function value Gradient norm 0 4.850000e+04 2.116000e+02 1 1.018734e+03 2.704951e+01 2 3.468449e+00 5.721261e-01 3 2.966899e+00 2.638790e-02 4 2.511859e+00 5.237768e-01 5 2.107853e+00 1.020287e-01 21.731129 seconds (1.61 M allocations: 63.434 MB, 0.03% gc time) Results of Optimization Algorithm * Algorithm: Gradient Descent * Starting Point: [1.2,1.0, ...] * Minimizer: [1.0287767703731154,1.058769439356144, ...] * Minimum: 2.107853e+00 * Iterations: 5 * Convergence: false * |x - x | 1.0e-32: false * |f(x) - f(x )| / |f(x)| 1.0e-32: false * |g(x)| 1.0e-08: false * Reached Maximum Number of Iterations: true * Objective Function Calls: 23 * Gradient Calls: 23 julia @time optimize ( f , g! , initial_x , GradientDescent (), \n Optim . Options ( show_trace = true , iterations = 5 )) Iter Function value Gradient norm 0 4.850000e+04 2.116000e+02 1 1.018769e+03 2.704998e+01 2 3.468488e+00 5.721481e-01 3 2.966900e+00 2.638792e-02 4 2.511828e+00 5.237919e-01 5 2.107802e+00 1.020415e-01 0.009889 seconds (915 allocations: 270.266 KB) Results of Optimization Algorithm * Algorithm: Gradient Descent * Starting Point: [1.2,1.0, ...] * Minimizer: [1.0287763814102757,1.05876866832087, ...] 
* Minimum: 2.107802e+00 * Iterations: 5 * Convergence: false * |x - x | 1.0e-32: false * |f(x) - f(x )| / |f(x)| 1.0e-32: false * |g(x)| 1.0e-08: false * Reached Maximum Number of Iterations: true * Objective Function Calls: 23 * Gradient Calls: 23 The objective has obtained a value that is very similar between the two runs, but the run with the analytical gradient is way faster. It is possible that the finite differences code can be improved, but generally the optimization will be slowed down by all the function evaluations required to do the central finite differences calculations.",
"title": "Provide gradients"
},
{
"location": "/user/tipsandtricks/#separating-time-spent-in-optims-code-and-user-provided-functions",
"text": "Consider the Rosenbrock problem. using Optim prob = Optim . UnconstrainedProblems . examples [ Rosenbrock ]; Say we optimize this function, and look at the total run time of optimize using the Newton Trust Region method, and we are surprised that it takes a long time to run. We then wonder if time is spent in Optim's own code (solving the sub-problem for example) or in evaluating the objective, gradient or hessian that we provided. Then it can be very useful to use the TimerOutputs.jl package. This package allows us to run an over-all timer for optimize , and add individual timers for f , g! , and h! . Consider the example below, that is due to the author of the package (Kristoffer Carlsson). using TimerOutputs const to = TimerOutput () f ( x ) = @timeit to f prob . f ( x ) g! ( x , g ) = @timeit to g! prob . g! ( x , g ) h! ( x , h ) = @timeit to h! prob . h! ( x , h ) begin reset_timer! ( to ) @timeit to Trust Region begin \n res = Optim . optimize ( f , g! , h! , prob . initial_x , NewtonTrustRegion ()) end show ( to ; allocations = false ) end We see that the time is actually not spent in our provided functions, but most of the time is spent in the code for the trust region method.",
"title": "Separating time spent in Optim's code and user provided functions"
},
{
"location": "/user/tipsandtricks/#early-stopping",
"text": "Sometimes it might be of interest to stop the optimizer early. The simplest way to do this is to set the iterations keyword in Optim.Options to some number. This will prevent the iteration counter exceeding some limit, with the standard value being 1000. Alternatively, it is possible to put a soft limit on the run time of the optimization procedure by setting the time_limit keyword in the Optim.Options constructor. using Optim problem = Optim . UnconstrainedProblems . examples [ Rosenbrock ] f = problem . f initial_x = problem . initial_x function slow ( x ) \n sleep ( 0.1 ) \n f ( x ) end start_time = time () optimize ( slow , zeros ( 2 ), NelderMead (), Optim . Options ( time_limit = 3.0 )) This will stop after about three seconds. If it is more important that we stop before the limit is reached, it is possible to use a callback with a simple model for predicting how much time will have passed when the next iteration is over. Consider the following code using Optim problem = Optim . UnconstrainedProblems . examples [ Rosenbrock ] f = problem . f initial_x = problem . initial_x function very_slow ( x ) \n sleep ( . 5 ) \n f ( x ) end start_time = time () time_to_setup = zeros ( 1 ) function advanced_time_control ( x ) \n println ( * Iteration: , x . iteration ) \n so_far = time () - start_time \n println ( * Time so far: , so_far ) \n if x . iteration == 0 \n time_to_setup [ : ] = time () - start_time \n else \n expected_next_time = so_far + ( time () - start_time - time_to_setup [ 1 ]) / ( x . iteration ) \n println ( * Next iteration \u2248 , expected_next_time ) \n println () \n return expected_next_time 13 ? false : true \n end \n println () \n false end optimize ( very_slow , zeros ( 2 ), NelderMead (), Optim . Options ( callback = advanced_time_control )) It will try to predict the elapsed time after the next iteration is over, and stop now if it is expected to exceed the limit of 13 seconds. 
Running it, we get something like the following output julia> optimize ( very_slow , zeros ( 2 ), NelderMead (), Optim . Options ( callback = advanced_time_control )) * Iteration: 0 * Time so far: 2.219298839569092 * Iteration: 1 * Time so far: 3.4006409645080566 * Next iteration \u2248 4.5429909229278564 * Iteration: 2 * Time so far: 4.403923988342285 * Next iteration \u2248 5.476739525794983 * Iteration: 3 * Time so far: 5.407265901565552 * Next iteration \u2248 6.4569235642751055 * Iteration: 4 * Time so far: 5.909044027328491 * Next iteration \u2248 6.821732044219971 * Iteration: 5 * Time so far: 6.912338972091675 * Next iteration \u2248 7.843148183822632 * Iteration: 6 * Time so far: 7.9156060218811035 * Next iteration \u2248 8.85849153995514 * Iteration: 7 * Time so far: 8.918903827667236 * Next iteration \u2248 9.870419979095459 * Iteration: 8 * Time so far: 9.922197818756104 * Next iteration \u2248 10.880185931921005 * Iteration: 9 * Time so far: 10.925468921661377 * Next iteration \u2248 11.888488478130764 * Iteration: 10 * Time so far: 11.92870283126831 * Next iteration \u2248 12.895747828483582 * Iteration: 11 * Time so far: 12.932114839553833 * Next iteration \u2248 13.902462200684981 Results of Optimization Algorithm * Algorithm: Nelder-Mead * Starting Point: [0.0,0.0] * Minimizer: [0.23359374999999996,0.042187499999999996, ...] * Minimum: 6.291677e-01 * Iterations: 11 * Convergence: false * \u221a(\u03a3(y\u1d62-y\u0304)\u00b2)/n < 1.0e-08: false * Reached Maximum Number of Iterations: false * Objective Function Calls: 24",
"title": "Early stopping"
},
{
"location": "/algo/nelder_mead/",
"text": "Nelder-Mead\n\n\nNelder-Mead is currently the standard algorithm when no derivatives are provided.\n\n\n\n\nConstructor\n\n\nNelderMead\n(;\n \nparameters\n \n=\n \nAdaptiveParameters\n(),\n\n \ninitial_simplex\n \n=\n \nAffineSimplexer\n())\n\n\n\n\n\n\nThe keywords in the constructor are used to control the following parts of the solver:\n\n\n\n\nparameters\n is a an instance of either \nAdaptiveParameters\n or \nFixedParameters\n, and is\n\n\n\n\nused to generate parameters for the Nelder-Mead Algorithm.\n\n\n\n\ninitial_simplex\n is an instance of \nAffineSimplexer\n. See more\n\n\n\n\ndetails below.\n\n\n\n\nDescription\n\n\nOur current implementation of the Nelder-Mead algorithm is based on Nelder and Mead (1965) and Gao and Han (2010). Gradient free methods can be a bit sensitive to starting values and tuning parameters, so it is a good idea to be careful with the defaults provided in Optim.\n\n\nInstead of using gradient information, Nelder-Mead is a direct search method. It keeps track of the function value at a number of points in the search space. Together, the points form a simplex. Given a simplex, we can perform one of four actions: reflect, expand, contract, or shrink. Basically, the goal is to iteratively replace the worst point with a better point. More information can be found in Nelder and Mead (1965), Lagarias, et al (1998) or Gao and Han (2010).\n\n\nThe stopping rule is the same as in the original paper, and is the standard error of the function values at the vertices. To set the tolerance level for this convergence criterion, set the \ng_tol\n level as described in the Configurable Options section.\n\n\nWhen the solver finishes, we return a minimizer which is either the centroid or one of the vertices. The function value at the centroid adds a function evaluation, as we need to evaluate the objection at the centroid to choose the smallest function value. 
However, even if the function value at the centroid can be returned as the minimum, we do not trace it during the optimization iterations. This is to avoid too many evaluations of the objective function which can be computationally expensive. Typically, there should be no more than twice as many \nf_calls\n as \niterations\n. Adding an evaluation at the centroid when tracing could considerably increase the total run-time of the algorithm.\n\n\n\n\nSpecifying the initial simplex\n\n\nThe default choice of \ninitial_simplex\n is \nAffineSimplexer()\n. A simplex is represented by an $(n+1)$-dimensional vector of $n$-dimensional vectors. It is used together with the initial \nx\n to create the initial simplex. To construct the $i$th vertex, it simply multiplies entry $i$ in the initial vector with a constant \nb\n, and adds a constant \na\n. This means that the $i$th of the $n$ additional vertices is of the form\n\n\n\n\n\n(x_0^1, x_0^2, \\ldots, x_0^i, \\ldots, 0,0) + (0, 0, \\ldots, x_0^i\\cdot b+a,\\ldots, 0,0)\n\n\n\n\n\nIf an $x_0^i$ is zero, we need the $a$ to make sure all vertices are unique. Generally, it is advised to start with a relatively large simplex.\n\n\nIf a specific simplex is wanted, it is possible to construct the $(n+1)$-vector of $n$-dimensional vectors, and pass it to the solver using a new type definition and a new method for the function \nsimplexer\n. 
For example, let us minimize the two-dimensional Rosenbrock function, and choose three vertices that have elements that are simply standard uniform draws.\n\n\nusing\n \nOptim\n\n\nstruct\n \nMySimplexer\n \n<:\n \nOptim\n.\nSimplexer\n \nend\n\n\nOptim\n.\nsimplexer\n(\nS\n::\nMySimplexer\n,\n \ninitial_x\n)\n \n=\n \n[\nrand\n(\nlength\n(\ninitial_x\n))\n \nfor\n \ni\n \n=\n \n1\n:\nlength\n(\ninitial_x\n)\n+\n1\n]\n\n\nf\n(\nx\n)\n \n=\n \n(\n1.0\n \n-\n \nx\n[\n1\n])\n^\n2\n \n+\n \n100.0\n \n*\n \n(\nx\n[\n2\n]\n \n-\n \nx\n[\n1\n]\n^\n2\n)\n^\n2\n\n\noptimize\n(\nf\n,\n \n[\n.\n0\n,\n \n.\n0\n],\n \nNelderMead\n(\ninitial_simplex\n \n=\n \nMySimplexer\n()))\n\n\n\n\n\n\nSay we want to implement the initial simplex as in Matlab's \nfminsearch\n. This is very close to the \nAffineSimplexer\n above, but with a small twist. Instead of always adding the \na\n, a constant is only added to entries that are zero. If the entry is non-zero, five percent of the level is added. 
This might be implemented (by the user) as\n\n\nstruct\n \nMatlabSimplexer\n \n<:\n \nOptim\n.\nSimplexer\n\n \na\n::\nFloat64\n\n \nb\n::\nFloat64\n\n\nend\n\n\nMatlabSimplexer\n(;\na\n \n=\n \n0.00025\n,\n \nb\n \n=\n \n0.05\n)\n \n=\n \nMatlabSimplexer\n(\na\n,\n \nb\n)\n\n\n\nfunction\n \nOptim\n.\nsimplexer\n(\nS\n::\nMatlabSimplexer\n,\n \ninitial_x\n::\nArray\n{\nT\n,\n \nN\n})\n \nwhere\n \n{\nT\n,\n \nN\n}\n\n \nn\n \n=\n \nlength\n(\ninitial_x\n)\n\n \ninitial_simplex\n \n=\n \nArray\n{\nT\n,\n \nN\n}[\ncopy\n(\ninitial_x\n)\n \nfor\n \ni\n \n=\n \n1\n:\nn\n+\n1\n]\n\n \nfor\n \nj\n \n=\n \n1\n:\nn\n\n \ninitial_simplex\n[\nj\n+\n1\n][\nj\n]\n \n+=\n \ninitial_simplex\n[\nj\n+\n1\n][\nj\n]\n \n==\n \nzero\n(\nT\n)\n \n?\n \nS\n.\na\n \n:\n \nS\n.\nb\n \n*\n \ninitial_simplex\n[\nj\n+\n1\n][\nj\n]\n\n \nend\n\n \ninitial_simplex\n\n\nend\n\n\n\n\n\n\n\n\nThe parameters of Nelder-Mead\n\n\nThe different types of steps in the algorithm are governed by four parameters: $\\alpha$ for the reflection, 
$\\beta$ for the expansion, $\\gamma$ for the contraction, and $\\delta$ for the shrink step. We default to the adaptive parameters scheme in Gao and Han (2010). These are based on the dimensionality of the problem, and are given by\n\n\n\n\n\n\\alpha = 1, \\quad \\beta = 1+2/n,\\quad \\gamma = 0.75 - 1/(2n),\\quad \\delta = 1-1/n\n\n\n\n\n\nIt is also possible to specify the original parameters from Nelder and Mead (1965)\n\n\n\n\n\n\\alpha = 1,\\quad \\beta = 2, \\quad\\gamma = 1/2, \\quad\\delta = 1/2\n\n\n\n\n\nby specifying \nparameters = Optim.FixedParameters()\n. For specifying custom values, \nparameters = Optim.FixedParameters(\u03b1 = a, \u03b2 = b, \u03b3 = g, \u03b4 = d)\n is used, where a, b, g, d are the chosen values. If another parameter specification is wanted, it is possible to create a custom sub-type of \nOptim.NMParameters\n, and add a method to the \nparameters\n function. It should take the new type as the first positional argument, and the dimensionality of \nx\n as the second positional argument, and return a 4-tuple of parameters. However, it will often be easier to simply supply the wanted parameters to \nFixedParameters\n.\n\n\n\n\nReferences\n\n\nNelder, John A. and R. Mead (1965). \"A simplex method for function minimization\". Computer Journal 7: 308\u2013313. doi:10.1093/comjnl/7.4.308.\n\n\nLagarias, Jeffrey C., et al. \"Convergence properties of the Nelder\u2013Mead simplex method in low dimensions.\" SIAM Journal on optimization 9.1 (1998): 112-147.\n\n\nGao, Fuchang and Lixing Han (2010). \"Implementing the Nelder-Mead simplex algorithm with adaptive parameters\". Computational Optimization and Applications [DOI 10.1007/s10589-010-9329-3]",
"title": "Nelder Mead"
},
{
"location": "/algo/nelder_mead/#nelder-mead",
"text": "Nelder-Mead is currently the standard algorithm when no derivatives are provided.",
"title": "Nelder-Mead"
},
{
"location": "/algo/nelder_mead/#constructor",
"text": "NelderMead (; parameters = AdaptiveParameters (), \n initial_simplex = AffineSimplexer ()) The keywords in the constructor are used to control the following parts of the solver: parameters is a an instance of either AdaptiveParameters or FixedParameters , and is used to generate parameters for the Nelder-Mead Algorithm. initial_simplex is an instance of AffineSimplexer . See more details below.",
"title": "Constructor"
},
{
"location": "/algo/nelder_mead/#description",
"text": "Our current implementation of the Nelder-Mead algorithm is based on Nelder and Mead (1965) and Gao and Han (2010). Gradient free methods can be a bit sensitive to starting values and tuning parameters, so it is a good idea to be careful with the defaults provided in Optim. Instead of using gradient information, Nelder-Mead is a direct search method. It keeps track of the function value at a number of points in the search space. Together, the points form a simplex. Given a simplex, we can perform one of four actions: reflect, expand, contract, or shrink. Basically, the goal is to iteratively replace the worst point with a better point. More information can be found in Nelder and Mead (1965), Lagarias, et al (1998) or Gao and Han (2010). The stopping rule is the same as in the original paper, and is the standard error of the function values at the vertices. To set the tolerance level for this convergence criterion, set the g_tol level as described in the Configurable Options section. When the solver finishes, we return a minimizer which is either the centroid or one of the vertices. The function value at the centroid adds a function evaluation, as we need to evaluate the objection at the centroid to choose the smallest function value. However, even if the function value at the centroid can be returned as the minimum, we do not trace it during the optimization iterations. This is to avoid too many evaluations of the objective function which can be computationally expensive. Typically, there should be no more than twice as many f_calls than iterations . Adding an evaluation at the centroid when tracing could considerably increase the total run-time of the algorithm.",
"title": "Description"
},
{
"location": "/algo/nelder_mead/#specifying-the-initial-simplex",
"text": "The default choice of initial_simplex is AffineSimplexer() . A simplex is represented by an $(n+1)$-dimensional vector of $n$-dimensional vectors. It is used together with the initial x to create the initial simplex. To construct the $i$th vertex, it simply multiplies entry $i$ in the initial vector with a constant b , and adds a constant a . This means that the $i$th of the $n$ additional vertices is of the form \n(x_0^1, x_0^2, \\ldots, x_0^i, \\ldots, 0,0) + (0, 0, \\ldots, x_0^i\\cdot b+a,\\ldots, 0,0) If an $x_0^i$ is zero, we need the $a$ to make sure all vertices are unique. Generally, it is advised to start with a relatively large simplex. If a specific simplex is wanted, it is possible to construct the $(n+1)$-vector of $n$-dimensional vectors, and pass it to the solver using a new type definition and a new method for the function simplexer . For example, let us minimize the two-dimensional Rosenbrock function, and choose three vertices that have elements that are simply standard uniform draws. using Optim struct MySimplexer : Optim . Simplexer end Optim . simplexer ( S :: MySimplexer , initial_x ) = [ rand ( length ( initial_x )) for i = 1 : length ( initial_x ) + 1 ] f ( x ) = ( 1.0 - x [ 1 ]) ^ 2 + 100.0 * ( x [ 2 ] - x [ 1 ] ^ 2 ) ^ 2 optimize ( f , [ . 0 , . 0 ], NelderMead ( initial_simplex = MySimplexer ())) Say we want to implement the initial simplex as in Matlab's fminsearch . This is very close to the AffineSimplexer above, but with a small twist. Instead of always adding the a , a constant is only added to entries that are zero. If the entry is non-zero, five percent of the level is added. This might be implemented (by the user) as struct MatlabSimplexer : Optim . Simplexer \n a :: Float64 \n b :: Float64 end MatlabSimplexer (; a = 0.00025 , b = 0.05 ) = MatlabSimplexer ( a , b ) function Optim . 
simplexer ( A :: MatlabSimplexer , initial_x :: Array { T , N }) where { T , N } \n n = length ( initial_x ) \n initial_simplex = Array { T , N }[ initial_x for i = 1 : n + 1 ] \n for j = 1 : n \n initial_simplex [ j + 1 ][ j ] += initial_simplex [ j + 1 ][ j ] == zero ( T ) ? S . b * initial_simplex [ j + 1 ][ j ] : S . a \n end \n initial_simplex end",
"title": "Specifying the initial simplex"
},
{
"location": "/algo/nelder_mead/#the-parameters-of-nelder-mead",
"text": "The different types of steps in the algorithm are governed by four parameters: $\\alpha$ for the reflection, $\\beta$ for the expansion, $\\gamma$ for the contraction, and $\\delta$ for the shrink step. We default to the adaptive parameters scheme in Gao and Han (2010). These are based on the dimensionality of the problem, and are given by \n\\alpha = 1, \\quad \\beta = 1+2/n,\\quad \\gamma =0.75 + 1/2n,\\quad \\delta = 1-1/n It is also possible to specify the original parameters from Nelder and Mead (1965) \n\\alpha = 1,\\quad \\beta = 2, \\quad\\gamma = 1/2, \\quad\\delta = 1/2 by specifying parameters = Optim.FixedParameters() . For specifying custom values, parameters = Optim.FixedParameters(\u03b1 = a, \u03b2 = b, \u03b3 = g, \u03b4 = d) is used, where a, b, g, d are the chosen values. If another parameter specification is wanted, it is possible to create a custom sub-type of Optim.NMParameters , and add a method to the parameters function. It should take the new type as the first positional argument, and the dimensionality of x as the second positional argument, and return a 4-tuple of parameters. However, it will often be easier to simply supply the wanted parameters to FixedParameters .",
"title": "The parameters of Nelder-Mead"
},
{
"location": "/algo/nelder_mead/#references",
"text": "Nelder, John A. and R. Mead (1965). \"A simplex method for function minimization\". Computer Journal 7: 308\u2013313. doi:10.1093/comjnl/7.4.308. Lagarias, Jeffrey C., et al. \"Convergence properties of the Nelder\u2013Mead simplex method in low dimensions.\" SIAM Journal on optimization 9.1 (1998): 112-147. Gao, Fuchang and Lixing Han (2010). \"Implementing the Nelder-Mead simplex algorithm with adaptive parameters\". Computational Optimization and Applications [DOI 10.1007/s10589-010-9329-3]",
"title": "References"
},
{
"location": "/algo/simulated_annealing/",
"text": "Simulated Annealing\n\n\n\n\nConstructor\n\n\nSimulatedAnnealing\n(;\n \nneighbor\n \n=\n \ndefault_neighbor!\n,\n\n \nT\n \n=\n \ndefault_temperature\n,\n\n \np\n \n=\n \nkirkpatrick\n)\n\n\n\n\n\n\nThe constructor takes three keywords:\n\n\n\n\nneighbor = a!(x_proposed, x_current)\n, a mutating function of the current x, and the proposed x\n\n\nT = b(iteration)\n, a function of the current iteration that returns a temperature\n\n\np = c(f_proposal, f_current, T)\n, a function of the current temperature, current function value and proposed function value that returns an acceptance probability\n\n\n\n\n\n\nDescription\n\n\nSimulated Annealing is a derivative free method for optimization. It is based on the Metropolis-Hastings algorithm that was originally used to generate samples from a thermodynamics system, and is often used to generate draws from a posterior when doing Bayesian inference. As such, it is a probabilistic method for finding the minimum of a function, often over a quite large domains. For the historical reasons given above, the algorithm uses terms such as cooling, temperature, and acceptance probabilities.\n\n\nAs the constructor shows, a simulated annealing implementation is characterized by a temperature, a neighbor function, and an acceptance probability. The temperature controls how volatile the changes in minimizer candidates are allowed to be, as it enters the acceptance probability. For example, the original Kirkpatrick et al. acceptance probability function can be written as follows\n\n\np\n(\nf_proposal\n,\n \nf_current\n,\n \nT\n)\n \n=\n \nexp\n(\n-\n(\nf_proposal\n \n-\n \nf_current\n)\n/\nT\n)\n\n\n\n\n\n\nA high temperature makes it more likely that a draw is accepted, by pushing acceptance probability to 1. As in the Metropolis-Hastings algorithm, we always accept a smaller function value, but we also sometimes accept a larger value. 
As the temperature decreases, we're more and more likely to only accept candidate \nx\n's that lowers the function value. To obtain a new \nf_proposal\n, we need a neighbor function. A simple neighbor function adds a standard normal draw to each dimension of \nx\n\n\nfunction\n \nneighbor!\n(\nx_proposal\n::\nArray\n,\n \nx\n::\nArray\n)\n\n \nfor\n \ni\n \nin\n \neachindex\n(\nx\n)\n\n \nx_proposal\n[\ni\n]\n \n=\n \nx\n[\ni\n]\n+\nrandn\n()\n\n \nend\n\n\nend\n\n\n\n\n\n\nAs we see, it is not really possible to disentangle the role of the different components of the algorithm. For example, both the functional form of the acceptance function, the temperature and (indirectly) the neighbor function determine if the next draw of \nx\n is accepted or not.\n\n\nThe current implementation of Simulated Annealing is very rough. It lacks quite a few features which are normally part of a proper SA implementation. A better implementation is under way, see \nthis issue\n.\n\n\n\n\nExample\n\n\n\n\nReferences",
"title": "Simulated Annealing"
},
{
"location": "/algo/simulated_annealing/#simulated-annealing",
"text": "",
"title": "Simulated Annealing"
},
{
"location": "/algo/simulated_annealing/#constructor",
"text": "SimulatedAnnealing (; neighbor = default_neighbor! , \n T = default_temperature , \n p = kirkpatrick ) The constructor takes three keywords: neighbor = a!(x_proposed, x_current) , a mutating function of the current x, and the proposed x T = b(iteration) , a function of the current iteration that returns a temperature p = c(f_proposal, f_current, T) , a function of the current temperature, current function value and proposed function value that returns an acceptance probability",
"title": "Constructor"
},
{
"location": "/algo/simulated_annealing/#description",
"text": "Simulated Annealing is a derivative free method for optimization. It is based on the Metropolis-Hastings algorithm that was originally used to generate samples from a thermodynamics system, and is often used to generate draws from a posterior when doing Bayesian inference. As such, it is a probabilistic method for finding the minimum of a function, often over a quite large domains. For the historical reasons given above, the algorithm uses terms such as cooling, temperature, and acceptance probabilities. As the constructor shows, a simulated annealing implementation is characterized by a temperature, a neighbor function, and an acceptance probability. The temperature controls how volatile the changes in minimizer candidates are allowed to be, as it enters the acceptance probability. For example, the original Kirkpatrick et al. acceptance probability function can be written as follows p ( f_proposal , f_current , T ) = exp ( - ( f_proposal - f_current ) / T ) A high temperature makes it more likely that a draw is accepted, by pushing acceptance probability to 1. As in the Metropolis-Hastings algorithm, we always accept a smaller function value, but we also sometimes accept a larger value. As the temperature decreases, we're more and more likely to only accept candidate x 's that lowers the function value. To obtain a new f_proposal , we need a neighbor function. A simple neighbor function adds a standard normal draw to each dimension of x function neighbor! ( x_proposal :: Array , x :: Array ) \n for i in eachindex ( x ) \n x_proposal [ i ] = x [ i ] + randn () \n end end As we see, it is not really possible to disentangle the role of the different components of the algorithm. For example, both the functional form of the acceptance function, the temperature and (indirectly) the neighbor function determine if the next draw of x is accepted or not. The current implementation of Simulated Annealing is very rough. 
It lacks quite a few features which are normally part of a proper SA implementation. A better implementation is under way, see this issue .",
"title": "Description"
},
{
"location": "/algo/simulated_annealing/#example",
"text": "",
"title": "Example"
},
{
"location": "/algo/simulated_annealing/#references",
"text": "",
"title": "References"
},
{
"location": "/algo/cg/",
"text": "Conjugate Gradient Descent\n\n\n\n\nConstructor\n\n\nConjugateGradient\n(;\n \nalphaguess\n \n=\n \nLineSearches\n.\nInitialHagerZhang\n(),\n\n \nlinesearch\n \n=\n \nLineSearches\n.\nHagerZhang\n(),\n\n \neta\n \n=\n \n0.4\n,\n\n \nP\n \n=\n \nnothing\n,\n\n \nprecondprep\n \n=\n \n(\nP\n,\n \nx\n)\n \n-\n \nnothing\n)\n\n\n\n\n\n\n\n\nDescription\n\n\nThe \nConjugateGradient\n method implements Hager and Zhang (2006) and elements from Hager and Zhang (2013). Notice, that the default \nlinesearch\n is \nHagerZhang\n from LineSearches.jl. This line search is exactly the one proposed in Hager and Zhang (2006). The constant $eta$ is used in determining the next step direction, and the default here deviates from the one used in the original paper ($0.01$). It needs to be a strictly positive number.\n\n\n\n\nExample\n\n\nLet's optimize the 2D Rosenbrock function. The function and gradient are given by\n\n\nf\n(\nx\n)\n \n=\n \n(\n1.0\n \n-\n \nx\n[\n1\n])\n^\n2\n \n+\n \n100.0\n \n*\n \n(\nx\n[\n2\n]\n \n-\n \nx\n[\n1\n]\n^\n2\n)\n^\n2\n\n\nfunction\n \ng\n!\n(\nstorage\n,\n \nx\n)\n\n \nstorage\n[\n1\n]\n \n=\n \n-\n2.0\n \n*\n \n(\n1.0\n \n-\n \nx\n[\n1\n])\n \n-\n \n400.0\n \n*\n \n(\nx\n[\n2\n]\n \n-\n \nx\n[\n1\n]\n^\n2\n)\n \n*\n \nx\n[\n1\n]\n\n \nstorage\n[\n2\n]\n \n=\n \n200.0\n \n*\n \n(\nx\n[\n2\n]\n \n-\n \nx\n[\n1\n]\n^\n2\n)\n\n\nend\n\n\n\n\n\n\nwe can then try to optimize this function from \nx=[0.0, 0.0]\n\n\njulia\n optimize(f, g!, zeros(2), ConjugateGradient())\nResults of Optimization Algorithm\n * Algorithm: Conjugate Gradient\n * Starting Point: [0.0,0.0]\n * Minimizer: [1.000000002262018,1.0000000045408348]\n * Minimum: 5.144946e-18\n * Iterations: 21\n * Convergence: true\n * |x - x\n| \n 1.0e-32: false\n |x - x\n| = 2.09e-10\n * |f(x) - f(x\n)| / |f(x)| \n 1.0e-32: false\n |f(x) - f(x\n)| / |f(x)| = 1.55e+00\n * |g(x)| \n 1.0e-08: true\n |g(x)| = 3.36e-09\n * stopped by an increasing objective: false\n * Reached Maximum Number of 
Iterations: false\n * Objective Calls: 54\n * Gradient Calls: 39\n\n\n\n\n\nWe can compare this to the default first order solver in Optim.jl\n\n\n julia\n optimize(f, g!, zeros(2))\n\n Results of Optimization Algorithm\n * Algorithm: L-BFGS\n * Starting Point: [0.0,0.0]\n * Minimizer: [0.9999999999373614,0.999999999868622]\n * Minimum: 7.645684e-21\n * Iterations: 16\n * Convergence: true\n * |x - x\n| \n 1.0e-32: false\n |x - x\n| = 3.48e-07\n * |f(x) - f(x\n)| / |f(x)| \n 1.0e-32: false\n |f(x) - f(x\n)| / |f(x)| = 9.03e+06\n * |g(x)| \n 1.0e-08: true\n |g(x)| = 2.32e-09\n * stopped by an increasing objective: false\n * Reached Maximum Number of Iterations: false\n * Objective Calls: 53\n * Gradient Calls: 53\n\n\n\n\n\nWe see that for this objective and starting point, \nConjugateGradient()\n requires fewer gradient evaluations to reach convergence.\n\n\n\n\nReferences\n\n\n\n\nW. W. Hager and H. Zhang (2006) Algorithm 851: CG_DESCENT, a conjugate gradient method with guaranteed descent. ACM Transactions on Mathematical Software 32: 113-137.\n\n\nW. W. Hager and H. Zhang (2013), The Limited Memory Conjugate Gradient Method. SIAM Journal on Optimization, 23, pp. 2150-2168.",
"title": "Conjugate Gradient"
},
{
"location": "/algo/cg/#conjugate-gradient-descent",
"text": "",
"title": "Conjugate Gradient Descent"
},
{
"location": "/algo/cg/#constructor",
"text": "ConjugateGradient (; alphaguess = LineSearches . InitialHagerZhang (), \n linesearch = LineSearches . HagerZhang (), \n eta = 0.4 , \n P = nothing , \n precondprep = ( P , x ) - nothing )",
"title": "Constructor"
},
{
"location": "/algo/cg/#description",
"text": "The ConjugateGradient method implements Hager and Zhang (2006) and elements from Hager and Zhang (2013). Notice, that the default linesearch is HagerZhang from LineSearches.jl. This line search is exactly the one proposed in Hager and Zhang (2006). The constant $eta$ is used in determining the next step direction, and the default here deviates from the one used in the original paper ($0.01$). It needs to be a strictly positive number.",
"title": "Description"
},
{
"location": "/algo/cg/#example",
"text": "Let's optimize the 2D Rosenbrock function. The function and gradient are given by f ( x ) = ( 1.0 - x [ 1 ]) ^ 2 + 100.0 * ( x [ 2 ] - x [ 1 ] ^ 2 ) ^ 2 function g ! ( storage , x ) \n storage [ 1 ] = - 2.0 * ( 1.0 - x [ 1 ]) - 400.0 * ( x [ 2 ] - x [ 1 ] ^ 2 ) * x [ 1 ] \n storage [ 2 ] = 200.0 * ( x [ 2 ] - x [ 1 ] ^ 2 ) end we can then try to optimize this function from x=[0.0, 0.0] julia optimize(f, g!, zeros(2), ConjugateGradient())\nResults of Optimization Algorithm\n * Algorithm: Conjugate Gradient\n * Starting Point: [0.0,0.0]\n * Minimizer: [1.000000002262018,1.0000000045408348]\n * Minimum: 5.144946e-18\n * Iterations: 21\n * Convergence: true\n * |x - x | 1.0e-32: false\n |x - x | = 2.09e-10\n * |f(x) - f(x )| / |f(x)| 1.0e-32: false\n |f(x) - f(x )| / |f(x)| = 1.55e+00\n * |g(x)| 1.0e-08: true\n |g(x)| = 3.36e-09\n * stopped by an increasing objective: false\n * Reached Maximum Number of Iterations: false\n * Objective Calls: 54\n * Gradient Calls: 39 We can compare this to the default first order solver in Optim.jl julia optimize(f, g!, zeros(2))\n\n Results of Optimization Algorithm\n * Algorithm: L-BFGS\n * Starting Point: [0.0,0.0]\n * Minimizer: [0.9999999999373614,0.999999999868622]\n * Minimum: 7.645684e-21\n * Iterations: 16\n * Convergence: true\n * |x - x | 1.0e-32: false\n |x - x | = 3.48e-07\n * |f(x) - f(x )| / |f(x)| 1.0e-32: false\n |f(x) - f(x )| / |f(x)| = 9.03e+06\n * |g(x)| 1.0e-08: true\n |g(x)| = 2.32e-09\n * stopped by an increasing objective: false\n * Reached Maximum Number of Iterations: false\n * Objective Calls: 53\n * Gradient Calls: 53 We see that for this objective and starting point, ConjugateGradient() requires fewer gradient evaluations to reach convergence.",
"title": "Example"
},
{
"location": "/algo/cg/#references",
"text": "W. W. Hager and H. Zhang (2006) Algorithm 851: CG_DESCENT, a conjugate gradient method with guaranteed descent. ACM Transactions on Mathematical Software 32: 113-137. W. W. Hager and H. Zhang (2013), The Limited Memory Conjugate Gradient Method. SIAM Journal on Optimization, 23, pp. 2150-2168.",
"title": "References"
},
{
"location": "/algo/gradientdescent/",
"text": "Gradient Descent\n\n\n\n\nConstructor\n\n\nGradientDescent\n(;\n \nalphaguess\n \n=\n \nLineSearches\n.\nInitialPrevious\n(),\n\n \nlinesearch\n \n=\n \nLineSearches\n.\nHagerZhang\n(),\n\n \nP\n \n=\n \nnothing\n,\n\n \nprecondprep\n \n=\n \n(\nP\n,\n \nx\n)\n \n-\n \nnothing\n)\n\n\n\n\n\n\n\n\nDescription\n\n\nGradient Descent a common name for a quasi-Newton solver. This means that it takes steps according to\n\n\n\n\n\nx_{n+1} = x_n - P^{-1}\\nabla f(x_n)\n\n\n\n\n\nwhere $P$ is a positive definite matrix. If $P$ is the Hessian, we get Newton's method. In Gradient Descent, $P$ is simply an appropriately dimensioned identity matrix, such that we go in the exact opposite direction of the gradient. This means that we do not use the curvature information from the Hessian, or an approximation of it. While it does seem quite logical to go in the opposite direction of the fastest increase in objective value, the procedure can be very slow if the problem is ill-conditioned. See the section on preconditioners for ways to remedy this when using Gradient Descent.\n\n\nAs with the other quasi-Newton solvers in this package, a scalar $\\alpha$ is introduced as follows\n\n\n\n\n\nx_{n+1} = x_n - \\alpha P^{-1}\\nabla f(x_n)\n\n\n\n\n\nand is chosen by a linesearch algorithm such that each step gives sufficient descent.\n\n\n\n\nExample\n\n\n\n\nReferences",
"title": "Gradient Descent"
},
{
"location": "/algo/gradientdescent/#gradient-descent",
"text": "",
"title": "Gradient Descent"
},
{
"location": "/algo/gradientdescent/#constructor",
"text": "GradientDescent (; alphaguess = LineSearches . InitialPrevious (), \n linesearch = LineSearches . HagerZhang (), \n P = nothing , \n precondprep = ( P , x ) - nothing )",
"title": "Constructor"
},
{
"location": "/algo/gradientdescent/#description",
"text": "Gradient Descent a common name for a quasi-Newton solver. This means that it takes steps according to \nx_{n+1} = x_n - P^{-1}\\nabla f(x_n) where $P$ is a positive definite matrix. If $P$ is the Hessian, we get Newton's method. In Gradient Descent, $P$ is simply an appropriately dimensioned identity matrix, such that we go in the exact opposite direction of the gradient. This means that we do not use the curvature information from the Hessian, or an approximation of it. While it does seem quite logical to go in the opposite direction of the fastest increase in objective value, the procedure can be very slow if the problem is ill-conditioned. See the section on preconditioners for ways to remedy this when using Gradient Descent. As with the other quasi-Newton solvers in this package, a scalar $\\alpha$ is introduced as follows \nx_{n+1} = x_n - \\alpha P^{-1}\\nabla f(x_n) and is chosen by a linesearch algorithm such that each step gives sufficient descent.",
"title": "Description"
},
{
"location": "/algo/gradientdescent/#example",
"text": "",
"title": "Example"
},
{
"location": "/algo/gradientdescent/#references",
"text": "",
"title": "References"
},
{
"location": "/algo/lbfgs/",
"text": "(L-)BFGS\n\n\nThis page contains information about BFGS and its limited memory version L-BFGS.\n\n\n\n\nConstructors\n\n\nBFGS\n(;\n \nalphaguess\n \n=\n \nLineSearches\n.\nInitialStatic\n(),\n\n \nlinesearch\n \n=\n \nLineSearches\n.\nHagerZhang\n(),\n\n \nP\n \n=\n \nnothing\n,\n\n \nprecondprep\n \n=\n \n(\nP\n,\n \nx\n)\n \n-\n \nnothing\n)\n\n\n\n\n\n\nLBFGS\n(;\n \nm\n \n=\n \n10\n,\n\n \nalphaguess\n \n=\n \nLineSearches\n.\nInitialStatic\n(),\n\n \nlinesearch\n \n=\n \nLineSearches\n.\nHagerZhang\n(),\n\n \nP\n \n=\n \nnothing\n,\n\n \nprecondprep\n \n=\n \n(\nP\n,\n \nx\n)\n \n-\n \nnothing\n,\n\n \nmanifold\n \n=\n \nFlat\n(),\n\n \nscaleinvH0\n::\nBool\n \n=\n \ntrue\n \n \n(\ntypeof\n(\nP\n)\n \n:\n \nVoid\n))\n\n\n\n\n\n\n\n\nDescription\n\n\nThis means that it takes steps according to\n\n\n\n\n\nx_{n+1} = x_n - P^{-1}\\nabla f(x_n)\n\n\n\n\n\nwhere $P$ is a positive definite matrix. If $P$ is the Hessian, we get Newton's method. In (L-)BFGS, the matrix is an approximation to the Hessian built using differences in the gradient across iterations. As long as the initial matrix is positive definite it is possible to show that all the follow matrices will be as well. The starting matrix could simply be the identity matrix, such that the first step is identical to the Gradient Descent algorithm, or even the actual Hessian.\n\n\nThere are two versions of BFGS in the package: BFGS, and L-BFGS. The latter is different from the former because it doesn't use a complete history of the iterative procedure to construct $P$, but rather only the latest $m$ steps. It doesn't actually build the Hessian approximation matrix either, but computes the direction directly. 
This makes more suitable for large scale problems, as the memory requirement to store the relevant vectors will grow quickly in large problems.\n\n\nAs with the other quasi-Newton solvers in this package, a scalar $\\alpha$ is introduced as follows\n\n\n\n\n\nx_{n+1} = x_n - \\alpha P^{-1}\\nabla f(x_n)\n\n\n\n\n\nand is chosen by a linesearch algorithm such that each step gives sufficient descent.\n\n\n\n\nExample\n\n\n\n\nReferences\n\n\nWright, Stephen, and Jorge Nocedal (2006) \"Numerical optimization.\" Springer",
"title": "(L-)BFGS"
},
{
"location": "/algo/lbfgs/#l-bfgs",
"text": "This page contains information about BFGS and its limited memory version L-BFGS.",
"title": "(L-)BFGS"
},
{
"location": "/algo/lbfgs/#constructors",
"text": "BFGS (; alphaguess = LineSearches . InitialStatic (), \n linesearch = LineSearches . HagerZhang (), \n P = nothing , \n precondprep = ( P , x ) - nothing ) LBFGS (; m = 10 , \n alphaguess = LineSearches . InitialStatic (), \n linesearch = LineSearches . HagerZhang (), \n P = nothing , \n precondprep = ( P , x ) - nothing , \n manifold = Flat (), \n scaleinvH0 :: Bool = true ( typeof ( P ) : Void ))",
"title": "Constructors"
},
{
"location": "/algo/lbfgs/#description",
"text": "This means that it takes steps according to \nx_{n+1} = x_n - P^{-1}\\nabla f(x_n) where $P$ is a positive definite matrix. If $P$ is the Hessian, we get Newton's method. In (L-)BFGS, the matrix is an approximation to the Hessian built using differences in the gradient across iterations. As long as the initial matrix is positive definite it is possible to show that all the follow matrices will be as well. The starting matrix could simply be the identity matrix, such that the first step is identical to the Gradient Descent algorithm, or even the actual Hessian. There are two versions of BFGS in the package: BFGS, and L-BFGS. The latter is different from the former because it doesn't use a complete history of the iterative procedure to construct $P$, but rather only the latest $m$ steps. It doesn't actually build the Hessian approximation matrix either, but computes the direction directly. This makes more suitable for large scale problems, as the memory requirement to store the relevant vectors will grow quickly in large problems. As with the other quasi-Newton solvers in this package, a scalar $\\alpha$ is introduced as follows \nx_{n+1} = x_n - \\alpha P^{-1}\\nabla f(x_n) and is chosen by a linesearch algorithm such that each step gives sufficient descent.",
"title": "Description"
},
{
"location": "/algo/lbfgs/#example",
"text": "",
"title": "Example"
},
{
"location": "/algo/lbfgs/#references",
"text": "Wright, Stephen, and Jorge Nocedal (2006) \"Numerical optimization.\" Springer",
"title": "References"
},
{
"location": "/algo/newton/",
"text": "Newton's Method\n\n\n\n\nConstructor\n\n\nNewton\n(;\n \nalphaguess\n \n=\n \nLineSearches\n.\nInitialStatic\n(),\n\n \nlinesearch\n \n=\n \nLineSearches\n.\nHagerZhang\n())\n\n\n\n\n\n\nThe constructor takes two keywords:\n\n\n\n\nlinesearch = a(d, x, p, x_new, g_new, lsr, c, mayterminate)\n, a function performing line search, see the line search section.\n\n\nalphaguess = a(state, dphi0, d)\n, a function for setting the initial guess for the line search algorithm, see the line search section.\n\n\n\n\n\n\nDescription\n\n\nNewton's method for optimization has a long history, and is in some sense the gold standard in unconstrained optimization of smooth functions, at least from a theoretical viewpoint. The main benefit is that it has a quadratic rate of convergence near a local optimum. The main disadvantage is that the user has to provide a Hessian. This can be difficult, complicated, or simply annoying. It can also be computationally expensive to calculate it.\n\n\nNewton's method for optimization consists of applying Newton's method for solving systems of equations, where the equations are the first order conditions, saying that the gradient should equal the zero vector.\n\n\n\n\n\n\\nabla f(x) = 0\n\n\n\n\n\nA second order Taylor expansion of the left-hand side leads to the iterative scheme\n\n\n\n\n\nx_{n+1} = x_n - H(x_n)^{-1}\\nabla f(x_n)\n\n\n\n\n\nwhere the inverse is not calculated directly, but the step size is instead calculated by solving\n\n\n\n\n\nH(x) \\textbf{s} = \\nabla f(x_n).\n\n\n\n\n\nThis is equivalent to minimizing a quadratic model, $m_k$ around the current $x_n$\n\n\n\n\n\nm_k(s) = f(x_n) + \\nabla f(x_n)^\\top \\textbf{s} + \\frac{1}{2} \\textbf{s}^\\top H(x_n) \\textbf{s}\n\n\n\n\n\nFor functions where $H(x_n)$ is difficult, or computationally expensive to obtain, we might replace the Hessian with another positive definite matrix that approximates it. 
Such methods are called Quasi-Newton methods; see (L-)BFGS and Gradient Descent.\n\n\nIn a sufficiently small neighborhood around the minimizer, Newton's method has quadratic convergence, but globally it might have slower convergence, or it might even diverge. To ensure convergence, a line search is performed for each $\\textbf{s}$. This amounts to replacing the step formula above with\n\n\n\n\n\nx_{n+1} = x_n - \\alpha \\textbf{s}\n\n\n\n\n\nand finding a scalar $\\alpha$ such that we get sufficient descent; see the line search section for more information.\n\n\nAdditionally, if the function is locally concave, the step taken in the formulas above will go in a direction of ascent, as the Hessian will not be positive (semi)definite. To avoid this, we use a specialized method to calculate the step direction. If the Hessian is positive semidefinite then the method used is standard, but if it is not, a correction is made using the functionality in \nPositiveFactorizations.jl\n.\n\n\n\n\nExample\n\n\nshow the example from the issue\n\n\n\n\nReferences",
"title": "Newton"
},
{
"location": "/algo/newton/#newtons-method",
"text": "",
"title": "Newton's Method"
},
{
"location": "/algo/newton/#constructor",
"text": "Newton (; alphaguess = LineSearches . InitialStatic (), \n linesearch = LineSearches . HagerZhang ()) The constructor takes two keywords: linesearch = a(d, x, p, x_new, g_new, lsr, c, mayterminate) , a function performing line search, see the line search section. alphaguess = a(state, dphi0, d) , a function for setting the initial guess for the line search algorithm, see the line search section.",
"title": "Constructor"
},
{
"location": "/algo/newton/#description",
"text": "Newton's method for optimization has a long history, and is in some sense the gold standard in unconstrained optimization of smooth functions, at least from a theoretical viewpoint. The main benefit is that it has a quadratic rate of convergence near a local optimum. The main disadvantage is that the user has to provide a Hessian. This can be difficult, complicated, or simply annoying. It can also be computationally expensive to calculate it. Newton's method for optimization consists of applying Newton's method for solving systems of equations, where the equations are the first order conditions, saying that the gradient should equal the zero vector. \n\\nabla f(x) = 0 A second order Taylor expansion of the left-hand side leads to the iterative scheme \nx_{n+1} = x_n - H(x_n)^{-1}\\nabla f(x_n) where the inverse is not calculated directly, but the step size is instead calculated by solving \nH(x) \\textbf{s} = \\nabla f(x_n). This is equivalent to minimizing a quadratic model, $m_k$ around the current $x_n$ \nm_k(s) = f(x_n) + \\nabla f(x_n)^\\top \\textbf{s} + \\frac{1}{2} \\textbf{s}^\\top H(x_n) \\textbf{s} For functions where $H(x_n)$ is difficult, or computationally expensive to obtain, we might replace the Hessian with another positive definite matrix that approximates it. Such methods are called Quasi-Newton methods; see (L-)BFGS and Gradient Descent. In a sufficiently small neighborhood around the minimizer, Newton's method has quadratic convergence, but globally it might have slower convergence, or it might even diverge. To ensure convergence, a line search is performed for each $\\textbf{s}$. This amounts to replacing the step formula above with \nx_{n+1} = x_n - \\alpha \\textbf{s} and finding a scalar $\\alpha$ such that we get sufficient descent; see the line search section for more information. 
Additionally, if the function is locally concave, the step taken in the formulas above will go in a direction of ascent, as the Hessian will not be positive (semi)definite. To avoid this, we use a specialized method to calculate the step direction. If the Hessian is positive semidefinite then the method used is standard, but if it is not, a correction is made using the functionality in PositiveFactorizations.jl .",
"title": "Description"
},
{
"location": "/algo/newton/#example",
"text": "show the example from the issue",
"title": "Example"
},
{
"location": "/algo/newton/#references",
"text": "",
"title": "References"
},
{
"location": "/algo/newton_trust_region/",
"text": "Newton's Method With a Trust Region\n\n\n\n\nConstructor\n\n\nNewtonTrustRegion\n(;\n \ninitial_delta\n \n=\n \n1.0\n,\n\n \ndelta_hat\n \n=\n \n100.0\n,\n\n \neta\n \n=\n \n0.1\n,\n\n \nrho_lower\n \n=\n \n0.25\n,\n\n \nrho_upper\n \n=\n \n0.75\n)\n\n\n\n\n\n\nThe constructor takes keywords that determine the initial and maximal size of the trust region, when to grow and shrink the region, and how close the function should be to the quadratic approximation. The notation follows chapter four of Numerical Optimization. Below, \nrho\n $=\\rho$ refers to the ratio of the actual function change to the change in the quadratic approximation for a given step.\n\n\n\n\ninitial_delta:\nThe starting trust region radius\n\n\ndelta_hat:\n The largest allowable trust region radius\n\n\neta:\n When \nrho\n is at least \neta\n, accept the step.\n\n\nrho_lower:\n When \nrho\n is less than \nrho_lower\n, shrink the trust region.\n\n\nrho_upper:\n When \nrho\n is greater than \nrho_upper\n, grow the trust region (though no greater than \ndelta_hat\n).\n\n\n\n\n\n\nDescription\n\n\nNewton's method with a trust region is designed to take advantage of the second-order information in a function's Hessian, but with more stability that Newton's method when functions are not globally well-approximated by a quadratic. This is achieved by repeatedly minimizing quadratic approximations within a dynamically-sized \"trust region\" in which the function is assumed to be locally quadratic [1].\n\n\nNewton's method optimizes a quadratic approximation to a function. When a function is well approximated by a quadratic (for example, near an optimum), Newton's method converges very quickly by exploiting the second-order information in the Hessian matrix. 
However, when the function is not well-approximated by a quadratic, either because the starting point is far from the optimum or the function has a more irregular shape, Newton steps can be erratically large, leading to distant, irrelevant areas of the space.\n\n\nTrust region methods use second-order information but restrict the steps to be within a \"trust region\" where the function is believed to be approximately quadratic. At iteration $k$, a trust region method chooses a step $p$ to minimize a quadratic approximation to the objective such that the step size is no larger than a given trust region size, $\\Delta_k$.\n\n\n\n\n\n\\underset{p\\in\\mathbb{R}^n}\\min m_k(p) = f_k + g_k^T p + \\frac{1}{2}p^T B_k p \\quad\\textrm{such that } ||p||\\le \\Delta_k\n\n\n\n\n\nHere, $p$ is the step to take at iteration $k$, so that $x_{k+1} = x_k + p$. In the definition of $m_k(p)$, $f_k = f(x_k)$ is the value at the current iterate, $g_k=\\nabla f(x_k)$ is the gradient at the current iterate, $B_k = \\nabla^2 f(x_k)$ is the Hessian matrix at the current iterate, and $||\\cdot||$ is the Euclidean norm.\n\n\nIf the trust region size, $\\Delta_k$, is large enough that the minimizer of the quadratic approximation $m_k(p)$ has $||p|| \\le \\Delta_k$, then the step is the same as an ordinary Newton step. However, if the unconstrained quadratic minimizer lies outside the trust region, then the minimizer to the constrained problem will occur on the boundary, i.e. we will have $||p|| = \\Delta_k$. It turns out that when the Cholesky decomposition of $B_k$ can be computed, the optimal $p$ can be found numerically with relative ease ([1], section 4.3). This is the method currently used in Optim.\n\n\nIt makes sense to adapt the trust region size, $\\Delta_k$, as one moves through the space and assesses the quality of the quadratic fit. 
This adaptation is controlled by the parameters $\\eta$, $\\rho_{lower}$, and $\\rho_{upper}$, which are parameters to the \nNewtonTrustRegion\n optimization method. For each step, we calculate\n\n\n\n\n\n\\rho_k := \\frac{f(x_{k+1}) - f(x_k)}{m_k(p) - m_k(0)}\n\n\n\n\n\nIntuitively, $\\rho_k$ measures the quality of the quadratic approximation: if $\\rho_k \\approx 1$, then our quadratic approximation is reasonable. If $p$ was on the boundary and $\\rho_k > \\rho_{upper}$, then perhaps we can benefit from larger steps. In this case, for the next iteration we grow the trust region geometrically up to a maximum of $\\hat\\Delta$:\n\n\n\n\n\n\\rho_k > \\rho_{upper} \\Rightarrow \\Delta_{k+1} = \\min(2 \\Delta_k, \\hat\\Delta).\n\n\n\n\n\nConversely, if $\\rho_k < \\rho_{lower}$, then we shrink the trust region geometrically:\n\n\n$\\rho_k < \\rho_{lower} \\Rightarrow \\Delta_{k+1} = 0.25 \\Delta_k$. Finally, we only accept a point if its decrease is appreciable compared to the quadratic approximation. Specifically, a step is only accepted if $\\rho_k > \\eta$. As long as we choose $\\eta$ to be less than $\\rho_{lower}$, we will shrink the trust region whenever we reject a step. Eventually, if the objective function is locally quadratic, $\\Delta_k$ will become small enough that a quadratic approximation will be accurate enough to make progress again.\n\n\n\n\nExample\n\n\nusing\n \nOptim\n\n\nprob\n \n=\n \nOptim\n.\nUnconstrainedProblems\n.\nexamples\n[\n\"Rosenbrock\"\n];\n\n\nres\n \n=\n \nOptim\n.\noptimize\n(\nprob\n.\nf\n,\n \nprob\n.\ng!\n,\n \nprob\n.\nh!\n,\n \nprob\n.\ninitial_x\n,\n \nmethod\n=\nNewtonTrustRegion\n())\n\n\n\n\n\n\n\n\nReferences\n\n\n[1] Nocedal, Jorge, and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.",
"title": "Newton with Trust Region"
},
{
"location": "/algo/newton_trust_region/#newtons-method-with-a-trust-region",
"text": "",
"title": "Newton's Method With a Trust Region"
},
{
"location": "/algo/newton_trust_region/#constructor",
"text": "NewtonTrustRegion (; initial_delta = 1.0 , \n delta_hat = 100.0 , \n eta = 0.1 , \n rho_lower = 0.25 , \n rho_upper = 0.75 ) The constructor takes keywords that determine the initial and maximal size of the trust region, when to grow and shrink the region, and how close the function should be to the quadratic approximation. The notation follows chapter four of Numerical Optimization. Below, rho $=\\rho$ refers to the ratio of the actual function change to the change in the quadratic approximation for a given step. initial_delta: The starting trust region radius delta_hat: The largest allowable trust region radius eta: When rho is at least eta , accept the step. rho_lower: When rho is less than rho_lower , shrink the trust region. rho_upper: When rho is greater than rho_upper , grow the trust region (though no greater than delta_hat ).",
"title": "Constructor"
},
{
"location": "/algo/newton_trust_region/#description",
"text": "Newton's method with a trust region is designed to take advantage of the second-order information in a function's Hessian, but with more stability that Newton's method when functions are not globally well-approximated by a quadratic. This is achieved by repeatedly minimizing quadratic approximations within a dynamically-sized \"trust region\" in which the function is assumed to be locally quadratic [1]. Newton's method optimizes a quadratic approximation to a function. When a function is well approximated by a quadratic (for example, near an optimum), Newton's method converges very quickly by exploiting the second-order information in the Hessian matrix. However, when the function is not well-approximated by a quadratic, either because the starting point is far from the optimum or the function has a more irregular shape, Newton steps can be erratically large, leading to distant, irrelevant areas of the space. Trust region methods use second-order information but restrict the steps to be within a \"trust region\" where the function is believed to be approximately quadratic. At iteration $k$, a trust region method chooses a step $p$ to minimize a quadratic approximation to the objective such that the step size is no larger than a given trust region size, $\\Delta_k$. \n\\underset{p\\in\\mathbb{R}^n}\\min m_k(p) = f_k + g_k^T p + \\frac{1}{2}p^T B_k p \\quad\\textrm{such that } ||p||\\le \\Delta_k Here, $p$ is the step to take at iteration $k$, so that $x_{k+1} = x_k + p$. In the definition of $m_k(p)$, $f_k = f(x_k)$ is the value at the previous location, $g_k=\\nabla f(x_k)$ is the gradient at the previous location, $B_k = \\nabla^2 f(x_k)$ is the Hessian matrix at the previous iterate, and $||\\cdot||$ is the Euclidian norm. If the trust region size, $\\Delta_k$, is large enough that the minimizer of the quadratic approximation $m_k(p)$ has $||p|| \\le \\Delta_k$, then the step is the same as an ordinary Newton step. 
However, if the unconstrained quadratic minimizer lies outside the trust region, then the minimizer to the constrained problem will occur on the boundary, i.e. we will have $||p|| = \\Delta_k$. It turns out that when the Cholesky decomposition of $B_k$ can be computed, the optimal $p$ can be found numerically with relative ease ([1], section 4.3). This is the method currently used in Optim. It makes sense to adapt the trust region size, $\\Delta_k$, as one moves through the space and assesses the quality of the quadratic fit. This adaptation is controlled by the parameters $\\eta$, $\\rho_{lower}$, and $\\rho_{upper}$, which are parameters to the NewtonTrustRegion optimization method. For each step, we calculate \n\\rho_k := \\frac{f(x_{k+1}) - f(x_k)}{m_k(p) - m_k(0)} Intuitively, $\\rho_k$ measures the quality of the quadratic approximation: if $\\rho_k \\approx 1$, then our quadratic approximation is reasonable. If $p$ was on the boundary and $\\rho_k > \\rho_{upper}$, then perhaps we can benefit from larger steps. In this case, for the next iteration we grow the trust region geometrically up to a maximum of $\\hat\\Delta$: \n\\rho_k > \\rho_{upper} \\Rightarrow \\Delta_{k+1} = \\min(2 \\Delta_k, \\hat\\Delta). Conversely, if $\\rho_k < \\rho_{lower}$, then we shrink the trust region geometrically: $\\rho_k < \\rho_{lower} \\Rightarrow \\Delta_{k+1} = 0.25 \\Delta_k$. Finally, we only accept a point if its decrease is appreciable compared to the quadratic approximation. Specifically, a step is only accepted if $\\rho_k > \\eta$. As long as we choose $\\eta$ to be less than $\\rho_{lower}$, we will shrink the trust region whenever we reject a step. Eventually, if the objective function is locally quadratic, $\\Delta_k$ will become small enough that a quadratic approximation will be accurate enough to make progress again.",
"title": "Description"
},
{
"location": "/algo/newton_trust_region/#example",
"text": "using Optim prob = Optim . UnconstrainedProblems . examples [ Rosenbrock ]; res = Optim . optimize ( prob . f , prob . g! , prob . h! , prob . initial_x , method = NewtonTrustRegion ())",
"title": "Example"
},
{
"location": "/algo/newton_trust_region/#references",
"text": "[1] Nocedal, Jorge, and Stephen Wright. Numerical optimization. Springer Science Business Media, 2006.",
"title": "References"
},
{
"location": "/algo/autodiff/",
"text": "Automatic Differentiation\n\n\nAs mentioned in the \nMinimizing a function\n section, it is possible to avoid passing gradients even when using gradient based methods. This is because Optim will call the finite central differences functionality in \nCalculus.jl\n in those cases. The advantages are clear: you do not have to write the gradients yourself, and it works for any function you can pass to Optim. However, there is another good way of making the computer provide gradients: automatic differentiation. Again, the advantage is that you can easily get gradients from the objective function alone. As opposed to finite difference, these gradients are exact and we also get Hessians for Newton's method. They can perform better than a finite differences scheme, depending on the exact problem. The disadvantage is that the objective function has to be written using only Julia code, so no calls to BLAS or Fortran functions.\n\n\nLet us consider the Rosenbrock example again.\n\n\nfunction\n \nf\n(\nx\n)\n\n \nreturn\n \n(\n1.0\n \n-\n \nx\n[\n1\n])\n^\n2\n \n+\n \n100.0\n \n*\n \n(\nx\n[\n2\n]\n \n-\n \nx\n[\n1\n]\n^\n2\n)\n^\n2\n\n\nend\n\n\n\nfunction\n \ng!\n(\nstorage\n,\n \nx\n)\n\n \nstorage\n[\n1\n]\n \n=\n \n-\n2.0\n \n*\n \n(\n1.0\n \n-\n \nx\n[\n1\n])\n \n-\n \n400.0\n \n*\n \n(\nx\n[\n2\n]\n \n-\n \nx\n[\n1\n]\n^\n2\n)\n \n*\n \nx\n[\n1\n]\n\n \nstorage\n[\n2\n]\n \n=\n \n200.0\n \n*\n \n(\nx\n[\n2\n]\n \n-\n \nx\n[\n1\n]\n^\n2\n)\n\n\nend\n\n\n\nfunction\n \nh!\n(\nstorage\n,\n \nx\n)\n\n \nstorage\n[\n1\n,\n \n1\n]\n \n=\n \n2.0\n \n-\n \n400.0\n \n*\n \nx\n[\n2\n]\n \n+\n \n1200.0\n \n*\n \nx\n[\n1\n]\n^\n2\n\n \nstorage\n[\n1\n,\n \n2\n]\n \n=\n \n-\n400.0\n \n*\n \nx\n[\n1\n]\n\n \nstorage\n[\n2\n,\n \n1\n]\n \n=\n \n-\n400.0\n \n*\n \nx\n[\n1\n]\n\n \nstorage\n[\n2\n,\n \n2\n]\n \n=\n \n200.0\n\n\nend\n\n\n\ninitial_x\n \n=\n \nzeros\n(\n2\n)\n\n\n\n\n\n\nLet us see if BFGS and Newton's Method can solve this problem with the functions 
provided.\n\n\njulia>\n \nOptim\n.\nminimizer\n(\noptimize\n(\nf\n,\n \ng!\n,\n \nh!\n,\n \ninitial_x\n,\n \nBFGS\n()))\n\n\n2-element Array{Float64,1}:\n\n\n 1.0\n\n\n 1.0\n\n\n\njulia>\n \nOptim\n.\nminimizer\n(\noptimize\n(\nf\n,\n \ng!\n,\n \nh!\n,\n \ninitial_x\n,\n \nNewton\n()))\n\n\n\n2-element Array{Float64,1}:\n\n\n 1.0\n\n\n 1.0\n\n\n\n\n\n\nThis is indeed the case. Now let us use finite differences for BFGS.\n\n\njulia>\n \nOptim\n.\nminimizer\n(\noptimize\n(\nf\n,\n \ninitial_x\n,\n \nBFGS\n()))\n\n\n2-element Array{Float64,1}:\n\n\n 1.0\n\n\n 1.0\n\n\n\n\n\n\nStill looks good. Returning to automatic differentiation, let us try both solvers using this method. We enable \nforward mode\n automatic differentiation by adding \nautodiff = :forward\n when we construct an \nOnceDifferentiable\n instance.\n\n\njulia>\n \nod\n \n=\n \nOnceDifferentiable\n(\nf\n,\n \ninitial_x\n;\n \nautodiff\n \n=\n \n:\nforward\n);\n\n\n\njulia>\n \nOptim\n.\nminimizer\n(\noptimize\n(\nod\n,\n \ninitial_x\n,\n \nBFGS\n()))\n\n\n2-element Array{Float64,1}:\n\n\n 1.0\n\n\n 1.0\n\n\n\njulia>\n \ntd\n \n=\n \nTwiceDifferentiable\n(\nf\n,\n \ninitial_x\n;\n \nautodiff\n \n=\n \n:\nforward\n)\n\n\n\njulia>\n \nOptim\n.\nminimizer\n(\noptimize\n(\ntd\n,\n \ninitial_x\n,\n \nNewton\n()))\n\n\n2-element Array{Float64,1}:\n\n\n 1.0\n\n\n 1.0\n\n\n\n\n\n\nIndeed, the minimizer was found, without providing any gradients or Hessians.",
"title": "Automatic Differentiation"
},
{
"location": "/algo/autodiff/#automatic-differentiation",
"text": "As mentioned in the Minimizing a function section, it is possible to avoid passing gradients even when using gradient based methods. This is because Optim will call the finite central differences functionality in Calculus.jl in those cases. The advantages are clear: you do not have to write the gradients yourself, and it works for any function you can pass to Optim. However, there is another good way of making the computer provide gradients: automatic differentiation. Again, the advantage is that you can easily get gradients from the objective function alone. As opposed to finite difference, these gradients are exact and we also get Hessians for Newton's method. They can perform better than a finite differences scheme, depending on the exact problem. The disadvantage is that the objective function has to be written using only Julia code, so no calls to BLAS or Fortran functions. Let us consider the Rosenbrock example again. function f ( x ) \n return ( 1.0 - x [ 1 ]) ^ 2 + 100.0 * ( x [ 2 ] - x [ 1 ] ^ 2 ) ^ 2 end function g! ( storage , x ) \n storage [ 1 ] = - 2.0 * ( 1.0 - x [ 1 ]) - 400.0 * ( x [ 2 ] - x [ 1 ] ^ 2 ) * x [ 1 ] \n storage [ 2 ] = 200.0 * ( x [ 2 ] - x [ 1 ] ^ 2 ) end function h! ( storage , x ) \n storage [ 1 , 1 ] = 2.0 - 400.0 * x [ 2 ] + 1200.0 * x [ 1 ] ^ 2 \n storage [ 1 , 2 ] = - 400.0 * x [ 1 ] \n storage [ 2 , 1 ] = - 400.0 * x [ 1 ] \n storage [ 2 , 2 ] = 200.0 end initial_x = zeros ( 2 ) Let us see if BFGS and Newton's Method can solve this problem with the functions provided. julia Optim . minimizer ( optimize ( f , g! , h! , initial_x , BFGS ())) 2-element Array{Float64,1}: 1.0 1.0 julia Optim . minimizer ( optimize ( f , g! , h! , initial_x , Newton ())) 2-element Array{Float64,1}: 1.0 1.0 This is indeed the case. Now let us use finite differences for BFGS. julia Optim . minimizer ( optimize ( f , initial_x , BFGS ())) 2-element Array{Float64,1}: 1.0 1.0 Still looks good. 
Returning to automatic differentiation, let us try both solvers using this method. We enable forward mode automatic differentiation by adding autodiff = :forward when we construct an OnceDifferentiable instance. julia> od = OnceDifferentiable ( f , initial_x ; autodiff = : forward ); julia> Optim . minimizer ( optimize ( od , initial_x , BFGS ())) 2-element Array{Float64,1}: 1.0 1.0 julia> td = TwiceDifferentiable ( f , initial_x ; autodiff = : forward ) julia> Optim . minimizer ( optimize ( td , initial_x , Newton ())) 2-element Array{Float64,1}: 1.0 1.0 Indeed, the minimizer was found, without providing any gradients or Hessians.",
"title": "Automatic Differentiation"
},
{
"location": "/algo/linesearch/",
"text": "Line search\n\n\n\n\nDescription\n\n\nThe line search functionality has been moved to \nLineSearches.jl\n.\n\n\nLine search is used to decide the step length along the direction computed by an optimization algorithm.\n\n\nThe following \nOptim\n algorithms use line search:\n\n\n\n\nAccelerated Gradient Descent\n\n\n(L-)BFGS\n\n\nConjugate Gradient\n\n\nGradient Descent\n\n\nMomentum Gradient Descent\n\n\nNewton\n\n\n\n\nBy default \nOptim\n calls the line search algorithm \nHagerZhang()\n provided by \nLineSearches\n. Different line search algorithms can be assigned with the \nlinesearch\n keyword argument to the given algorithm.\n\n\nLineSearches\n also allows the user to decide how the initial step length for the line search algorithm is chosen. This is set with the \nalphaguess\n keyword argument for the \nOptim\n algorithm. The default procedure varies.\n\n\n\n\nExample\n\n\nThis example compares two different line search algorithms on the Rosenbrock problem.\n\n\nFirst, run \nNewton\n with the default line search algorithm:\n\n\nusing\n \nOptim\n,\n \nLineSearches\n\n\nprob\n \n=\n \nOptim\n.\nUnconstrainedProblems\n.\nexamples\n[\nRosenbrock\n]\n\n\n\nalgo_hz\n \n=\n \nNewton\n(;\nalphaguess\n \n=\n \nLineSearches\n.\nInitialStatic\n(),\n \nlinesearch\n \n=\n \nLineSearches\n.\nHagerZhang\n())\n\n\nres_hz\n \n=\n \nOptim\n.\noptimize\n(\nprob\n.\nf\n,\n \nprob\n.\ng!\n,\n \nprob\n.\nh!\n,\n \nprob\n.\ninitial_x\n,\n \nmethod\n=\nalgo_hz\n)\n\n\n\n\n\n\nThis gives the result\n\n\n \n*\n \nAlgorithm\n:\n \nNewton\ns\n \nMethod\n\n \n*\n \nStarting\n \nPoint\n:\n \n[\n0.0\n,\n0.0\n]\n\n \n*\n \nMinimizer\n:\n \n[\n0.9999999999999994\n,\n0.9999999999999989\n]\n\n \n*\n \nMinimum\n:\n \n3.081488e-31\n\n \n*\n \nIterations\n:\n \n14\n\n \n*\n \nConvergence\n:\n \ntrue\n\n \n*\n \n|\nx\n \n-\n \nx\n|\n \n \n1.0e-32\n:\n \nfalse\n\n \n|\nx\n \n-\n \nx\n|\n \n=\n \n3.06e-09\n\n \n*\n \n|\nf\n(\nx\n)\n \n-\n \nf\n(\nx\n)\n|\n \n/\n \n|\nf\n(\nx\n)\n|\n \n 
\n1.0e-32\n:\n \nfalse\n\n \n|\nf\n(\nx\n)\n \n-\n \nf\n(\nx'\n)\n|\n \n/\n \n|\nf\n(\nx\n)\n|\n \n=\n \n2.94e+13\n\n \n*\n \n|\ng\n(\nx\n)\n|\n \n<\n \n1.0e-08\n:\n \ntrue\n\n \n|\ng\n(\nx\n)\n|\n \n=\n \n1.11e-15\n\n \n*\n \nstopped\n \nby\n \nan\n \nincreasing\n \nobjective\n:\n \nfalse\n\n \n*\n \nReached\n \nMaximum\n \nNumber\n \nof\n \nIterations\n:\n \nfalse\n\n \n*\n \nObjective\n \nCalls\n:\n \n44\n\n \n*\n \nGradient\n \nCalls\n:\n \n44\n\n \n*\n \nHessian\n \nCalls\n:\n \n14\n\n\n\n\n\n\nNow we can try \nNewton\n with the More-Thuente line search:\n\n\nalgo_mt\n \n=\n \nNewton\n(;\nalphaguess\n \n=\n \nLineSearches\n.\nInitialStatic\n(),\n \nlinesearch\n \n=\n \nLineSearches\n.\nMoreThuente\n())\n\n\nres_mt\n \n=\n \nOptim\n.\noptimize\n(\nprob\n.\nf\n,\n \nprob\n.\ng!\n,\n \nprob\n.\nh!\n,\n \nprob\n.\ninitial_x\n,\n \nmethod\n=\nalgo_mt\n)\n\n\n\n\n\n\nThis gives the following result, reducing the number of function and gradient calls:\n\n\nResults\n \nof\n \nOptimization\n \nAlgorithm\n\n \n*\n \nAlgorithm\n:\n \nNewton's\n \nMethod\n\n \n*\n \nStarting\n \nPoint\n:\n \n[\n0.0\n,\n0.0\n]\n\n \n*\n \nMinimizer\n:\n \n[\n0.9999999999999992\n,\n0.999999999999998\n]\n\n \n*\n \nMinimum\n:\n \n2.032549e-29\n\n \n*\n \nIterations\n:\n \n14\n\n \n*\n \nConvergence\n:\n \ntrue\n\n \n*\n \n|\nx\n \n-\n \nx'\n|\n \n<\n \n1.0e-32\n:\n \nfalse\n\n \n|\nx\n \n-\n \nx'\n|\n \n=\n \n3.67e-08\n\n \n*\n \n|\nf\n(\nx\n)\n \n-\n \nf\n(\nx'\n)\n|\n \n/\n \n|\nf\n(\nx\n)\n|\n \n<\n \n1.0e-32\n:\n \nfalse\n\n \n|\nf\n(\nx\n)\n \n-\n \nf\n(\nx'\n)\n|\n \n/\n \n|\nf\n(\nx\n)\n|\n \n=\n \n1.66e+13\n\n \n*\n \n|\ng\n(\nx\n)\n|\n \n<\n \n1.0e-08\n:\n \ntrue\n\n \n|\ng\n(\nx\n)\n|\n \n=\n \n1.76e-13\n\n \n*\n \nstopped\n \nby\n \nan\n \nincreasing\n \nobjective\n:\n \nfalse\n\n \n*\n \nReached\n \nMaximum\n \nNumber\n \nof\n \nIterations\n:\n \nfalse\n\n \n*\n \nObjective\n \nCalls\n:\n \n17\n\n \n*\n \nGradient\n \nCalls\n:\n \n17\n\n \n*\n \nHessian\n \nCalls\n:\n 
\n14\n\n\n\n\n\n\n\n\nReferences",
"title": "Linesearch"
},
{
"location": "/algo/linesearch/#line-search",
"text": "",
"title": "Line search"
},
{
"location": "/algo/linesearch/#description",
"text": "The line search functionality has been moved to LineSearches.jl . Line search is used to decide the step length along the direction computed by an optimization algorithm. The following Optim algorithms use line search: Accelerated Gradient Descent (L-)BFGS Conjugate Gradient Gradient Descent Momentum Gradient Descent Newton By default Optim calls the line search algorithm HagerZhang() provided by LineSearches . Different line search algorithms can be assigned with the linesearch keyword argument to the given algorithm. LineSearches also allows the user to decide how the initial step length for the line search algorithm is chosen. This is set with the alphaguess keyword argument for the Optim algorithm. The default procedure varies.",
"title": "Description"
},
{
"location": "/algo/linesearch/#example",
"text": "This example compares two different line search algorithms on the Rosenbrock problem. First, run Newton with the default line search algorithm: using Optim , LineSearches prob = Optim . UnconstrainedProblems . examples [ Rosenbrock ] algo_hz = Newton (; alphaguess = LineSearches . InitialStatic (), linesearch = LineSearches . HagerZhang ()) res_hz = Optim . optimize ( prob . f , prob . g! , prob . h! , prob . initial_x , method = algo_hz ) This gives the result * Algorithm : Newton s Method \n * Starting Point : [ 0.0 , 0.0 ] \n * Minimizer : [ 0.9999999999999994 , 0.9999999999999989 ] \n * Minimum : 3.081488e-31 \n * Iterations : 14 \n * Convergence : true \n * | x - x | 1.0e-32 : false \n | x - x | = 3.06e-09 \n * | f ( x ) - f ( x ) | / | f ( x ) | 1.0e-32 : false \n | f ( x ) - f ( x ) | / | f ( x ) | = 2.94e+13 \n * | g ( x ) | 1.0e-08 : true \n | g ( x ) | = 1.11e-15 \n * stopped by an increasing objective : false \n * Reached Maximum Number of Iterations : false \n * Objective Calls : 44 \n * Gradient Calls : 44 \n * Hessian Calls : 14 Now we can try Newton with the More-Thuente line search: algo_mt = Newton (; alphaguess = LineSearches . InitialStatic (), linesearch = LineSearches . MoreThuente ()) res_mt = Optim . optimize ( prob . f , prob . g! , prob . h! , prob . 
initial_x , method = algo_mt ) This gives the following result, reducing the number of function and gradient calls: Results of Optimization Algorithm \n * Algorithm : Newton's Method \n * Starting Point : [ 0.0 , 0.0 ] \n * Minimizer : [ 0.9999999999999992 , 0.999999999999998 ] \n * Minimum : 2.032549e-29 \n * Iterations : 14 \n * Convergence : true \n * | x - x' | < 1.0e-32 : false \n | x - x' | = 3.67e-08 \n * | f ( x ) - f ( x' ) | / | f ( x ) | < 1.0e-32 : false \n | f ( x ) - f ( x' ) | / | f ( x ) | = 1.66e+13 \n * | g ( x ) | < 1.0e-08 : true \n | g ( x ) | = 1.76e-13 \n * stopped by an increasing objective : false \n * Reached Maximum Number of Iterations : false \n * Objective Calls : 17 \n * Gradient Calls : 17 \n * Hessian Calls : 14",
"title": "Example"
},
{
"location": "/algo/linesearch/#references",
"text": "",
"title": "References"
},
{
"location": "/algo/precondition/",
"text": "Preconditioning\n\n\nThe \nGradientDescent\n, \nConjugateGradient\n and \nLBFGS\n methods support preconditioning. A preconditioner can be thought of as a change of coordinates under which the Hessian is better conditioned. With a good preconditioner substantially improved convergence is possible.\n\n\nA preconditioner \nP\ncan be of any type as long as the following two methods are implemented:\n\n\n\n\nA_ldiv_B!(pgr, P, gr)\n : apply \nP\n to a vector \ngr\n and store in \npgr\n (intuitively, \npgr = P \\ gr\n)\n\n\ndot(x, P, y)\n : the inner product induced by \nP\n (intuitively, \ndot(x, P * y)\n)\n\n\n\n\nPrecisely what these operations mean, depends on how \nP\n is stored. Commonly, we store a matrix \nP\n which approximates the Hessian in some vague sense. In this case,\n\n\n\n\nA_ldiv_B!(pgr, P, gr) = copy!(pgr, P \\ A)\n\n\ndot(x, P, y) = dot(x, P * y)\n\n\n\n\nFinally, it is possible to update the preconditioner as the state variable \nx\n changes. This is done through \nprecondprep!\n which is passed to the optimizers as kw-argument, e.g.,\n\n\n \nmethod\n=\nConjugateGradient\n(\nP\n \n=\n \nprecond\n(\n100\n),\n \nprecondprep!\n \n=\n \nprecond\n(\n100\n))\n\n\n\n\n\n\nthough in this case it would always return the same matrix. 
(See \nfminbox.jl\n for a more natural example.)\n\n\nApart from preconditioning with matrices, \nOptim.jl\n provides a type \nInverseDiagonal\n, which represents a diagonal matrix by its inverse elements.\n\n\n\n\nExample\n\n\nBelow, we see an example where a function is minimized without and with a preconditioner applied.\n\n\nusing\n \nOptim\n,\n \nForwardDiff\n\n\ninitial_x\n \n=\n \nzeros\n(\n100\n)\n\n\nplap\n(\nU\n;\n \nn\n \n=\n \nlength\n(\nU\n))\n \n=\n \n(\nn\n-\n1\n)\n*\nsum\n((\n0.1\n \n+\n \ndiff\n(\nU\n)\n.^\n2\n)\n.^\n2\n \n)\n \n-\n \nsum\n(\nU\n)\n \n/\n \n(\nn\n-\n1\n)\n\n\nplap1\n(\nx\n)\n \n=\n \nForwardDiff\n.\ngradient\n(\nplap\n,\nx\n)\n\n\nprecond\n(\nn\n)\n \n=\n \nspdiagm\n((\n-\nones\n(\nn\n-\n1\n),\n \n2\n*\nones\n(\nn\n),\n \n-\nones\n(\nn\n-\n1\n)),\n \n(\n-\n1\n,\n0\n,\n1\n),\n \nn\n,\n \nn\n)\n*\n(\nn\n+\n1\n)\n\n\ndf\n \n=\n \nOnceDifferentiable\n(\nx\n \n->\n \nplap\n([\n0\n;\n \nx\n;\n \n0\n]),\n\n \n(\ng\n,\n \nx\n)\n \n->\n \ncopy!\n(\ng\n,\n \n(\nplap1\n([\n0\n;\n \nx\n;\n \n0\n]))[\n2\n:\nend\n-\n1\n]))\n\n\nresult\n \n=\n \nOptim\n.\noptimize\n(\ndf\n,\n \ninitial_x\n,\n \nmethod\n \n=\n \nConjugateGradient\n(\nP\n \n=\n \nnothing\n))\n\n\nresult\n \n=\n \nOptim\n.\noptimize\n(\ndf\n,\n \ninitial_x\n,\n \nmethod\n \n=\n \nConjugateGradient\n(\nP\n \n=\n \nprecond\n(\n100\n)))\n\n\n\n\n\n\nThe former optimize call converges at a slower rate than the latter. Looking at a plot of the 2D version of the function shows the problem.\n\n\n\n\nThe contours are shaped like ellipses, but we would rather want them to be circles. Using the preconditioner effectively changes the coordinates such that the contours become closer to circular. Benchmarking shows that using preconditioning provides an approximate speed-up factor of 15 in this 100 dimensional case.\n\n\n\n\nReferences",
"title": "Preconditioners"
},
{
"location": "/algo/precondition/#preconditioning",
"text": "The GradientDescent , ConjugateGradient and LBFGS methods support preconditioning. A preconditioner can be thought of as a change of coordinates under which the Hessian is better conditioned. With a good preconditioner substantially improved convergence is possible. A preconditioner P can be of any type as long as the following two methods are implemented: A_ldiv_B!(pgr, P, gr) : apply P to a vector gr and store in pgr (intuitively, pgr = P \\ gr ) dot(x, P, y) : the inner product induced by P (intuitively, dot(x, P * y) ) Precisely what these operations mean, depends on how P is stored. Commonly, we store a matrix P which approximates the Hessian in some vague sense. In this case, A_ldiv_B!(pgr, P, gr) = copy!(pgr, P \\ A) dot(x, P, y) = dot(x, P * y) Finally, it is possible to update the preconditioner as the state variable x changes. This is done through precondprep! which is passed to the optimizers as kw-argument, e.g., method = ConjugateGradient ( P = precond ( 100 ), precondprep! = precond ( 100 )) though in this case it would always return the same matrix. (See fminbox.jl for a more natural example.) Apart from preconditioning with matrices, Optim.jl provides a type InverseDiagonal , which represents a diagonal matrix by its inverse elements.",
"title": "Preconditioning"
},
{
"location": "/algo/precondition/#example",
"text": "Below, we see an example where a function is minimized without and with a preconditioner applied. using ForwardDiff initial_x = zeros ( 100 ) plap ( U ; n = length ( U )) = ( n - 1 ) * sum (( 0.1 + diff ( U ) .^ 2 ) .^ 2 ) - sum ( U ) / ( n - 1 ) plap1 ( x ) = ForwardDiff . gradient ( plap , x ) precond ( n ) = spdiagm (( - ones ( n - 1 ), 2 * ones ( n ), - ones ( n - 1 )), ( - 1 , 0 , 1 ), n , n ) * ( n + 1 ) df = OnceDifferentiable ( x - plap ([ 0 ; x ; 0 ]), \n ( g , x ) - copy! ( g , ( plap1 ([ 0 ; x ; 0 ]))[ 2 : end - 1 ])) result = Optim . optimize ( df , initial_x , method = ConjugateGradient ( P = nothing )) result = Optim . optimize ( df , initial_x , method = ConjugateGradient ( P = precond ( 100 ))) The former optimize call converges at a slower rate than the latter. Looking at a plot of the 2D version of the function shows the problem. The contours are shaped like ellipsoids, but we would rather want them to be circles. Using the preconditioner effectively changes the coordinates such that the contours becomes less ellipsoid-like. Benchmarking shows that using preconditioning provides an approximate speed-up factor of 15 in this 100 dimensional case.",
"title": "Example"
},
{
"location": "/algo/precondition/#references",
"text": "",
"title": "References"
},
{
"location": "/algo/complex/",
"text": "Complex optimization\n\n\nOptimization of functions defined on complex inputs (C^n to R) is supported by simply passing a complex \nx0\n as input. All zeroth and first order optimization algorithms are supported. For now, only explicit gradients are supported.\n\n\nThe gradient of a complex-to-real function is defined as the only vector \ng\n such that \nf(x+h) = f(x) + real(g' * h) + O(h^2)\n. This is sometimes written \ng = df/d(z*) = df/d(re(z)) + i df/d(im(z))\n.\n\n\nThe gradient of a C^n to R function is a C^n to C^n map. Even if it is differentiable when seen as a function of R^2n to R^2n, it might not be complex-differentiable. For instance, take f(z) = Re(z)^2. Then g(z) = 2 Re(z), which is not complex-differentiable (holomorphic). Therefore, the Hessian of a C^n to R function is in general not well-defined as a n x n complex matrix (only as a 2n x 2n real matrix), and therefore second-order optimization algorithms are not applicable directly. To use second-order optimization, convert to real variables.",
"title": "Complex optimization"
},
{
"location": "/algo/complex/#complex-optimization",
"text": "Optimization of functions defined on complex inputs (C^n to R) is supported by simply passing a complex x0 as input. All zeroth and first order optimization algorithms are supported. For now, only explicit gradients are supported. The gradient of a complex-to-real function is defined as the only vector g such that f(x+h) = f(x) + real(g' * h) + O(h^2) . This is sometimes written g = df/d(z*) = df/d(re(z)) + i df/d(im(z)) . The gradient of a C^n to R function is a C^n to C^n map. Even if it is differentiable when seen as a function of R^2n to R^2n, it might not be complex-differentiable. For instance, take f(z) = Re(z)^2. Then g(z) = 2 Re(z), which is not complex-differentiable (holomorphic). Therefore, the Hessian of a C^n to R function is in general not well-defined as a n x n complex matrix (only as a 2n x 2n real matrix), and therefore second-order optimization algorithms are not applicable directly. To use second-order optimization, convert to real variables.",
"title": "Complex optimization"
},
{
"location": "/algo/manifolds/",
"text": "Manifold optimization\n\n\nOptim.jl supports the minimization of functions defined on Riemannian manifolds, i.e. with simple constraints such as normalization and orthogonality. The basic idea of such algorithms is to project back (\"retract\") each iterate of an unconstrained minimization method onto the manifold. This is used by passing a \nmanifold\n keyword argument to the optimizer.\n\n\n\n\nHowto\n\n\nHere is a simple test case where we minimize the Rayleigh quotient \nx, A x\n of a symmetric matrix \nA\n under the constraint \n||x|| = 1\n, finding an eigenvector associated with the lowest eigenvalue of \nA\n.\n\n\nn\n \n=\n \n10\n\n\nA\n \n=\n \nDiagonal\n(\nlinspace\n(\n1\n,\n2\n,\nn\n))\n\n\nf\n(\nx\n)\n \n=\n \nvecdot\n(\nx\n,\nA\n*\nx\n)\n/\n2\n\n\ng\n(\nx\n)\n \n=\n \nA\n*\nx\n\n\ng!\n(\nstor\n,\nx\n)\n \n=\n \ncopy!\n(\nstor\n,\ng\n(\nx\n))\n\n\nx0\n \n=\n \nrandn\n(\nn\n)\n\n\n\nmanif\n \n=\n \nOptim\n.\nSphere\n()\n\n\nOptim\n.\noptimize\n(\nf\n,\n \ng!\n,\n \nx0\n,\n \nOptim\n.\nConjugateGradient\n(\nmanifold\n=\nmanif\n))\n\n\n\n\n\n\n\n\nSupported solvers and manifolds\n\n\nAll first-order optimization methods are supported.\n\n\nThe following manifolds are currently supported:\n\n\n\n\nFlat: Euclidean space, default. Standard unconstrained optimization.\n\n\nSphere: spherical constraint \n||x|| = 1\n\n\nStiefel: Stiefel manifold of N by n matrices with orthogonal columns, i.e. \nX'*X = I\n\n\n\n\nThe following meta-manifolds construct manifolds out of pre-existing ones:\n\n\n\n\nPowerManifold: identical copies of a specified manifold\n\n\nProductManifold: product of two (potentially different) manifolds\n\n\n\n\nSee \ntest/multivariate/manifolds.jl\n for usage examples.\n\n\nImplementing new manifolds is as simple as adding methods \nproject_tangent!(M::YourManifold,x)\n and \nretract!(M::YourManifold,g,x)\n. 
If you implement another manifold or optimization method, please contribute a PR!\n\n\n\n\nReferences\n\n\nThe Geometry of Algorithms with Orthogonality Constraints, Alan Edelman, Tom\u00e1s A. Arias, Steven T. Smith, SIAM. J. Matrix Anal. \n Appl., 20(2), 303\u2013353\n\n\nOptimization Algorithms on Matrix Manifolds, P.-A. Absil, R. Mahony, R. Sepulchre, Princeton University Press, 2008",
"title": "Manifolds"
},
{
"location": "/algo/manifolds/#manifold-optimization",
"text": "Optim.jl supports the minimization of functions defined on Riemannian manifolds, i.e. with simple constraints such as normalization and orthogonality. The basic idea of such algorithms is to project back (\"retract\") each iterate of an unconstrained minimization method onto the manifold. This is used by passing a manifold keyword argument to the optimizer.",
"title": "Manifold optimization"
},
{
"location": "/algo/manifolds/#howto",
"text": "Here is a simple test case where we minimize the Rayleigh quotient x, A x of a symmetric matrix A under the constraint ||x|| = 1 , finding an eigenvector associated with the lowest eigenvalue of A . n = 10 A = Diagonal ( linspace ( 1 , 2 , n )) f ( x ) = vecdot ( x , A * x ) / 2 g ( x ) = A * x g! ( stor , x ) = copy! ( stor , g ( x )) x0 = randn ( n ) manif = Optim . Sphere () Optim . optimize ( f , g! , x0 , Optim . ConjugateGradient ( manifold = manif ))",
"title": "Howto"
},
{
"location": "/algo/manifolds/#supported-solvers-and-manifolds",
"text": "All first-order optimization methods are supported. The following manifolds are currently supported: Flat: Euclidean space, default. Standard unconstrained optimization. Sphere: spherical constraint ||x|| = 1 Stiefel: Stiefel manifold of N by n matrices with orthogonal columns, i.e. X'*X = I The following meta-manifolds construct manifolds out of pre-existing ones: PowerManifold: identical copies of a specified manifold ProductManifold: product of two (potentially different) manifolds See test/multivariate/manifolds.jl for usage examples. Implementing new manifolds is as simple as adding methods project_tangent!(M::YourManifold,x) and retract!(M::YourManifold,g,x) . If you implement another manifold or optimization method, please contribute a PR!",
"title": "Supported solvers and manifolds"
},
{
"location": "/algo/manifolds/#references",
"text": "The Geometry of Algorithms with Orthogonality Constraints, Alan Edelman, Tom\u00e1s A. Arias, Steven T. Smith, SIAM. J. Matrix Anal. Appl., 20(2), 303\u2013353 Optimization Algorithms on Matrix Manifolds, P.-A. Absil, R. Mahony, R. Sepulchre, Princeton University Press, 2008",
"title": "References"
},
{
"location": "/dev/contributing/",
"text": "Notes for contributing\n\n\nWe are always happy to get help from people who normally do not contribute to the package. However, to make the process run smoothly, we ask you to read this page before creating your pull request. That way it is more probable that your changes will be incorporated, and in the end it will mean less work for everyone.\n\n\n\n\nThings to consider\n\n\nWhen proposing a change to \nOptim.jl\n, there are a few things to consider. If you're in doubt feel free to reach out. A simple way to get in touch, is to join our \ngitter channel\n.\n\n\nBefore submitting a pull request, please consider the following bullets:\n\n\n\n\nDid you remember to provide tests for your changes? If not, please do so, or ask for help.\n\n\nDid your change add new functionality? Remember to add a section in the documentation.\n\n\nDid you change existing code in a breaking way? Then remember to use Julia's deprecation tools to help users migrate to the new syntax.\n\n\nAdd a note in the NEWS.md file, so we can keep track of changes between versions.\n\n\n\n\n\n\nAdding a solver\n\n\nIf you're contributing a new solver, you shouldn't need to touch any of the code in \nsrc/optimize.jl\n. You should rather add a file named (\nsolver\n is the name of the solver) \nsolver.jl\n in \nsrc\n, and make sure that you define an \nOptimizer\n subtype \nstruct Solver \n: Optimizer end\n with appropriate fields, a default constructor with a keyword for each field, a state type that holds all variables that are (re)used throughout the iterative procedure, an \ninitial_state\n that initializes such a state, and an \nupdate!\n method that does the actual work. 
Say you want to contribute a solver called \nMinim\n, then your \nsrc/minim.jl\n file would look something like\n\n\nstruct\n \nMinim\n{\nIF\n,\n \nF\n:\nFunction\n,\n \nT\n}\n \n:\n \nOptimizer\n\n \nalphaguess\n!::\nIF\n\n \nlinesearch\n!::\nF\n\n \nminim_parameter\n::\nT\n\n\nend\n\n\n\nMinim\n(;\n \nalphaguess\n \n=\n \nLineSearches\n.\nInitialStatic\n(),\n \nlinesearch\n \n=\n \nLineSearches\n.\nHagerZhang\n(),\n \nminim_parameter\n \n=\n \n1.0\n)\n \n=\n\n \nMinim\n(\nlinesearch\n,\n \nminim_parameter\n)\n\n\n\ntype\n \nMinimState\n{\nT\n,\nN\n,\nG\n}\n\n \nx\n::\nArray\n{\nT\n,\nN\n}\n\n \nx_previous\n::\nArray\n{\nT\n,\nN\n}\n\n \nf_x_previous\n::\nT\n\n \ns\n::\nArray\n{\nT\n,\nN\n}\n\n \n@add_linesearch_fields\n()\n\n\nend\n\n\n\nfunction\n \ninitial_state\n(\nmethod\n::\nMinim\n,\n \noptions\n,\n \nd\n,\n \ninitial_x\n)\n\n\n#\n \nprepare\n \ncache\n \nvariables\n \netc\n \nhere\n\n\n\nend\n\n\n\nfunction\n \nupdate\n!\n{\nT\n}(\nd\n,\n \nstate\n::\nMinimState\n{\nT\n},\n \nmethod\n::\nMinim\n)\n\n \n#\n \ncode\n \nfor\n \nMinim\n \nhere\n\n \nfalse\n \n#\n \nshould\n \nthe\n \nprocedure\n \nforce\n \nquit\n?\n\n\nend",
"title": "Contributing"
},
{
"location": "/dev/contributing/#notes-for-contributing",
"text": "We are always happy to get help from people who normally do not contribute to the package. However, to make the process run smoothly, we ask you to read this page before creating your pull request. That way it is more probable that your changes will be incorporated, and in the end it will mean less work for everyone.",
"title": "Notes for contributing"
},
{
"location": "/dev/contributing/#things-to-consider",
"text": "When proposing a change to Optim.jl , there are a few things to consider. If you're in doubt feel free to reach out. A simple way to get in touch, is to join our gitter channel . Before submitting a pull request, please consider the following bullets: Did you remember to provide tests for your changes? If not, please do so, or ask for help. Did your change add new functionality? Remember to add a section in the documentation. Did you change existing code in a breaking way? Then remember to use Julia's deprecation tools to help users migrate to the new syntax. Add a note in the NEWS.md file, so we can keep track of changes between versions.",
"title": "Things to consider"
},
{
"location": "/dev/contributing/#adding-a-solver",
"text": "If you're contributing a new solver, you shouldn't need to touch any of the code in src/optimize.jl . You should rather add a file named ( solver is the name of the solver) solver.jl in src , and make sure that you define an Optimizer subtype struct Solver : Optimizer end with appropriate fields, a default constructor with a keyword for each field, a state type that holds all variables that are (re)used throughout the iterative procedure, an initial_state that initializes such a state, and an update! method that does the actual work. Say you want to contribute a solver called Minim , then your src/minim.jl file would look something like struct Minim { IF , F : Function , T } : Optimizer \n alphaguess !:: IF \n linesearch !:: F \n minim_parameter :: T end Minim (; alphaguess = LineSearches . InitialStatic (), linesearch = LineSearches . HagerZhang (), minim_parameter = 1.0 ) = \n Minim ( linesearch , minim_parameter ) type MinimState { T , N , G } \n x :: Array { T , N } \n x_previous :: Array { T , N } \n f_x_previous :: T \n s :: Array { T , N } \n @add_linesearch_fields () end function initial_state ( method :: Minim , options , d , initial_x ) # prepare cache variables etc here end function update ! { T }( d , state :: MinimState { T }, method :: Minim ) \n # code for Minim here \n false # should the procedure force quit ? end",
"title": "Adding a solver"
},
{
"location": "/LICENSE/",
"text": "Optim.jl is licensed under the MIT License:\n\n\nCopyright (c) 2012: John Myles White and other contributors.\n\nPermission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \nSoftware\n), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \nAS IS\n, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.",
"title": "License"
}
]
}