## Optimization loops

During this lesson, we'll learn how to use an *optimizer* to iteratively explore our ansatz's parameterized quantum states:

- Bootstrap an optimization loop
- Understand tradeoffs while using local and global optimizers
- Explore barren plateaus and how to avoid them

At a high level, optimizers are central to exploring our search space. The optimizer uses cost function evaluations to select the next set of parameters in a variational loop, and repeats the process until it reaches a stable state. At this stage, an optimal set of parameter values $\vec\theta^*$ is returned.

![Optimization Workflow](images/optimization_workflow.png)
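The loop described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `cost` is a toy classical function standing in for a cost evaluated on a quantum device, and `optimization_loop` is a hypothetical helper using a simple random-proposal rule rather than any particular library optimizer.

```python
import math
import random


def cost(theta):
    # Toy stand-in (assumption) for a cost evaluated on a quantum device.
    # Its minimum value is -2.0, reached for example at theta = (pi, pi).
    return math.cos(theta[0]) + math.cos(theta[1])


def optimization_loop(cost_fn, theta0, step=0.5, tol=1e-6, max_iters=10_000, seed=0):
    """Propose parameters, evaluate the cost, and repeat until stable."""
    rng = random.Random(seed)
    theta, best = list(theta0), cost_fn(theta0)
    for _ in range(max_iters):
        # Optimizer step: propose the next set of parameters.
        candidate = [t + rng.uniform(-step, step) for t in theta]
        value = cost_fn(candidate)  # cost function evaluation
        if value < best:
            theta, best = candidate, value  # keep the improvement
        else:
            step *= 0.99  # failed proposal: search closer to theta
        if step < tol:  # stable state: proposals barely move any more
            break
    return theta, best


theta_star, cost_star = optimization_loop(cost, [0.1, 0.2])
print(theta_star, cost_star)
```

The structure is the part that carries over to a real workload: the optimizer only ever sees cost values, so swapping the toy `cost` for a function that runs a circuit would leave the loop unchanged.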

## Local and Global Optimizers

### Local Optimizers

Local optimizers search for a point that minimizes the cost function, starting from one or more initial points $\vec\theta_0$ with cost $C(\vec\theta_0)$ and moving to different points on successive iterations based on what they observe in their current region. As a result, these algorithms usually converge quickly, but the result can depend heavily on the initial point. Local optimizers cannot see beyond the region they are evaluating and are especially vulnerable to local minima: they report convergence when they find one, ignoring other states with more favorable evaluations.
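To make the dependence on the initial point concrete, here is a small sketch. The tilted double-well `cost` and the neighbourhood search `local_search` are illustrative assumptions, not part of any library; the search only compares the current point with its two immediate neighbours.

```python
def cost(theta):
    # Toy one-parameter cost (assumption) with two minima:
    # a local one near theta = 1.13 and the global one near theta = -1.30.
    return theta**4 - 3 * theta**2 + theta


def local_search(cost_fn, theta0, step=0.1, tol=1e-8):
    theta, value = theta0, cost_fn(theta0)
    while step > tol:
        # Only look at the immediate neighbourhood of the current point.
        best = min((theta - step, theta + step), key=cost_fn)
        if cost_fn(best) < value:
            theta, value = best, cost_fn(best)
        else:
            step /= 2  # no better neighbour: refine the search radius
    return theta, value


# Same optimizer, two different initial points, two different answers.
print(local_search(cost, 2.0))   # settles in the local minimum
print(local_search(cost, -2.0))  # settles in the global minimum
```

Both runs report convergence, yet only the second one reaches the lower of the two minima.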

### Global Optimizers

Global optimizers search for the point that minimizes the cost function over several regions of its domain (i.e. non-locally). At each iteration $i$, they evaluate the cost function over a set of parameter vectors $\Theta_i := \{\vec\theta_{i,j} \; | \; j \in \mathcal{J}_\text{opt}^i\}$ determined by the optimizer. This makes them less susceptible to local minima and somewhat independent of initialization, but also significantly slower to converge to a proposed solution.
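As a rough sketch of this idea, the snippet below evaluates a whole set of parameter values at every iteration. It uses plain uniform random sampling, chosen for brevity rather than as a recommended global optimizer; the toy `cost`, the bounds, and the name `global_random_search` are assumptions made for illustration.

```python
import random


def cost(theta):
    # Toy one-parameter cost (assumption) with a local minimum near 1.13
    # and the global minimum near -1.30.
    return theta**4 - 3 * theta**2 + theta


def global_random_search(cost_fn, bounds, iterations=20, population=25, seed=0):
    rng = random.Random(seed)
    low, high = bounds
    best_theta, best_value, evaluations = None, float("inf"), 0
    for _ in range(iterations):
        # Theta_i: the set of parameter values chosen for this iteration.
        thetas = [rng.uniform(low, high) for _ in range(population)]
        for theta in thetas:
            value = cost_fn(theta)
            evaluations += 1
            if value < best_value:
                best_theta, best_value = theta, value
    return best_theta, best_value, evaluations


theta, value, evaluations = global_random_search(cost, (-2.5, 2.5))
print(theta, value, evaluations)
```

Because the samples cover the whole interval, the result does not depend on a starting point, but the price is 500 cost evaluations for a coarse answer to a one-parameter problem.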

### Bootstrapping Optimization

*Bootstrapping*, or setting the initial value for parameters $\vec\theta$ based on a prior optimization, can help our optimizer converge on a solution faster. We refer to these initial values as the _initial point_ $\vec\theta_0$, and to $|\psi(\vec\theta_0)\rangle = U_V(\vec\theta_0)|\rho\rangle$ as the _initial state_. This initial state differs from our *reference state* $|\rho\rangle$, as the former focuses on initial parameters set during our optimization loop, while the latter focuses on using known "reference" solutions. They may coincide if $U_V(\vec\theta_0) \equiv I$ (i.e. the identity operation).

When local optimizers converge to non-optimal local minima, we can try bootstrapping the optimization globally and then refining the convergence locally. While this requires setting up two variational workloads, it allows the optimizer to find a better solution than the local optimizer alone.
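A minimal sketch of this two-stage strategy, under the same kind of assumptions as before: a toy double-well `cost`, a coarse random sampling stage standing in for the global workload, and a simple neighbourhood search standing in for the local one. All function names are hypothetical.

```python
import random


def cost(theta):
    # Toy one-parameter cost (assumption) with a local minimum near 1.13
    # and the global minimum near -1.30.
    return theta**4 - 3 * theta**2 + theta


def coarse_global_search(cost_fn, bounds, samples=50, seed=0):
    # Stage 1: a small budget of samples spread over the whole domain.
    rng = random.Random(seed)
    candidates = [rng.uniform(*bounds) for _ in range(samples)]
    return min(candidates, key=cost_fn)


def local_refine(cost_fn, theta0, step=0.05, tol=1e-8):
    # Stage 2: refine by comparing the current point with its neighbours.
    theta, value = theta0, cost_fn(theta0)
    while step > tol:
        best = min((theta - step, theta + step), key=cost_fn)
        if cost_fn(best) < value:
            theta, value = best, cost_fn(best)
        else:
            step /= 2
    return theta, value


theta0 = coarse_global_search(cost, (-2.5, 2.5))    # bootstrapped initial point
theta_star, cost_star = local_refine(cost, theta0)  # local refinement
local_only = local_refine(cost, 2.0)                # local search on its own

print("bootstrapped:", theta_star, cost_star)
print("local only:  ", local_only)
```

The global stage only needs to land in the right basin; the local stage then supplies the precision that would be expensive to obtain by sampling alone.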

## Gradient-Based and Gradient-Free Optimizers

### Gradient-Based

Suppose we have access to the gradient of our cost function $C(\vec\theta)$, $\vec{\nabla} C(\vec\theta)$, and start from an initial point. The simplest way to minimize the function is to update the parameters in the direction of steepest descent: $\vec\theta_{n+1} = \vec\theta_n - \eta \vec{\nabla} C(\vec\theta_n)$, where $\eta$ is the learning rate, a small, positive [hyperparameter](gloss:hyperparameter) controlling the size of the update. We continue doing this until we converge to a [local minimum](gloss:local-minimum) of the cost function, $C({\vec\theta^*})$. [`qiskit.algorithms`](https://qiskit.org/documentation/stubs/qiskit.algorithms.gradients.html) provides several different methods to compute gradients, and you can learn more about them [here](https://learn.qiskit.org/course/machine-learning/training-quantum-circuits#training-2-0). The main disadvantages of this type of optimization are that convergence can be very slow and that there is no guarantee of reaching the optimal solution.

![graph of f(theta) against theta, multiple dots show different states of a gradient descent algorithm finding the minimum of a curve.](images/optimization_gradient_descent.svg)
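The update rule can be written out directly. The sketch below assumes a toy two-parameter cost with a known analytic gradient; on a quantum device the gradient would have to be estimated from circuit evaluations instead. The names `cost`, `gradient`, and `gradient_descent` are illustrative.

```python
import math


def cost(theta):
    # Toy cost (assumption): minimum value -2.0 at theta = (pi, pi).
    return math.cos(theta[0]) + math.cos(theta[1])


def gradient(theta):
    # Analytic gradient of the toy cost above.
    return [-math.sin(theta[0]), -math.sin(theta[1])]


def gradient_descent(theta0, learning_rate=0.1, tol=1e-8, max_iters=5_000):
    theta = list(theta0)
    for iteration in range(max_iters):
        grad = gradient(theta)
        if math.hypot(*grad) < tol:  # gradient has vanished: converged
            break
        # theta_{n+1} = theta_n - eta * grad C(theta_n)
        theta = [t - learning_rate * g for t, g in zip(theta, grad)]
    return theta, cost(theta), iteration


theta_star, cost_star, iterations = gradient_descent([1.0, 2.0])
print(theta_star, cost_star, iterations)
```

Changing `learning_rate` is a quick way to see the tradeoff mentioned above: smaller values need many more iterations, while overly large values can overshoot the minimum.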

### Gradient-Free

Gradient-free optimization algorithms do not require gradient information and can be useful in situations where the gradient is difficult or expensive to compute (or too noisy). They also tend to be more robust at finding global optima, whereas gradient-based methods tend to converge to local optima. We'll explore a few instances where a gradient-free optimizer can help avoid barren plateaus. However, gradient-free methods require more computational resources, especially for problems with high-dimensional search spaces.
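The following sketch shows one simple gradient-free method, a compass (coordinate) search, and counts how many cost evaluations it spends as the number of parameters grows. It is not the algorithm behind any specific library optimizer; the toy `cost` and the helper `compass_search` are assumptions for illustration.

```python
import math


def cost(theta):
    # Toy cost (assumption): zero at theta = (0, ..., 0), positive nearby.
    return sum(1 - math.cos(t) for t in theta)


def compass_search(cost_fn, theta0, step=0.5, tol=1e-4):
    theta = list(theta0)
    value = cost_fn(theta)
    evaluations = 1
    while step > tol:
        improved = False
        for i in range(len(theta)):
            # Probe each parameter in both directions using cost values only.
            for delta in (step, -step):
                trial = theta.copy()
                trial[i] += delta
                trial_value = cost_fn(trial)
                evaluations += 1
                if trial_value < value:
                    theta, value, improved = trial, trial_value, True
                    break
        if not improved:
            step /= 2  # nothing helped at this scale: look closer
    return theta, value, evaluations


for num_params in (2, 4, 8):
    _, value, evaluations = compass_search(cost, [1.0] * num_params)
    print(num_params, value, evaluations)
```

Each sweep costs up to two evaluations per parameter, so the evaluation count rises with the dimension of the search space even for this well-behaved cost.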

## Barren Plateaus

In practice, the cost landscape can be quite complicated, as shown by the hills and valleys in the example below. The optimization method navigates the cost landscape in search of the minimum, as shown by the black points and lines. We can see that two of the three searches end up in a local minimum of the landscape rather than the global one.

![Cost Landscape](images/optimization_loss_landscape.png)

Regardless of the type of optimization method used, if the cost landscape is relatively flat, it can be difficult for the method to determine which direction to search. This situation is called a [barren plateau](gloss:barren-plateaus): the cost landscape becomes increasingly flat, making it hard to determine the direction to the minimum. For a wide class of reasonable parameterized quantum circuits, the probability that the gradient along any reasonable direction is non-zero to some fixed precision is exponentially small as a function of the number of qubits.

![Barren Plateaus](images/optimization_barren_plateaus.png)
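The scaling with qubit count can be illustrated with a deliberately simple toy model, which is an assumption of this sketch and not the general result for random circuits. Suppose each of $n$ qubits is prepared with $R_Y(\theta_i)$ acting on $|0\rangle$ and we measure $Z^{\otimes n}$, so that the cost is $C(\vec\theta) = \prod_i \cos\theta_i$. For parameters drawn uniformly from $[0, 2\pi)$, the variance of $\partial C / \partial\theta_0$ works out to $2^{-n}$, and the snippet estimates it by sampling.

```python
import math
import random


def gradient_first_param(theta):
    # d/d(theta_0) of C(theta) = prod_i cos(theta_i)
    return -math.sin(theta[0]) * math.prod(math.cos(t) for t in theta[1:])


def gradient_variance(num_qubits, samples=20_000, seed=0):
    # Estimate the variance of the gradient over random initial parameters.
    rng = random.Random(seed)
    grads = []
    for _ in range(samples):
        theta = [rng.uniform(0, 2 * math.pi) for _ in range(num_qubits)]
        grads.append(gradient_first_param(theta))
    mean = sum(grads) / samples
    return sum((g - mean) ** 2 for g in grads) / samples


for n in (2, 4, 6, 8, 10):
    print(n, gradient_variance(n), 2.0 ** -n)  # estimate vs. 2^-n
```

Even in this product-state example, a randomly chosen starting point gives a gradient that shrinks exponentially with the number of qubits, which is why the choice of initial point and cost function matters so much.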

While this is still an area of active research, we have a few recommendations:

- **Bootstrapping** helps the optimization loop avoid getting stuck in a region of parameter space where the gradient is small.
- **Experimenting with hardware-efficient ansatz**: as we're using a noisy quantum system as a *black-box oracle*, the _quality_ of those evaluations will affect the performance of the optimizer. Using a hardware-efficient ansatz, such as [`EfficientSU2`](https://qiskit.org/documentation/stubs/qiskit.circuit.library.EfficientSU2.html), could avoid producing exponentially small gradients.
- **Experimenting with error suppression and error mitigation**: the Qiskit Runtime Primitives offer a simple interface to experiment with a variety of `optimization_level`s and `resilience_level`s respectively. This can reduce the impact of noise and make the optimization process more efficient.
- **Experimenting with gradient-free optimizers**: Unlike gradient-based optimization algorithms, optimizers such as `COBYLA` do not rely on gradient information to optimize the parameters, and are therefore less likely to be affected by barren plateaus.

With this lesson, you learned how to define your optimization loop:

- Bootstrap an optimization loop
- Understand tradeoffs while using local and global optimizers
- Explore barren plateaus and how to avoid them

Our high-level variational workload is complete:

![Optimization Circuit](images/optimization_circuit.png)

Next, we'll explore specific variational algorithms with this framework in mind.