# Chapter 60: Advanced Optimization Techniques

## Learning Objectives

By the end of this chapter, you will be able to:

- Understand the limitations of basic gradient descent and when advanced optimizers are needed
- Implement and apply gradient descent variants: SGD with momentum, Nesterov accelerated gradient, AdaGrad, RMSProp, and Adam
- Leverage second‑order optimization methods (Newton's method, quasi‑Newton, L‑BFGS) for faster convergence
- Apply adaptive optimization techniques that adjust learning rates per parameter
- Implement distributed optimization for training large models across multiple machines
- Formulate and solve multi‑objective optimization problems (e.g., balancing profit and risk in trading)
- Handle constrained optimization using Lagrange multipliers and projection methods
- Automate hyperparameter optimization using grid search, random search, Bayesian methods, and population‑based training
- Understand neural architecture search (NAS) for finding optimal network structures
- Apply meta‑learning to learn optimizers or initialization strategies
- Recognise practical tips and pitfalls when applying advanced optimizers to the NEPSE prediction system

---

## Introduction

Throughout this handbook, we have trained numerous models for the NEPSE prediction system: linear models, tree‑based ensembles, neural networks, and more. Each of these models was optimised using some form of gradient‑based algorithm, typically stochastic gradient descent (SGD) or one of its variants. But as models become deeper, datasets larger, and objectives more complex, the choice of optimisation algorithm becomes critical. The right optimiser can mean the difference between a model that converges in hours versus days, or one that finds a good local minimum versus getting stuck.

**Optimisation** in machine learning is the process of minimising (or maximising) an objective function, typically the loss function, with respect to the model's parameters. While basic SGD is simple and effective for many problems, advanced techniques can accelerate convergence, handle non‑stationary objectives, escape poor local minima, and scale to massive datasets.

In this chapter, we will explore a spectrum of optimisation techniques, from improved gradient‑based methods to second‑order approaches, distributed optimisation, and even optimisation of the optimisation process itself (meta‑learning). Using the NEPSE prediction system as a motivating example, we will implement these techniques and discuss when and why to use them.

---

## 60.1 Gradient Descent Variants

Gradient descent updates parameters in the direction of the negative gradient of the loss function. The basic update is:

`θ ← θ - η ∇_θ L(θ)`

where `η` is the learning rate. Variants modify this update to improve convergence speed and stability.

### 60.1.1 Stochastic Gradient Descent (SGD) with Momentum

Momentum helps accelerate gradients in the right direction and dampens oscillations. It accumulates a velocity vector:

`v_t = β v_{t-1} + (1 - β) ∇_θ L(θ)`
`θ ← θ - η v_t`

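Typical values: `β = 0.9` or `0.99`. Note that PyTorch's `SGD` uses the un‑normalised form `v_t = β v_{t-1} + ∇_θ L(θ)` by default (dampening of 0), so for the same `η` its steps are roughly `1 / (1 - β)` times larger than with the averaged form above.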

**PyTorch implementation:**

```python
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```

**Explanation:**  
Momentum helps the optimizer build up speed in directions of consistent gradient, smoothing out updates and speeding convergence. For NEPSE training, momentum can help escape shallow local minima and navigate noisy loss landscapes.
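The effect of momentum is easiest to see on an ill‑conditioned quadratic. The following is a minimal sketch, not part of the NEPSE code base: the toy loss `0.5 * (x² + 100 y²)`, the starting point, and both learning rates are assumptions chosen for illustration. Plain gradient descent diverges on this loss above `η = 0.02`, whereas the averaged momentum update from the text remains stable at a much larger `η`.

```python
def grad(theta):
    """Gradient of the toy loss 0.5 * (x**2 + 100 * y**2)."""
    x, y = theta
    return (x, 100.0 * y)


def loss(theta):
    x, y = theta
    return 0.5 * (x * x + 100.0 * y * y)


def minimise(lr, beta, steps=200):
    """Run the update v = beta*v + (1-beta)*g, theta = theta - lr*v."""
    theta = (1.0, 1.0)
    velocity = (0.0, 0.0)
    for _ in range(steps):
        g = grad(theta)
        velocity = tuple(beta * v + (1.0 - beta) * gi
                         for v, gi in zip(velocity, g))
        theta = tuple(t - lr * v for t, v in zip(theta, velocity))
    return theta


# beta = 0 recovers plain gradient descent; lr is just below its limit of 0.02
plain_loss = loss(minimise(lr=0.019, beta=0.0))
# With beta = 0.9 the averaged update tolerates a much larger learning rate
momentum_loss = loss(minimise(lr=0.3, beta=0.9))

print(f"plain GD: {plain_loss:.2e}   momentum: {momentum_loss:.2e}")
```

After the same number of steps, the momentum run ends at a lower loss because it makes faster progress along the flat `x` direction without blowing up along the steep `y` direction.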

### 60.1.2 Nesterov Accelerated Gradient (NAG)

Nesterov momentum looks ahead by computing the gradient at the approximate future position:

`v_t = β v_{t-1} + (1 - β) ∇_θ L(θ - β v_{t-1})`
`θ ← θ - η v_t`

This "peek ahead" often leads to faster convergence and better performance.

```python
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
```

### 60.1.3 AdaGrad

AdaGrad adapts the learning rate per parameter based on the historical sum of squared gradients. Parameters with large gradients get smaller learning rates, and vice versa.

`G_t = G_{t-1} + (∇_θ L(θ))²`
`θ ← θ - η / (√(G_t + ε)) * ∇_θ L(θ)`

AdaGrad works well for sparse features (like one‑hot encoded categories) but the learning rate can become too small over time.

```python
optimizer = optim.Adagrad(model.parameters(), lr=0.01)
```

### 60.1.4 RMSProp

RMSProp modifies AdaGrad to use a moving average of squared gradients, preventing the learning rate from decaying too aggressively.

`E[g²]_t = β E[g²]_{t-1} + (1 - β) (∇_θ L(θ))²`
`θ ← θ - η / √(E[g²]_t + ε) * ∇_θ L(θ)`

Typical `β = 0.9`.

```python
optimizer = optim.RMSprop(model.parameters(), lr=0.001, alpha=0.9)
```
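The difference between the AdaGrad and RMSProp accumulators can be checked numerically. This is a small sketch under a deliberately artificial assumption: the gradient is held constant at 1.0, so only the accumulators change. The function name `effective_lr` is illustrative.

```python
import math


def effective_lr(steps, grad=1.0, lr=0.01, beta=0.9, eps=1e-8):
    """Return the per-parameter step scale of AdaGrad and RMSProp after `steps` updates."""
    g_sum = 0.0  # AdaGrad: running sum of squared gradients, G_t
    g_avg = 0.0  # RMSProp: moving average of squared gradients, E[g^2]_t
    for _ in range(steps):
        g_sum += grad ** 2
        g_avg = beta * g_avg + (1.0 - beta) * grad ** 2
    return lr / math.sqrt(g_sum + eps), lr / math.sqrt(g_avg + eps)


ada_10, rms_10 = effective_lr(10)
ada_1000, rms_1000 = effective_lr(1000)
print(f"AdaGrad: {ada_10:.5f} -> {ada_1000:.5f}")
print(f"RMSProp: {rms_10:.5f} -> {rms_1000:.5f}")
```

AdaGrad's effective learning rate keeps shrinking like `η / √t`, while RMSProp's settles near `η` once the moving average has warmed up.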

### 60.1.5 Adam (Adaptive Moment Estimation)

Adam combines momentum and RMSProp, maintaining both a moving average of gradients (first moment) and squared gradients (second moment).

`m_t = β₁ m_{t-1} + (1 - β₁) ∇_θ L(θ)`
`v_t = β₂ v_{t-1} + (1 - β₂) (∇_θ L(θ))²`
`m̂_t = m_t / (1 - β₁ᵗ)`  (bias correction)
`v̂_t = v_t / (1 - β₂ᵗ)`
`θ ← θ - η m̂_t / (√(v̂_t) + ε)`

Default values: `β₁ = 0.9`, `β₂ = 0.999`, `ε = 1e-8`. Adam is often the default choice for training neural networks.

```python
optimizer = optim.Adam(model.parameters(), lr=0.001)
```
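To make the update equations concrete, here is a minimal scalar sketch of a single Adam step, written from the formulas above (the function name `adam_step` is illustrative; this is not PyTorch's implementation). It shows one consequence of the bias correction: on the first step the update has magnitude close to `η`, whatever the scale of the gradient.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad        # first moment
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second moment
    m_hat = m / (1.0 - beta1 ** t)              # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Two very different gradient scales, same first-step size
small, _, _ = adam_step(theta=0.0, grad=0.01, m=0.0, v=0.0, t=1)
large, _, _ = adam_step(theta=0.0, grad=100.0, m=0.0, v=0.0, t=1)
print(small, large)
```

This scale invariance is one reason the default learning rate of `0.001` works across many models without per‑layer tuning.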

**Which to use for NEPSE?**  
For the LSTM models we built earlier, Adam is a safe starting point. For linear models or simple networks, SGD with momentum may suffice. The choice can be tuned as a hyperparameter.

---

## 60.2 Second‑Order Methods

First‑order methods use only gradient information. Second‑order methods also use curvature information (the Hessian matrix) to make more informed steps, potentially converging in fewer iterations.

### 60.2.1 Newton's Method

Newton's method updates parameters using:

`θ ← θ - H⁻¹ ∇_θ L(θ)`

where `H` is the Hessian matrix (second derivatives). This can give quadratic convergence near the optimum, but computing and inverting the Hessian is expensive (O(n³)) and infeasible for large models.
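On an exactly quadratic loss, a single Newton step lands on the minimum from any starting point. The sketch below illustrates this with a hypothetical two‑parameter loss `0.5 θᵀAθ - bᵀθ`, where `A = [[3, 1], [1, 2]]` and `b = [1, 1]` are made‑up values; the 2×2 linear system is solved directly instead of forming `H⁻¹`.

```python
HESSIAN = ((3.0, 1.0), (1.0, 2.0))  # constant Hessian A of the toy quadratic


def grad(theta):
    """Gradient A @ theta - b with b = (1, 1)."""
    x, y = theta
    return (3.0 * x + 1.0 * y - 1.0, 1.0 * x + 2.0 * y - 1.0)


def solve_2x2(a, rhs):
    """Solve a @ s = rhs for a 2x2 matrix using Cramer's rule."""
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21
    return ((a22 * rhs[0] - a12 * rhs[1]) / det,
            (a11 * rhs[1] - a21 * rhs[0]) / det)


def newton_step(theta):
    """theta <- theta - H^{-1} grad(theta)."""
    step = solve_2x2(HESSIAN, grad(theta))
    return (theta[0] - step[0], theta[1] - step[1])


theta = newton_step((10.0, -7.0))
print(theta, grad(theta))
```

For non‑quadratic losses the Hessian changes with `θ`, so the step must be repeated and is usually combined with a line search or damping.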

### 60.2.2 Quasi‑Newton Methods (BFGS, L‑BFGS)

Quasi‑Newton methods approximate the inverse Hessian using only gradient information, avoiding the O(n³) cost. BFGS (Broyden–Fletcher–Goldfarb–Shanno) builds an approximation iteratively. L‑BFGS (Limited‑memory BFGS) is a memory‑efficient version that stores only a few vectors, making it suitable for large problems.

L‑BFGS is often used for optimisation problems with a small number of iterations, such as training logistic regression or support vector machines.

**Example with `scipy`**

```python
from scipy.optimize import minimize
import numpy as np

# Define loss function and gradient for a simple model
def loss(params, X, y):
    # params: model weights
    y_pred = X @ params
    return np.mean((y - y_pred)**2)

def grad(params, X, y):
    y_pred = X @ params
    return -2 * X.T @ (y - y_pred) / len(y)

# Optimize
X_train, y_train = ...  # numpy arrays
initial_params = np.zeros(X_train.shape[1])
result = minimize(loss, initial_params, args=(X_train, y_train), method='L-BFGS-B', jac=grad)
print(result.x)
```

**Explanation:**  
L‑BFGS is a batch method (uses full dataset) and is well‑suited for convex problems with a moderate number of parameters. For the NEPSE system, it could be used to train linear models or the final layer of a neural network.

---

## 60.3 Adaptive Optimization

Adaptive optimisers (AdaGrad, RMSProp, Adam) are a class of methods that adjust learning rates per parameter based on historical gradients. They are particularly useful when the loss landscape has different curvatures in different directions, which is common in deep learning.

### 60.3.1 Learning Rate Scheduling

Even with adaptive methods, manually annealing the learning rate can improve final performance. Common schedules:

- **Step decay**: Reduce learning rate by a factor every few epochs.
- **Exponential decay**: `η = η₀ * exp(-k t)`
- **Cosine annealing**: `η = η_min + 0.5 (η_max - η_min)(1 + cos(π t / T))`
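The three schedules can be written as plain functions of the epoch or step index. This is a minimal sketch: the function names and default constants (`step_size`, `gamma`, `k`) are illustrative assumptions, not values used elsewhere in the handbook.

```python
import math


def step_decay(lr0, epoch, step_size=10, gamma=0.1):
    """Multiply the learning rate by gamma every step_size epochs."""
    return lr0 * gamma ** (epoch // step_size)


def exponential_decay(lr0, t, k=0.05):
    """Smooth decay: lr0 * exp(-k * t)."""
    return lr0 * math.exp(-k * t)


def cosine_annealing(lr_max, lr_min, t, T):
    """Anneal from lr_max at t = 0 to lr_min at t = T along half a cosine."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / T))


for epoch in (0, 10, 25, 50):
    print(epoch,
          step_decay(0.1, epoch),
          exponential_decay(0.1, epoch),
          cosine_annealing(0.1, 0.001, epoch, T=50))
```

Cosine annealing starts at `η_max`, passes through the midpoint of the range at `t = T / 2`, and reaches `η_min` at `t = T`.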

**PyTorch example with step decay:**

```python
optimizer = optim.Adam(model.parameters(), lr=0.001)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(50):
    train_one_epoch()
    scheduler.step()
```

**Explanation:**  
After every 10 epochs, the learning rate is multiplied by 0.1. This allows large steps early and fine‑tuning later.

### 60.3.2 Warmup

For very deep networks or transformers, a **learning rate warmup** (gradually increasing the learning rate from a small value to the target) prevents early instability.

```python
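def warmup_lambda(epoch):
    warmup_epochs = 5
    if epoch < warmup_epochs:
        # linear increase; (epoch + 1) avoids a learning rate of zero in the first epoch
        return (epoch + 1) / warmup_epochs
    else:
        return 1.0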

scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_lambda)
```

---

## 60.4 Distributed Optimization

When the dataset or model is too large to fit on one machine, we must distribute the optimisation across multiple workers. Two main paradigms exist:

- **Data parallelism**: Each worker has a copy of the model and processes a different subset of data; gradients are aggregated.
- **Model parallelism**: Different parts of the model are placed on different devices (e.g., layers on different GPUs).

### 60.4.1 Synchronous SGD

In synchronous SGD, all workers compute gradients on their batch, then gradients are averaged (e.g., via all‑reduce) before updating the model. This is equivalent to using a larger batch size.
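The equivalence with a larger batch can be verified on a toy problem. The sketch below assumes a one‑parameter linear model with mean‑squared‑error loss and made‑up data, split into equal shards across four simulated workers; averaging the per‑worker gradients reproduces the full‑batch gradient.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y)**2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Made-up toy data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

world_size = 4
shard = len(xs) // world_size  # equal-sized shards, one per simulated worker

local_grads = [
    mse_grad(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
    for r in range(world_size)
]
averaged = sum(local_grads) / world_size  # what an all-reduce average would produce
full_batch = mse_grad(w, xs, ys)

print(averaged, full_batch)
```

The match is exact only when the shards have equal size; with unequal shards, a simple average weights samples on smaller shards more heavily.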

**PyTorch distributed example (simplified):**

```python
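import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim

def train(rank, world_size):
    # Rendezvous settings for a single-machine run
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # MyModel, train_loader, criterion and epochs are placeholders defined elsewhere.
    # train_loader should use a DistributedSampler so each rank sees a different shard.
    model = MyModel().to(rank)
    model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])
    optimizer = optim.Adam(model.parameters())

    for epoch in range(epochs):
        for data, target in train_loader:
            optimizer.zero_grad()
            output = model(data.to(rank))
            loss = criterion(output, target.to(rank))
            loss.backward()  # gradients are all-reduced across ranks here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = 4  # one process per GPU
    mp.spawn(train, args=(world_size,), nprocs=world_size)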
```

**Explanation:**  
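`DistributedDataParallel` synchronises gradients across processes using all‑reduce during the backward pass. Throughput can scale close to linearly with the number of GPUs when communication overhead is small relative to computation, but the speed‑up is rarely perfectly linear.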

### 60.4.2 Asynchronous SGD

In asynchronous SGD, workers update a central parameter server independently, without waiting for others. This can be faster but may lead to stale gradients and convergence issues. It is less common today.

### 60.4.3 Distributed Optimisers for Large‑Scale NEPSE

If we were to train a deep neural network on decades of tick data for all 200+ NEPSE stocks, distributed optimisation would be essential. Using PyTorch's `DistributedDataParallel` on a multi‑GPU server or across a cluster would be the practical approach.

---

## 60.5 Multi‑Objective Optimization

In the NEPSE prediction system, we often have multiple, possibly conflicting objectives. For example, we might want to:

- Maximise prediction accuracy
- Minimise model complexity (to avoid overfitting)
- Maximise trading profit
- Minimise risk (drawdown)

**Multi‑objective optimisation** seeks to find a set of solutions that trade off between objectives, known as the **Pareto front**.

### 60.5.1 Scalarisation

The simplest approach is to combine objectives into a single scalar loss:

`L_total = λ₁ L₁ + λ₂ L₂ + ...`

The weights `λ` reflect preferences. For example, in training a trading agent, we could combine profit and risk:

`L = -profit + λ * volatility`

**Example: Training a neural network with L1 regularisation (sparsity)**

```python
import torch.nn as nn

def loss_fn(output, target, model, lambda_reg):
    mse = nn.MSELoss()(output, target)
    # L1 penalty over all parameters encourages sparse weights
    l1 = sum(p.abs().sum() for p in model.parameters())
    return mse + lambda_reg * l1
```

### 60.5.2 Pareto Optimisation

When we want to explore the entire trade‑off surface, we can use evolutionary algorithms like **NSGA‑II** (Non‑dominated Sorting Genetic Algorithm). These maintain a population of solutions and evolve them towards the Pareto front.

**Example with `pymoo` library:**

```python
import numpy as np
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.core.problem import Problem
from pymoo.optimize import minimize

class NEPSEProblem(Problem):
    def __init__(self):
        super().__init__(n_var=10, n_obj=2, xl=-5, xu=5)

    def _evaluate(self, x, out, *args, **kwargs):
        # x: 2-D array of decision variables, one row per candidate solution
        # compute_profit / compute_risk are placeholders returning one value per row
        # Both objectives are minimised, so profit is negated
        profit = compute_profit(x)
        risk = compute_risk(x)
        out["F"] = np.column_stack([-profit, risk])

problem = NEPSEProblem()
algorithm = NSGA2(pop_size=100)
res = minimize(problem, algorithm, ('n_gen', 50), seed=42, verbose=True)
```

**Explanation:**  
NSGA‑II evolves a population of solutions, each represented by a vector of decision variables (e.g., learning rate, network depth, regularisation strength). The result is a set of approximately Pareto‑optimal solutions, each offering a different trade‑off between profit and risk.
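To make "Pareto‑optimal" concrete, the following sketch filters a list of candidate solutions down to the non‑dominated ones. Each candidate is a tuple `(-profit, risk)`, both to be minimised; the numbers are hypothetical and chosen only to illustrate the definition.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Keep only the points that no other point dominates (all objectives minimised)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (-profit, risk) pairs for five candidate strategies
candidates = [(-10.0, 5.0), (-8.0, 3.0), (-6.0, 4.0), (-4.0, 1.0), (-9.0, 6.0)]
front = pareto_front(candidates)
print(front)
```

Here `(-6.0, 4.0)` is dropped because `(-8.0, 3.0)` offers both more profit and less risk, and `(-9.0, 6.0)` is dropped because `(-10.0, 5.0)` does. The three survivors cannot be ranked without a preference between profit and risk.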

---

## 60.6 Constrained Optimization

Sometimes we have constraints: the portfolio weights must sum to 1, or the predicted price must be positive, or the model's complexity must stay below a threshold.

### 60.6.1 Projected Gradient Descent

For simple constraints like box constraints or norm constraints, we can project the parameters back into the feasible set after each gradient step.

```python
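import torch

def project_weights(weights):
    # Heuristic projection: clip negatives to zero, then renormalise to sum to 1.
    # (Simple, but not the exact Euclidean projection onto the simplex.)
    weights = torch.clamp(weights, min=0)
    return weights / weights.sum().clamp_min(1e-12)

# In training loop
optimizer.step()
with torch.no_grad():
    model.fc.weight.copy_(project_weights(model.fc.weight))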
```
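For the "non‑negative and sums to 1" constraint, the exact Euclidean projection can be computed with a well‑known sort‑based procedure. The sketch below is a plain‑Python version for a single weight vector; the function name is illustrative, and adapting it to tensors inside a training loop is left as an assumption.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_i >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        candidate = (cumulative - 1.0) / j
        if u_j - candidate > 0:
            tau = candidate  # threshold from the largest j that satisfies the test
    # Shift every entry by the threshold and clip at zero
    return [max(x - tau, 0.0) for x in v]


print(project_to_simplex([2.0, 0.0]))
print(project_to_simplex([0.2, 0.2]))
print(project_to_simplex([0.3, -0.4, 1.7]))
```

Unlike clipping followed by renormalisation, this returns the closest feasible point in Euclidean distance, which is what the convergence theory of projected gradient descent assumes.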

### 60.6.2 Lagrange Multipliers

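For more complex constraints, we can use the method of Lagrange multipliers: each constraint is multiplied by its own multiplier and added to the objective, so the constrained problem becomes a saddle‑point problem over the parameters and the multipliers. In practice, deep learning code often fixes the multiplier in advance and treats the extra term as a penalty.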

**Example: Training with a constraint on model size (number of non‑zero weights)**

We can add a Lagrangian term: `L_total = L + λ (||θ||₀ - target)` but the L0 norm is non‑differentiable. Instead, we often use L1 regularisation as a proxy.

### 60.6.3 Augmented Lagrangian and ADMM

The **Augmented Lagrangian Method** and **Alternating Direction Method of Multipliers (ADMM)** are powerful techniques for constrained optimisation, often used in distributed settings.
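As a minimal illustration of the augmented Lagrangian idea, the sketch below solves a hypothetical toy problem: minimise `x² + 2y²` subject to `x + y = 1`. The objective, the penalty weight `rho`, and the iteration counts are assumptions chosen so that plain gradient descent is stable in the inner loop. Each outer iteration minimises the augmented Lagrangian over `(x, y)` and then updates the multiplier with the remaining constraint violation.

```python
def constraint(x, y):
    """Equality constraint c(x, y) = 0."""
    return x + y - 1.0


def augmented_lagrangian(rho=1.0, outer_iters=30, inner_iters=200, lr=0.1):
    x, y, lam = 0.0, 0.0, 0.0
    for _ in range(outer_iters):
        # Inner loop: gradient descent on f + lam * c + (rho / 2) * c**2
        for _ in range(inner_iters):
            c = constraint(x, y)
            grad_x = 2.0 * x + lam + rho * c
            grad_y = 4.0 * y + lam + rho * c
            x -= lr * grad_x
            y -= lr * grad_y
        # Multiplier update: move lam in proportion to the constraint violation
        lam += rho * constraint(x, y)
    return x, y, lam


x, y, lam = augmented_lagrangian()
print(x, y, lam)
```

Setting the gradient of the Lagrangian to zero by hand gives `x = 2/3`, `y = 1/3`, and `λ = -4/3`, which the iteration approaches. ADMM applies the same multiplier update but splits the inner minimisation into alternating blocks of variables.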

---

## 60.7 Hyperparameter Optimization

Hyperparameters (learning rate, batch size, network architecture) are not learned during training but must be tuned. This is itself an optimisation problem.

### 60.7.1 Grid Search and Random Search

Grid search exhaustively tries all combinations in a predefined grid. Random search samples combinations randomly and often finds good configurations faster.

```python
from sklearn.model_selection import RandomizedSearchCV
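from scipy.stats import loguniform, randint

# `estimator` is assumed to be a scikit-learn compatible model exposing these parameters
param_dist = {
    'learning_rate': loguniform(1e-4, 1e-2),  # log-uniform over [1e-4, 1e-2]
    'batch_size': randint(16, 128),           # integers 16..127 (upper bound is exclusive)
    'num_layers': randint(1, 5)               # integers 1..4 (upper bound is exclusive)
}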

random_search = RandomizedSearchCV(estimator, param_dist, n_iter=20, cv=3)
random_search.fit(X_train, y_train)
print(random_search.best_params_)
```
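Learning rates are usually searched on a logarithmic scale, so that each order of magnitude is sampled equally often. The following stdlib‑only sketch shows random search with log‑uniform sampling; `toy_validation_loss` is a hypothetical stand‑in for "train the model and return the validation loss", with its minimum placed arbitrarily at a learning rate of `1e-3` and three layers.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def toy_validation_loss(learning_rate, num_layers):
    """Hypothetical objective standing in for a real train-and-validate run."""
    return (math.log10(learning_rate) + 3.0) ** 2 + 0.1 * (num_layers - 3) ** 2


rng = random.Random(42)  # fixed seed for reproducibility
trials = []
for _ in range(20):
    config = {
        "learning_rate": sample_log_uniform(rng, 1e-4, 1e-2),
        "num_layers": rng.randint(1, 5),  # inclusive on both ends
    }
    trials.append((toy_validation_loss(**config), config))

best_loss, best_config = min(trials, key=lambda trial: trial[0])
print(best_loss, best_config)
```

Sampling uniformly on the raw scale instead would place about 90% of the draws between `1e-3` and `1e-2`, leaving the lower decade barely explored.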

### 60.7.2 Bayesian Optimization

Bayesian optimisation builds a probabilistic model (e.g., Gaussian process) of the objective function and uses it to select the most promising hyperparameters to evaluate next. When each evaluation is expensive, it typically needs fewer trials than random search to reach a comparable result.

**Example with `scikit-optimize`**

```python
from skopt import BayesSearchCV
from skopt.space import Real, Integer

search_spaces = {
    'learning_rate': Real(1e-4, 1e-2, prior='log-uniform'),
    'batch_size': Integer(16, 128),
    'num_layers': Integer(1, 5)
}

bayes_search = BayesSearchCV(estimator, search_spaces, n_iter=30, cv=3)
bayes_search.fit(X_train, y_train)
print(bayes_search.best_params_)
```

### 60.7.3 Population‑Based Training (PBT)

PBT, used in DeepMind's AlphaStar, trains a population of models in parallel, periodically copying the weights of better‑performing models and mutating their hyperparameters. It dynamically adapts hyperparameters during training.

**Simplified PBT loop:**

```python
# Population of models with different hyperparameters
population = [Model(hp) for hp in initial_hps]

for step in range(total_steps):
    for model in population:
        model.train_one_step()
    if step > 0 and step % exploit_interval == 0:
        # Sort ascending by performance (higher is better): worst first, best last
        population.sort(key=lambda m: m.performance)
        # Replace worst 20% with mutated copies of best 20%
        for i in range(len(population) // 5):
            best = population[-i-1]
            worst = population[i]
            worst.load_state_dict(best.state_dict())
            worst.hp = mutate(best.hp)
```

---

## 60.8 Neural Architecture Search (NAS)

NAS automates the design of neural network architectures. It is computationally expensive but can find architectures that outperform manual designs.

### 60.8.1 Reinforcement Learning‑based NAS

A controller (e.g., an RNN) proposes architectures, which are trained and evaluated. The validation accuracy is used as a reward to update the controller.

### 60.8.2 One‑shot / Weight‑sharing NAS

To reduce cost, one‑shot methods train a super‑network that contains all possible architectures, then search for the best sub‑network without retraining.

**Example with `autokeras`**

```python
import autokeras as ak

clf = ak.StructuredDataClassifier(max_trials=10)  # tries 10 architectures
clf.fit(X_train, y_train)
model = clf.export_model()
```

**Explanation:**  
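AutoKeras searches over different network architectures (number of layers, units, activation functions), training and scoring up to `max_trials` candidates. Because it evaluates candidates trial by trial, it illustrates automated architecture search in general rather than super‑network weight sharing specifically.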

---

## 60.9 Meta‑Learning

Meta‑learning, or "learning to learn", aims to design optimisers or initialisations that generalise across tasks. Two popular approaches:

- **Optimiser learning**: Learn an update rule (e.g., a neural network) that outperforms hand‑designed optimisers like Adam.
- **Model‑agnostic meta‑learning (MAML)**: Learn an initialisation that can be fine‑tuned quickly to new tasks with few gradient steps.

### 60.9.1 Learned Optimisers

Instead of using a fixed formula like Adam, we can train an RNN to propose parameter updates. The RNN itself is trained on many optimisation tasks to minimise total loss.

**Simplified concept:**

```python
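# LSTMOptimizer and apply_update are placeholders for a learned update rule.
# The optimizer network (e.g., an LSTM) takes gradients and its state, and outputs an update.
optimizer_net = LSTMOptimizer()
state = None

for step in range(T):
    model.zero_grad()   # clear gradients from the previous step
    loss = model(data)  # assumes the model returns a scalar loss
    loss.backward()
    grads = [p.grad for p in model.parameters()]
    update, state = optimizer_net(grads, state)
    apply_update(model, update)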
```

### 60.9.2 MAML for Fast Adaptation in NEPSE

Imagine we have many stocks, each with limited data. MAML could learn an initial model that adapts quickly to a new stock with just a few gradient steps. This is a form of transfer learning.

**MAML implementation sketch:**

```python
import torch

# clone, update, compute_loss and average_gradients are placeholder helpers.
# clone/update must keep the adapted parameters differentiable w.r.t. the meta-parameters.
def maml_inner_loop(model, support_set, inner_steps=5, inner_lr=0.01):
    adapted_model = clone(model)
    for _ in range(inner_steps):
        loss = compute_loss(adapted_model, support_set)
        # create_graph=True keeps the graph so the outer loop can differentiate through this step
        grad = torch.autograd.grad(loss, adapted_model.parameters(), create_graph=True)
        adapted_model = update(adapted_model, grad, inner_lr)
    return adapted_model

def maml_outer_loop(model, tasks, outer_lr=0.001):
    meta_grads = []
    for task in tasks:
        support, query = task
        adapted = maml_inner_loop(model, support)
        query_loss = compute_loss(adapted, query)
        meta_grads.append(torch.autograd.grad(query_loss, model.parameters()))
    # Average gradients and update meta-model
    avg_grad = average_gradients(meta_grads)
    model = update(model, avg_grad, outer_lr)
    return model
```

**Explanation:**  
MAML trains the initial model such that after one (or a few) gradient steps on a new task, it performs well. For NEPSE, tasks could be individual stocks, and we want a model that can quickly adapt to a new stock's price dynamics.
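The key step in MAML is differentiating the query loss through the inner update. The following sketch works this out for a deliberately tiny, hypothetical setting: one scalar parameter, tasks with loss `0.5 * (θ - target)²`, and a single inner step, so the chain rule can be written by hand instead of relying on automatic differentiation. The task targets are made up.

```python
def task_grad(theta, target):
    """Gradient of the task loss 0.5 * (theta - target)**2."""
    return theta - target


def maml_meta_grad(theta, target, inner_lr):
    """Gradient of the post-adaptation loss with respect to the initial theta."""
    adapted = theta - inner_lr * task_grad(theta, target)  # one inner step
    # Chain rule: d(adapted)/d(theta) = 1 - inner_lr, since the task Hessian is 1
    return (1.0 - inner_lr) * task_grad(adapted, target)


def train_maml(targets, inner_lr=0.1, outer_lr=0.5, meta_steps=200):
    theta = 0.0
    for _ in range(meta_steps):
        meta_grad = sum(maml_meta_grad(theta, a, inner_lr) for a in targets) / len(targets)
        theta -= outer_lr * meta_grad
    return theta


targets = [1.0, 2.0, 6.0]  # hypothetical per-task optima
theta_star = train_maml(targets)
print(theta_star)
```

In this symmetric toy case the learned initialisation is simply the mean of the task optima. Dropping the `(1 - inner_lr)` factor gives the first‑order approximation, which avoids second derivatives at the cost of an approximate meta‑gradient.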

---

## 60.10 Practical Tips and Pitfalls

1. **Start with Adam**: For most deep learning problems on NEPSE, Adam with default settings is a solid baseline. Tune learning rate first.
2. **Learning rate is the most important hyperparameter**: Always include it in hyperparameter searches.
3. **Use learning rate schedules**: Step decay or cosine annealing often improve final performance.
4. **Monitor loss curves**: If loss oscillates wildly, reduce learning rate or increase batch size.
5. **Gradient clipping**: For RNNs, clip gradients to avoid explosion.
   ```python
   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
   ```
6. **Batch normalisation**: Helps with optimisation by normalising layer inputs.
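7. **Weight decay**: Usually built into the optimiser. With plain SGD it is equivalent to L2 regularisation; with Adam the two differ, and AdamW applies the decay directly to the weights (decoupled weight decay).
8. **Distributed training**: Adjust the learning rate to the effective batch size (batch per worker × number of workers). A common heuristic is the linear scaling rule (double the batch size, double the learning rate), usually combined with warmup.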
9. **Early stopping**: Stop training when validation loss plateaus to avoid overfitting.
10. **Reproducibility**: Set random seeds and log all hyperparameters.
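
The gradient‑clipping tip above amounts to rescaling the whole gradient vector when its norm exceeds a threshold. The sketch below shows the idea on a plain list of numbers; it mirrors the concept behind `clip_grad_norm_` but is not PyTorch's implementation, and the function name is illustrative.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their Euclidean norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm  # small gradients pass through unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(clipped, norm_before)
```

Because every component is scaled by the same factor, clipping changes the length of the update but not its direction.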

---

## Chapter Summary

In this chapter, we explored advanced optimisation techniques that can significantly improve the training of models for the NEPSE prediction system. We covered:

- Gradient descent variants: SGD with momentum, NAG, AdaGrad, RMSProp, and Adam.
- Second‑order methods: Newton, BFGS, L‑BFGS for faster convergence on smaller problems.
- Adaptive optimisation and learning rate scheduling.
- Distributed optimisation for scaling to large datasets and models.
- Multi‑objective optimisation for trading off conflicting goals like profit and risk.
- Constrained optimisation using projections and Lagrange multipliers.
- Hyperparameter optimisation via random search, Bayesian optimisation, and PBT.
- Neural architecture search to automate network design.
- Meta‑learning for learning to learn and fast adaptation.

Choosing the right optimisation algorithm and tuning it properly is essential for achieving state‑of‑the‑art performance. For the NEPSE system, a combination of Adam with a cosine annealing schedule and hyperparameter tuning via Bayesian optimisation is a powerful and practical approach.

This chapter concludes **Part XIII: Emerging Technologies and Future Trends**. In the next part, we will move to **Appendices**, which provide reference material, checklists, and templates for building and deploying time‑series prediction systems.

---

**End of Chapter 60**