In [None]:
# === Environment Setup ===
import os, sys, math, time, random, json, textwrap, warnings
from typing import Callable
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess, linprog, basinhopping, differential_evolution
from mpl_toolkits.mplot3d import Axes3D
from IPython.display import Markdown, display

# JAX for automatic differentiation
try:
    import jax
    import jax.numpy as jnp
    from jax import grad, hessian, jit
    JAX_AVAILABLE = True
except ImportError: JAX_AVAILABLE = False

# CVXPY for disciplined convex programming
try:
    import cvxpy as cp
    CVXPY_AVAILABLE = True
except ImportError: CVXPY_AVAILABLE = False

# --- Configuration ---
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams.update({'figure.dpi': 130, 'font.size': 12, 'axes.titlesize': 'x-large',
    'axes.labelsize': 'large', 'xtick.labelsize': 'medium', 'ytick.labelsize': 'medium'})
np.set_printoptions(suppress=True, linewidth=120, precision=4)

# --- Utility Functions ---
def note(msg, **kwargs):
    display(Markdown(f"<div class='alert alert-info'>📝 {textwrap.fill(msg, width=100)}</div>"))
def sec(title):
    print(f"\n{100*'='}\n| {title.upper()} |\n{100*'='}")

note(f"Environment initialized. JAX: {JAX_AVAILABLE}, CVXPY: {CVXPY_AVAILABLE}")

# Part 2: Core Numerical Methods
## Chapter 2.5: Optimization

### Table of Contents
1.  [The Landscape of Unconstrained Optimization](#1.-The-Landscape-of-Unconstrained-Optimization-Algorithms)
    *   [1.1 First-Order vs. Second-Order vs. Derivative-Free Methods](#1.1-First-Order-vs.-Second-Order-vs.-Derivative-Free-Methods)
    *   [1.2 Quasi-Newton Methods: The Workhorse (BFGS)](#1.2-Quasi-Newton-Methods:-The-Workhorse-(BFGS))
2.  [Convex Optimization](#2.-Convex-Optimization)
    *   [2.1 The Power of Convexity](#2.1-The-Power-of-Convexity)
    *   [2.2 Disciplined Convex Programming with CVXPY](#2.2-Disciplined-Convex-Programming-with-CVXPY)
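3.  [Constrained Optimization](#3.-Constrained-Optimization)
    *   [3.1 The Importance of Constraints in Economics](#3.1-The-Importance-of-Constraints-in-Economics)
    *   [3.2 Karush-Kuhn-Tucker (KKT) Conditions Recap](#3.2-Karush-Kuhn-Tucker-(KKT)-Conditions-Recap)
    *   [3.3 Algorithm 1: Sequential Quadratic Programming (SQP)](#3.3-Algorithm-1:-Sequential-Quadratic-Programming-(SQP))
    *   [3.4 Algorithm 2: Interior-Point Methods](#3.4-Algorithm-2:-Interior-Point-Methods)
    *   [3.5 Comparison: SQP vs. Interior-Point](#3.5-Comparison:-SQP-vs.-Interior-Point)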
4.  [Application: Mean-Variance Portfolio Optimization](#4.-Application:-Mean-Variance-Portfolio-Optimization)
5.  [Global Optimization](#5.-Global-Optimization)
    *   [5.1 Basinhopping](#5.1-Basinhopping)
    *   [5.2 Differential Evolution](#5.2-Differential-Evolution)
6.  [Chapter Summary](#6.-Chapter-Summary)
7.  [Exercises](#7.-Exercises)

### Introduction: The Heart of Economic Reasoning
Optimization is the language of modern economics. The principle of constrained optimization is the foundation of our models of agent behavior: consumers maximize utility subject to a budget, firms maximize profits subject to a production technology, and social planners choose policies to maximize welfare. While analytical, "pen-and-paper" solutions are possible for simple, stylized models, most real-world problems in econometrics, structural estimation, and dynamic modeling are far too complex to be solved by hand. For these, we must rely on **numerical optimization**.

This notebook provides a deep dive into the fundamental algorithms that power modern computational economics. We will explore the theory behind how optimizers work, apply professional-grade tools to solve economic models, and introduce the theory of constrained optimization.

### 1. The Landscape of Unconstrained Optimization Algorithms
For a smooth function $f: \mathbb{R}^n \to \mathbb{R}$, we can use its derivatives to find a local minimum. The **Gradient**, $\nabla f(x)$, is a vector of first partial derivatives that points in the direction of steepest local ascent. The **Hessian**, $\nabla^2 f(x)$, is a matrix of second partial derivatives that describes the local curvature of the function.

Most optimization algorithms are iterative. They start with an initial guess $\mathbf{x}_0$ and generate a sequence $\mathbf{x}_1, \mathbf{x}_2, \dots$ that converges to a solution $\mathbf{x}^*$ where $\nabla f(\mathbf{x}^*) = 0$. The core of any algorithm is how it chooses the **search direction** ($\mathbf{p}_k$) and the **step size** ($\alpha_k$) to move from $\mathbf{x}_k$ to the next point: $\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{p}_k$.
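The update rule above can be sketched in a few lines. This is a minimal illustration, assuming steepest descent ($\mathbf{p}_k = -\nabla f(\mathbf{x}_k)$) and a backtracking line search for $\alpha_k$; the quadratic test function and the helper name `gradient_descent` are made up for the example.

```python
import numpy as np

def gradient_descent(f, grad_f, x0, tol=1e-8, max_iter=10_000):
    """Steepest descent with a backtracking (Armijo) line search."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:       # first-order condition met
            break
        p = -g                            # search direction p_k
        alpha = 1.0                       # trial step size alpha_k
        # Halve alpha until the sufficient-decrease condition holds
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
        x = x + alpha * p                 # x_{k+1} = x_k + alpha_k p_k
    return x, k

# Illustrative quadratic with its minimum at (1, -2)
f = lambda x: (x[0] - 1.0)**2 + 10.0 * (x[1] + 2.0)**2
grad_f = lambda x: np.array([2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)])

x_star, n_iter = gradient_descent(f, grad_f, x0=[0.0, 0.0])
print(x_star, n_iter)
```

Other algorithms in this chapter keep the same loop and differ mainly in how `p` and `alpha` are chosen.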

#### 1.1 First-Order vs. Second-Order vs. Derivative-Free Methods
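- **First-Order (Gradient Descent):** Uses only the gradient $\nabla f(x)$. Each iteration is cheap, but convergence is only linear and can be very slow on ill-conditioned problems.
- **Second-Order (Newton's Method):** Uses both the gradient and the Hessian $\nabla^2 f(x)$. Converges quadratically near a solution, but each iteration requires forming the $n \times n$ Hessian and solving a linear system, which is expensive for large $n$.
- **Derivative-Free (Nelder-Mead):** Uses only function values. Useful when derivatives are unavailable or the function is non-differentiable, but it typically needs many more function evaluations and scales poorly with dimension.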
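A rough way to see these trade-offs is to run one representative of each family on the Rosenbrock function with `scipy.optimize.minimize`. This is only a sketch: SciPy has no plain gradient-descent method, so BFGS stands in here as the gradient-only method, and Newton-CG (a truncated Newton method) stands in for the second-order family.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess

x0 = np.array([-1.2, 1.0])  # classic Rosenbrock starting point

runs = {
    "Nelder-Mead (function values only)": minimize(rosen, x0, method="Nelder-Mead"),
    "BFGS (gradient only)": minimize(rosen, x0, method="BFGS", jac=rosen_der),
    "Newton-CG (gradient + Hessian)": minimize(rosen, x0, method="Newton-CG",
                                               jac=rosen_der, hess=rosen_hess),
}
for name, res in runs.items():
    print(f"{name:36s} x={res.x}  function evaluations={res.nfev}")
```

The evaluation counts are not directly comparable across methods (gradient and Hessian calls are counted separately), but they give a feel for how much work each family does.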

#### 1.2 Quasi-Newton Methods: The Workhorse (BFGS)
Computing and inverting the Hessian is often computationally prohibitive. **Quasi-Newton methods** provide a powerful compromise. They avoid the explicit Hessian by instead building up an *approximation* to its inverse at each step, using only gradient information. The most popular of these methods is **BFGS** (Broyden–Fletcher–Goldfarb–Shanno), which is the workhorse of unconstrained optimization and the default in many scientific libraries.

Intuitively, you can think of BFGS as a multi-dimensional generalization of the Secant method from root-finding. Just as the Secant method uses the slope between two points to approximate a derivative, BFGS uses the change in the gradient between two iterates to update its approximation of the inverse Hessian.
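The secant analogy can be made concrete. The sketch below writes out the standard BFGS update of the inverse-Hessian approximation $H$ and checks that the updated matrix satisfies the secant condition $H_{k+1} y_k = s_k$. The quadratic test function, the two iterates, and the helper name `bfgs_update` are illustrative assumptions.

```python
import numpy as np

def bfgs_update(H, s, y):
    """BFGS update of the inverse-Hessian approximation H.

    s = x_{k+1} - x_k (step taken), y = grad_{k+1} - grad_k (gradient change).
    """
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)

# Illustrative quadratic f(x) = 0.5 x'Ax, so grad f(x) = A x
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda x: A @ x

x_old, x_new = np.array([1.0, 1.0]), np.array([0.5, 0.2])
s, y = x_new - x_old, grad(x_new) - grad(x_old)

H_new = bfgs_update(np.eye(2), s, y)
print(np.allclose(H_new @ y, s))  # secant condition holds
```

Only gradients enter the update; the true Hessian `A` is used here solely to generate the gradient values.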

![Optimizer Paths on Rosenbrock Function](../images/02-Numerical-Methods/optimizer_paths.png)

### 2. Convex Optimization

#### 2.1 The Power of Convexity
A general non-linear optimization problem can be very hard to solve. The function may have many local minima, and algorithms can get stuck. **Convex optimization** is a special subfield that deals with minimizing a **convex function** over a **convex set**.

A function $f$ is convex if the line segment between any two points on its graph lies on or above the graph. A set is convex if the line segment between any two points in the set is also in the set.

The power of convexity comes from a simple, profound result: **for a convex optimization problem, any local minimum is also a global minimum.** This eliminates the central challenge of multi-modal functions and allows for the development of extremely efficient and reliable solvers.
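For reference, the definition of a convex function can be written as an inequality, and the local-equals-global result follows from it in two lines. This is a sketch of the standard argument, with $C$ denoting the convex feasible set.

$$ f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y) \quad \text{for all } x, y \in C,\ \theta \in [0, 1] $$

Suppose $x^*$ is a local minimum but some feasible $y$ has $f(y) < f(x^*)$. For any small $\theta > 0$, the point $z = (1-\theta) x^* + \theta y$ is feasible and arbitrarily close to $x^*$, yet

$$ f(z) \le (1-\theta) f(x^*) + \theta f(y) < f(x^*), $$

which contradicts local optimality of $x^*$. Hence no such $y$ exists.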

#### 2.2 Disciplined Convex Programming with CVXPY
**Disciplined Convex Programming (DCP)** is a framework for formally verifying that a problem is convex. The `CVXPY` library implements this framework, allowing users to express a problem in a natural, high-level syntax. `CVXPY` then automatically checks if the problem satisfies the DCP ruleset and, if so, converts it into a standard form that can be passed to highly-optimized convex solvers (like `ECOS` or `SCS`).

This approach is more robust and often simpler than using a general-purpose non-linear solver for problems that can be formulated as convex.

In [None]:
sec("The Consumer Problem via Convex Optimization")
if not CVXPY_AVAILABLE:
    note("CVXPY not installed. Skipping this section.")
else:
    # 1. Define parameters and CVXPY variables
    alpha, p1, p2, M = 0.5, 1, 2, 100
    x1 = cp.Variable(pos=True) # pos=True is a constraint x1 >= 0
    x2 = cp.Variable(pos=True)
    
    # 2. Define the convex objective and constraints
    # Maximizing u(x) is equivalent to maximizing log u(x) = alpha*log(x1) + (1-alpha)*log(x2),
    # which is concave. Maximizing a concave function over a convex set is a convex problem.
    utility = alpha * cp.log(x1) + (1 - alpha) * cp.log(x2)
    objective = cp.Maximize(utility)
    constraints = [p1*x1 + p2*x2 <= M]
    
    # 3. Formulate and solve the problem
    problem = cp.Problem(objective, constraints)
    problem.solve()
    
    note(f"Problem status: {problem.status}")
    note(f"Optimal consumption: x1={x1.value:.2f}, x2={x2.value:.2f}")
    note(f"Max utility (log-utility): {problem.value:.2f}")
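As a sanity check on the solver output, the Cobb-Douglas problem has a closed-form solution: the first-order conditions imply that the consumer spends the share $\alpha$ of income on good 1 and $1-\alpha$ on good 2. The snippet below evaluates those formulas with the same parameter values as the cell above.

```python
import math

alpha, p1, p2, M = 0.5, 1, 2, 100   # same parameters as the CVXPY cell

# Cobb-Douglas demands: budget shares alpha and (1 - alpha)
x1_star = alpha * M / p1
x2_star = (1 - alpha) * M / p2
log_utility = alpha * math.log(x1_star) + (1 - alpha) * math.log(x2_star)

print(x1_star, x2_star, round(log_utility, 4))  # 50.0 25.0 3.5654
```

The CVXPY result should agree with these values up to solver tolerance.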

### The Advantage of Supplying Gradients

For complex problems, supplying the optimizer with exact gradients and Hessians (derived by hand or computed by automatic differentiation, e.g., via JAX) can be both faster and more accurate than relying on finite-difference approximations, which cost roughly one extra function evaluation per dimension for every gradient. Let's demonstrate this:

In [None]:
import timeit

sec("Performance: Finite Differences vs. JAX Gradients")
if JAX_AVAILABLE:
    # Use 64-bit floats so JAX matches SciPy's double precision
    jax.config.update("jax_enable_x64", True)

    # Define the Rosenbrock function in JAX
    def rosen_jax(x):
        return jnp.sum(100.0 * (x[1:] - x[:-1]**2.0)**2.0 + (1.0 - x[:-1])**2.0)

    # JIT-compile the function and its gradient
    rosen_jit = jit(rosen_jax)
    rosen_grad_jit = jit(grad(rosen_jax))
    x0_large = np.random.rand(100) * 2 - 1

    # Timing with finite-difference approximation
    time_no_grad = timeit.timeit(lambda: minimize(rosen, x0_large, method='BFGS'), number=10)
    # Timing with JAX-provided gradient
    time_with_grad = timeit.timeit(lambda: minimize(rosen_jit, x0_large, method='BFGS', jac=rosen_grad_jit), number=10)
    note(f'Time without JAX gradient: {time_no_grad:.4f}s\nTime with JAX gradient: {time_with_grad:.4f}s')
else:
    note("JAX not installed. Skipping this section.")
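The accuracy side of the argument can be checked without JAX. The sketch below compares SciPy's forward-difference gradient (`approx_fprime`) with the closed-form Rosenbrock gradient (`rosen_der`) at an arbitrary test point; the point and the step size are illustrative choices.

```python
import numpy as np
from scipy.optimize import approx_fprime, rosen, rosen_der

x = np.array([-1.2, 1.0, 0.5, 2.0])      # arbitrary test point

g_exact = rosen_der(x)                    # closed-form gradient
g_fd = approx_fprime(x, rosen, 1.5e-8)    # forward differences, step near sqrt(machine eps)

rel_err = np.linalg.norm(g_fd - g_exact) / np.linalg.norm(g_exact)
print(f"relative error of finite-difference gradient: {rel_err:.2e}")
print(f"function evaluations per finite-difference gradient: {x.size + 1}")
```

The finite-difference gradient is accurate to only about half of the available digits and needs one extra function evaluation per dimension, which is where the cost in the timing experiment comes from.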

### 3. Constrained Optimization

#### 3.1 The Importance of Constraints in Economics

Nearly every interesting economic problem is a constrained optimization problem. Agents maximize utility subject to a budget constraint. Firms maximize profits subject to a production technology. Planners maximize welfare subject to resource constraints.

While we have studied the theoretical conditions for optimality (the Karush-Kuhn-Tucker conditions), we have not yet explored the practical algorithms used to solve these problems computationally. This section fills that gap by introducing two of the most powerful and widely used algorithms for non-linear constrained optimization: Sequential Quadratic Programming (SQP) and Interior-Point methods.

#### 3.2 Karush-Kuhn-Tucker (KKT) Conditions Recap

These conditions are named after Harold W. Kuhn and Albert W. Tucker, who formalized them in 1951, and William Karush, who had stated the conditions in his unpublished master's thesis in 1939. They generalize the method of Lagrange multipliers to include inequality constraints.

Recall the standard non-linear programming problem:
$$ \min_{x} f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \quad h_j(x) = 0 $$

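The KKT conditions are first-order necessary conditions for optimality, valid when a constraint qualification holds (for example, linear independence of the active constraint gradients). They state that at an optimal point $x^*$, there must exist multipliers $\mu_i^*$ and $\lambda_j^*$ such that: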
1.  **Stationarity:** $\nabla f(x^*) + \sum_i \mu_i^* \nabla g_i(x^*) + \sum_j \lambda_j^* \nabla h_j(x^*) = 0$
2.  **Primal Feasibility:** $g_i(x^*) \le 0$, $h_j(x^*) = 0$
3.  **Dual Feasibility:** $\mu_i^* \ge 0$
4.  **Complementary Slackness:** $\mu_i^* g_i(x^*) = 0$

Modern optimization algorithms can be thought of as sophisticated methods for finding a point $(x^*, \mu^*, \lambda^*)$ that satisfies this system of equations and inequalities.
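To make the four conditions concrete, the sketch below checks them numerically for an illustrative Cobb-Douglas consumer problem: minimize $f(x, y) = -x^\alpha y^{1-\alpha}$ subject to $g(x, y) = p_x x + p_y y - I \le 0$. The parameter values are assumptions chosen for the example, and the non-negativity constraints are ignored because they are slack at the optimum.

```python
import numpy as np

alpha, px, py, I = 0.5, 2.0, 5.0, 100.0   # illustrative parameters

# Candidate optimum from the Cobb-Douglas demand formulas
x_star = np.array([alpha * I / px, (1 - alpha) * I / py])

# f(x, y) = -x^alpha * y^(1-alpha),  g(x, y) = px*x + py*y - I <= 0
grad_f = -np.array([
    alpha * x_star[0]**(alpha - 1) * x_star[1]**(1 - alpha),
    (1 - alpha) * x_star[0]**alpha * x_star[1]**(-alpha),
])
grad_g = np.array([px, py])
g_val = grad_g @ x_star - I

# Solve the first stationarity equation for mu, then check everything else
mu = -grad_f[0] / grad_g[0]

print("stationarity residual:       ", grad_f + mu * grad_g)
print("primal feasibility g(x*):    ", g_val)
print("dual feasibility mu:         ", mu)
print("complementary slackness mu*g:", mu * g_val)
```

The multiplier `mu` is the marginal utility of income: it measures how much maximum utility rises when the budget is relaxed by one unit.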

#### 3.3 Algorithm 1: Sequential Quadratic Programming (SQP)

SQP is one of the most successful methods for non-linear constrained optimization. The core idea is to solve a sequence of simpler, quadratic programming (QP) subproblems that approximate the original problem.

At each iteration $k$, the algorithm:
1.  **Approximates the objective function** with a quadratic function around the current iterate $x_k$. This is done using a Taylor expansion of the Lagrangian function.
2.  **Linearizes the constraints** around the current iterate $x_k$.
3.  **Solves the resulting QP subproblem.** This QP is easier to solve than the original non-linear problem. The solution to the QP gives a search direction, $p_k$.
4.  **Performs a line search** to find a step size $\alpha_k$ that makes sufficient progress.
5.  **Updates the iterate:** $x_{k+1} = x_k + \alpha_k p_k$.

This process is repeated until the KKT conditions are satisfied to a desired tolerance. SQP is particularly effective for problems with expensive function evaluations, as it can make rapid progress.
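The steps above can be sketched for a deliberately simplified case. The example assumes a single equality constraint (the budget holds with equality), the exact Hessian, and full steps with no line search, so it illustrates the QP-subproblem idea rather than reproducing a production SQP code such as SLSQP. The objective is the log-transformed Cobb-Douglas utility with illustrative parameters.

```python
import numpy as np

alpha, px, py, I = 0.5, 2.0, 5.0, 100.0
A = np.array([[px, py]])                  # Jacobian of h(x) = px*x + py*y - I

def grad_f(x):                            # f(x) = -(alpha*log x + (1-alpha)*log y)
    return -np.array([alpha / x[0], (1 - alpha) / x[1]])

def hess_f(x):                            # equals the Lagrangian Hessian (h is linear)
    return np.diag([alpha / x[0]**2, (1 - alpha) / x[1]**2])

x = np.array([10.0, 10.0])
for k in range(50):
    h = A @ x - I                         # constraint residual, shape (1,)
    # QP subproblem: min 0.5 p'Hp + grad_f'p  s.t.  A p + h = 0.
    # Its optimality conditions form a single linear (KKT) system.
    K = np.block([[hess_f(x), A.T], [A, np.zeros((1, 1))]])
    rhs = -np.concatenate([grad_f(x), h])
    sol = np.linalg.solve(K, rhs)
    p, lam = sol[:2], sol[2]
    x = x + p                             # full step (alpha_k = 1), no line search
    if np.linalg.norm(p) < 1e-10:
        break

print(x, lam)
```

With inequality constraints, the subproblem becomes a genuine QP that needs an active-set or interior-point QP solver, and a line search or trust region is added for robustness.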

##### Implementation with `scipy.optimize`

In [None]:
import numpy as np
from scipy.optimize import minimize

# --- Problem Setup ---
alpha = 0.5
px, py = 2, 5
I = 100

# Objective function (to be minimized)
fun = lambda x: -( (x[0]**alpha) * (x[1]**(1-alpha)) )

# Constraints
# Budget constraint in SciPy's 'ineq' convention: fun(x) >= 0
constraints = [{'type': 'ineq', 'fun': lambda x: I - px*x[0] - py*x[1]}]
bounds = ((0, None), (0, None)) # x and y must be non-negative

# Initial guess
x0 = (10, 10)

# --- Solve with SQP ---
result_sqp = minimize(fun, x0, method='SLSQP', bounds=bounds, constraints=constraints)

if result_sqp.success:
    print("SQP Solution:")
    print(f"  Optimal x: {result_sqp.x[0]:.2f}")
    print(f"  Optimal y: {result_sqp.x[1]:.2f}")
    print(f"  Max Utility: {-result_sqp.fun:.2f}")
else:
    print(f"SQP failed: {result_sqp.message}")

#### 3.4 Algorithm 2: Interior-Point Methods

Interior-Point methods take a different approach. They were originally developed for linear programming but have been extended to non-linear problems. The key idea is to handle inequality constraints by adding them to the objective function as a **barrier term**.

For example, a constraint $g(x) \le 0$ can be replaced by adding a term like $-\mu \log(-g(x))$ to the objective function. This barrier term is small when $x$ is far from the boundary ($g(x) \ll 0$) but shoots to infinity as $x$ approaches the boundary ($g(x) \to 0$).

The algorithm then solves a sequence of unconstrained (or equality-constrained) problems with a decreasing barrier parameter $\mu$. As $\mu \to 0$, the solution to the subproblem converges to the solution of the original constrained problem.

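These methods are called "interior-point" because the classical barrier formulation keeps every iterate strictly feasible with respect to the inequality constraints. Many modern implementations relax this by introducing slack variables and only keeping the slacks strictly positive.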
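The barrier idea can be traced on a one-dimensional toy problem: minimize $x$ subject to $g(x) = 1 - x \le 0$, whose solution is $x^* = 1$. The barrier subproblem $x - \mu \log(x - 1)$ has the closed-form minimizer $x(\mu) = 1 + \mu$, so the numerical path can be checked directly. The problem and the sequence of $\mu$ values are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy problem: minimize x subject to g(x) = 1 - x <= 0 (solution x* = 1)
def barrier_objective(x, mu):
    return x - mu * np.log(x - 1.0)       # f(x) - mu * log(-g(x))

path = []
for mu in [1.0, 0.1, 0.01, 0.001]:
    res = minimize_scalar(barrier_objective, bounds=(1.0 + 1e-9, 10.0),
                          args=(mu,), method="bounded")
    path.append((mu, res.x))
    print(f"mu={mu:<6} x(mu)={res.x:.4f}   (closed form: {1 + mu:.4f})")
```

Every subproblem solution lies strictly inside the feasible region and approaches the boundary solution as $\mu$ shrinks.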

##### Implementation with `scipy.optimize`

In [None]:
# --- Solve with Interior-Point (trust-constr) ---
result_ip = minimize(fun, x0, method='trust-constr', bounds=bounds, constraints=constraints)

if result_ip.success:
    print("Interior-Point Solution:")
    print(f"  Optimal x: {result_ip.x[0]:.2f}")
    print(f"  Optimal y: {result_ip.x[1]:.2f}")
    print(f"  Max Utility: {-result_ip.fun:.2f}")
else:
    print(f"Interior-Point failed: {result_ip.message}")

#### 3.5 Comparison: SQP vs. Interior-Point

Both methods found the correct analytical solution, which is $x = \frac{\alpha I}{p_x} = 25$ and $y = \frac{(1-\alpha)I}{p_y} = 10$.

So when should you choose one over the other?

| Feature | Sequential Quadratic Programming (SQP) | Interior-Point Methods |
|---|---|---|
| **Best For** | Problems with expensive function/gradient evaluations. | Large-scale problems (many variables and constraints). |
| **Feasibility** | Iterates are not necessarily feasible. | Classical barrier variants keep iterates strictly feasible w.r.t. inequality constraints. |
| **Speed** | Can be very fast if good QP solver is available. | Often require more, but cheaper, iterations. |
| **Robustness** | Can be sensitive to the initial guess. | Generally very robust. |

For most problems encountered in economics, both are excellent choices. `scipy`'s `SLSQP` is often a great first choice due to its speed and versatility.

### 4. Application: Mean-Variance Portfolio Optimization
A cornerstone of modern finance is portfolio optimization. Given a set of assets with expected returns $\boldsymbol{\mu}$ and a covariance matrix $\boldsymbol{\Sigma}$, the classic Markowitz problem is to find the portfolio weights $\mathbf{w}$ that minimize risk (variance, $\mathbf{w}^T \boldsymbol{\Sigma} \mathbf{w}$) for a given target level of expected return ($\boldsymbol{\mu}^T \mathbf{w} \ge R_0$). By solving this for a range of target returns, we can trace out the **efficient frontier**.

![Mean-Variance Efficient Frontier](../images/02-Numerical-Methods/efficient_frontier.png)
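A minimal sketch of how such a frontier could be traced with `scipy.optimize.minimize` follows. The expected returns and covariance matrix are hypothetical numbers for three assets, and the no-short-selling bounds and the SLSQP solver are assumptions of this example, not necessarily the setup behind the figure.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical inputs for three assets (illustration only)
mu = np.array([0.05, 0.08, 0.12])
Sigma = np.array([[0.010, 0.002, 0.001],
                  [0.002, 0.030, 0.005],
                  [0.001, 0.005, 0.060]])

def min_variance_weights(R0):
    """Minimize w'Sigma w  s.t.  sum(w) = 1, mu'w >= R0, 0 <= w <= 1."""
    n = len(mu)
    cons = [{"type": "eq", "fun": lambda w: w.sum() - 1.0},
            {"type": "ineq", "fun": lambda w: mu @ w - R0}]
    res = minimize(lambda w: w @ Sigma @ w, np.full(n, 1.0 / n),
                   method="SLSQP", bounds=[(0.0, 1.0)] * n,
                   constraints=cons, options={"ftol": 1e-12})
    return res.x

frontier = []
for R0 in np.linspace(0.06, 0.11, 6):
    w = min_variance_weights(R0)
    frontier.append((R0, np.sqrt(w @ Sigma @ w)))
    print(f"target return {R0:.3f} -> std dev {frontier[-1][1]:.4f}")
```

The tight `ftol` matters here because portfolio variances are small numbers, so the default tolerance can stop the solver early. Plotting the standard deviations against the target returns gives the frontier.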

### 5. Global Optimization
A critical weakness of the methods discussed so far is that they are **local optimizers**. They are only guaranteed to find a **local minimum**. If a function has multiple minima, the result will depend entirely on the starting point. Global optimization is a much harder problem.

#### 5.1 Basinhopping
`basinhopping` combines a local optimizer (like BFGS) with a random perturbation step, allowing it to "hop" between different basins of attraction in the function landscape.
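The difference from a purely local method shows up on a one-dimensional function with many local minima. The sketch below uses an illustrative test function whose global minimum lies near $x = -0.195$; the starting point and iteration count are assumptions of the example.

```python
import numpy as np
from scipy.optimize import basinhopping, minimize

# 1-D test function with many local minima (global minimum near x = -0.195)
def f(x):
    x = np.atleast_1d(x)[0]
    return np.cos(14.5 * x - 0.3) + (x + 0.2) * x

x0 = [1.0]
local = minimize(f, x0, method="BFGS")
hopped = basinhopping(f, x0, niter=200, minimizer_kwargs={"method": "BFGS"})

print(f"BFGS alone:   x={local.x[0]: .4f}, f={local.fun: .4f}")
print(f"basinhopping: x={hopped.x[0]: .4f}, f={hopped.fun: .4f}")
```

Because the perturbations are random, results can vary from run to run; more iterations raise the chance of reaching the global basin but never guarantee it.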

#### 5.2 Differential Evolution
**Differential Evolution** is a powerful, population-based global optimization algorithm. It maintains a population of candidate solutions and iteratively creates new candidates by combining existing ones. It is often very effective for difficult, non-convex, and non-differentiable problems.

In [None]:
sec("Global Optimization with Differential Evolution")
# The Eggholder function is a difficult test case with many local minima.
# Its known global minimum on this domain is f(512, 404.2319) ≈ -959.6407.
def eggholder(x):
    return (-(x[1] + 47) * np.sin(np.sqrt(np.abs(x[0]/2 + (x[1] + 47))))
            - x[0] * np.sin(np.sqrt(np.abs(x[0] - (x[1] + 47)))))

bounds = [(-512, 512), (-512, 512)]
result = differential_evolution(eggholder, bounds)
# DE is stochastic: a run may stop at a good local minimum rather than the global one
note(f"Best minimum found by Differential Evolution at x={result.x}, f(x)={result.fun:.4f}")
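To show what "combining existing candidates" means, here is a bare-bones hand-written version of the algorithm. It is a teaching sketch of one common variant (often called rand/1/bin), not SciPy's implementation; the Rastrigin test function, the helper name `simple_de`, and all parameter values are illustrative assumptions.

```python
import numpy as np

def rastrigin(x):
    return 10 * len(x) + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

def simple_de(f, bounds, pop_size=40, F=0.7, CR=0.9, n_gen=300, seed=0):
    """Bare-bones differential evolution (rand/1/bin) for illustration."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    pop = rng.uniform(lo, hi, size=(pop_size, len(lo)))
    fit = np.array([f(ind) for ind in pop])
    for _ in range(n_gen):
        for i in range(pop_size):
            # Mutation: combine three distinct other population members
            others = np.delete(np.arange(pop_size), i)
            a, b, c = pop[rng.choice(others, 3, replace=False)]
            mutant = np.clip(a + F * (b - c), lo, hi)
            # Crossover: mix mutant and current member coordinate-wise
            mask = rng.random(len(lo)) < CR
            mask[rng.integers(len(lo))] = True   # keep at least one mutant coordinate
            trial = np.where(mask, mutant, pop[i])
            # Selection: keep the trial only if it is no worse
            f_trial = f(trial)
            if f_trial <= fit[i]:
                pop[i], fit[i] = trial, f_trial
    best = np.argmin(fit)
    return pop[best], fit[best]

x_best, f_best = simple_de(rastrigin, [(-5.12, 5.12)] * 2)
print(x_best, f_best)
```

No gradients appear anywhere, which is why the method also applies to non-differentiable objectives. The Rastrigin function has its global minimum of 0 at the origin, with local minima near every integer grid point.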

### 6. Chapter Summary
- **Know Your Algorithm:** Optimization involves a trade-off between speed and robustness. The choice of algorithm (gradient-based, Newton, derivative-free) depends on the problem structure.
- **Convexity is a Superpower:** Convex problems are special because any local minimum is global. For such problems, specialized tools like `CVXPY` are more robust and reliable than general non-linear solvers.
- **Constraints are Key:** The KKT conditions provide the theoretical foundation for constrained optimization, with a rich geometric interpretation.
- **Local vs. Global:** Most standard optimizers are *local*. For functions with multiple minima, global optimization techniques like `basinhopping` or `differential_evolution` reduce the risk of getting stuck in a suboptimal solution, although they offer no guarantee of finding the global minimum.
- **Use the Right Tool:** Modern scientific computing relies on a combination of tools: `scipy.optimize` for general-purpose optimization, `JAX` for providing fast and accurate derivatives, and `CVXPY` for disciplined convex programming.

### 7. Exercises

1.  **CVXPY for Cost Minimization:** A firm has a production function $Q = K^\alpha L^{1-\alpha}$. It wants to produce a target quantity $Q_0=100$ at minimum cost $C = rK + wL$. This is a convex optimization problem. Formulate and solve it using `CVXPY` for $\alpha=0.5, r=1, w=2$. The dual variable (Lagrange multiplier) on the production constraint can be interpreted as the marginal cost. What is the marginal cost at the optimum?

2.  **Portfolio with Short-Selling:** Modify the mean-variance portfolio optimization problem to allow for short-selling (i.e., remove the `bounds` that constrain weights to be positive). Re-calculate and plot the efficient frontier. How does it compare to the one with the no-short-selling constraint? Provide an intuition for the difference.

3.  **Structural Estimation:** A researcher observes an agent making a consumption choice `c` when faced with a wage `w`. The agent's utility is $U(c, l) = \log(c) + \theta \log(l)$, where leisure is $l=1-h$ and hours worked are $h$. The budget constraint is $c = wh$. The researcher believes the agent is maximizing utility.
    - **Task:** Write a function `solve_agent(theta, w)` that solves the agent's problem to find optimal consumption `c_star`. Then, write a function `objective(theta, observed_c, w)` that calculates the squared error `(c_star - observed_c)**2`. If the observed choice was `c=1.5` when `w=2`, use `scipy.optimize.minimize` to find the value of `theta` that makes your model's prediction match the observed data.

4.  **Diet Problem (Linear Programming):** A student is trying to create the cheapest possible lunch that meets minimum nutritional requirements. The available foods are 'Pizza' and 'Salad'.
    - Pizza: 500 calories, 10g protein, costs $3.
    - Salad: 150 calories, 5g protein, costs $2.
    The student needs at least 800 calories and 20g of protein. Formulate this as a linear programming problem and use `scipy.optimize.linprog` to find the optimal number of pizza slices and salad servings.

5.  **Global Optimization:** The six-hump camel function, $f(x, y) = (4 - 2.1x^2 + x^4/3)x^2 + xy + (-4 + 4y^2)y^2$, is a standard test function for global optimization with six local minima, two of which are global. Implement it as a Python function `six_hump_camel`, then use `differential_evolution` to find a global minimum on $x \in [-3, 3]$, $y \in [-2, 2]$. How does the result compare to what you get from a standard local optimizer like `BFGS` starting from `x0 = (1, 1)`?