# Notebook 1: Dynamic Programming and Stochastic Control

**Author:** Divyansh Atri

## Overview

This notebook introduces the fundamental concepts of stochastic optimal control:
- Controlled stochastic differential equations (SDEs)
- Cost functionals and value functions
- The dynamic programming principle
- Optimal feedback control policies

These concepts form the foundation for deriving and solving the Hamilton-Jacobi-Bellman (HJB) equation in subsequent notebooks.

In [None]:
import sys

import numpy as np
import matplotlib.pyplot as plt

# Make modules in the parent directory importable
sys.path.append('..')

# Set plotting style; the 'seaborn-v0_8-*' style names require Matplotlib >= 3.6
if 'seaborn-v0_8-darkgrid' in plt.style.available:
    plt.style.use('seaborn-v0_8-darkgrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

print("Libraries imported successfully")

## 1. Stochastic Differential Equations

### 1.1 Definition

A **controlled stochastic differential equation (SDE)** describes the evolution of a state process $X_t \in \mathbb{R}^d$ under the influence of:
1. **Deterministic drift** $b(X_t, u_t)$ - controlled dynamics
2. **Stochastic diffusion** $\sigma(X_t)$ - random fluctuations

Mathematically:

$$
dX_t = b(X_t, u_t) \, dt + \sigma(X_t) \, dW_t, \quad X_0 = x_0
$$

where:
- $X_t$ is the state at time $t$
- $u_t$ is the control input at time $t$
- $W_t$ is a standard $d$-dimensional Brownian motion (Wiener process)
- $b: \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^d$ is the drift coefficient
- $\sigma: \mathbb{R}^d \to \mathbb{R}^{d \times d}$ is the diffusion coefficient

### 1.2 Interpretation

The SDE can be understood as:
- **Drift term** $b(X_t, u_t) dt$: Deterministic evolution we can influence via control $u_t$
- **Diffusion term** $\sigma(X_t) dW_t$: Random perturbations we cannot control

The control $u_t$ allows us to steer the system, but we cannot eliminate the randomness entirely.
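As a minimal sketch of how such an equation can be simulated, the Euler-Maruyama scheme replaces $dt$ by a small step and $dW_t$ by a Gaussian increment with variance $dt$. The function name `euler_maruyama`, the linear drift, the constant diffusion matrix, and the zero policy below are illustrative assumptions, not part of the notebook's own cells.

```python
import numpy as np

def euler_maruyama(b, sigma, policy, x0, T, n_steps, rng):
    """Simulate one path of dX = b(X, u) dt + sigma(X) dW with u = policy(t, X).

    b(x, u) returns a vector of shape (d,); sigma(x) returns a (d, d) matrix.
    """
    x0 = np.asarray(x0, dtype=float)
    d = x0.shape[0]
    dt = T / n_steps
    t = np.linspace(0.0, T, n_steps + 1)
    X = np.empty((n_steps + 1, d))
    X[0] = x0
    for n in range(n_steps):
        u = policy(t[n], X[n])                       # control uses only the current state
        dW = np.sqrt(dt) * rng.standard_normal(d)    # Brownian increment ~ N(0, dt * I)
        X[n + 1] = X[n] + b(X[n], u) * dt + sigma(X[n]) @ dW
    return t, X

# Illustrative 2D example (every choice here is an assumption)
b_demo = lambda x, u: -x + u                 # drift pulled toward 0, shifted by u
sigma_demo = lambda x: 0.3 * np.eye(2)       # constant, uncontrolled noise level
policy_demo = lambda t, x: np.zeros(2)       # no control applied

t_em, X_em = euler_maruyama(b_demo, sigma_demo, policy_demo,
                            x0=[1.0, -1.0], T=2.0, n_steps=400,
                            rng=np.random.default_rng(0))
print("Final state:", X_em[-1])
```

Passing a zero diffusion matrix reduces the scheme to the explicit Euler method for the ODE $\dot{x} = b(x, u)$, which is a convenient sanity check.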

### 1.3 Example: Brownian Motion with Drift

Consider the simplest case: 1D Brownian motion with constant drift and diffusion:

$$
dX_t = \mu \, dt + \sigma \, dW_t
$$

This has **no control** yet. Let's simulate it to build intuition.

In [None]:
def simulate_brownian_motion(mu, sigma, T, dt, n_paths=5, seed=42):
    """
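    Simulate Brownian motion with drift using the Euler-Maruyama scheme.

    dX_t = μ dt + σ dW_t,  X_0 = 0

    Returns the time grid t, shape (n_steps,), and the paths X, shape (n_paths, n_steps).
    """
    np.random.seed(seed)

    # Build the grid from an integer step count; np.arange with a float step
    # can return one point too many or too few because of rounding.
    n_steps = int(round(T / dt)) + 1
    t = np.linspace(0.0, T, n_steps)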
    
    X = np.zeros((n_paths, n_steps))
    
    for i in range(n_paths):
        dW = np.sqrt(dt) * np.random.randn(n_steps - 1)
        for j in range(n_steps - 1):
            X[i, j+1] = X[i, j] + mu * dt + sigma * dW[j]
    
    return t, X

# Simulate
mu, sigma = 0.1, 0.5
T, dt = 5.0, 0.01
t, X = simulate_brownian_motion(mu, sigma, T, dt, n_paths=10)

# Plot
plt.figure(figsize=(12, 5))
for i in range(X.shape[0]):
    plt.plot(t, X[i, :], alpha=0.6, linewidth=1.5)
plt.axhline(0, color='black', linestyle='--', linewidth=1, alpha=0.5)
plt.xlabel('Time $t$', fontsize=12)
plt.ylabel('State $X_t$', fontsize=12)
plt.title(f'Brownian Motion with Drift: $dX_t = {mu} dt + {sigma} dW_t$', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Simulated {X.shape[0]} paths over time [0, {T}]")
print(f"Mean at T={T}: {np.mean(X[:, -1]):.3f} (theoretical: {mu*T:.3f})")
print(f"Std at T={T}: {np.std(X[:, -1]):.3f} (theoretical: {sigma*np.sqrt(T):.3f})")

**Observation:** The paths fluctuate randomly around the deterministic trend $\mu t$. The spread grows with time as randomness accumulates: the standard deviation of $X_t$ is $\sigma \sqrt{t}$.
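With only a handful of paths the sample mean and standard deviation are noisy. The sketch below repeats the experiment with more paths and compares the sample statistics at a few times against $\mu t$ and $\sigma\sqrt{t}$. The path count, the seed, and all names ending in `_chk` are assumptions made for this check.

```python
import numpy as np

# Assumed values: same mu, sigma, T, dt as the cell above, but many more paths
mu_chk, sigma_chk, T_chk, dt_chk = 0.1, 0.5, 5.0, 0.01
n_steps_chk, n_paths_chk = int(round(T_chk / dt_chk)), 5000
rng_chk = np.random.default_rng(0)

# Euler-Maruyama increments; a cumulative sum along time gives X at t = dt, 2dt, ..., T
dW_chk = np.sqrt(dt_chk) * rng_chk.standard_normal((n_paths_chk, n_steps_chk))
X_chk = np.cumsum(mu_chk * dt_chk + sigma_chk * dW_chk, axis=1)

check_times = (1.0, 2.5, 5.0)
stats_chk = {}
for t_c in check_times:
    col = int(round(t_c / dt_chk)) - 1          # column holding X at time t_c
    stats_chk[t_c] = (X_chk[:, col].mean(), X_chk[:, col].std())
    print(f"t={t_c}: mean {stats_chk[t_c][0]:.3f} (theory {mu_chk * t_c:.3f}), "
          f"std {stats_chk[t_c][1]:.3f} (theory {sigma_chk * np.sqrt(t_c):.3f})")
```

The remaining gap to the theoretical values is sampling error, which shrinks like $1/\sqrt{N}$ in the number of paths $N$.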

## 2. Controlled SDEs and Admissible Controls

### 2.1 Control Processes

A **control process** $u = (u_t)_{t \geq 0}$ is a stochastic process that influences the drift of the SDE.

**Admissibility:** A control $u$ is admissible if:
1. **Adapted:** $u_t$ depends only on information available up to time $t$, i.e. it is adapted to the filtration generated by $W$ (no peeking into the future)
2. **Integrability:** $\mathbb{E}\left[\int_0^T |u_t|^2 dt\right] < \infty$

We denote the set of admissible controls by $\mathcal{U}_{ad}$.

### 2.2 Feedback Controls

A particularly important class is **Markov feedback controls**:

$$
u_t = \alpha(t, X_t)
$$

where $\alpha: [0,T] \times \mathbb{R}^d \to \mathbb{R}^m$ is a deterministic function.

**Why feedback?** 
- The control depends only on current state $X_t$, not the entire history
- Easier to implement in practice
- Optimal controls often have this form (verification theorem)

### 2.3 Example: Controlled Brownian Motion

Consider:

$$
dX_t = u_t \, dt + \sigma \, dW_t
$$

The drift is **directly controlled**. Let's compare different control strategies.

In [None]:
def simulate_controlled_bm(u_func, sigma, T, dt, x0=0, seed=42):
    """
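    Simulate controlled Brownian motion: dX_t = u_t dt + σ dW_t

    Args:
        u_func: Feedback control u(t, x), called with scalar t and x
        sigma: Constant diffusion coefficient
        T: Time horizon
        dt: Time step
        x0: Initial state
        seed: Seed for NumPy's global random number generator

    Returns:
        t, X, u: Time grid, state path, and applied control, each of shape (n_steps,)
    """
    np.random.seed(seed)

    # Build the grid from an integer step count to avoid float-step rounding in np.arange
    n_steps = int(round(T / dt)) + 1
    t = np.linspace(0.0, T, n_steps)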
    
    X = np.zeros(n_steps)
    u = np.zeros(n_steps)
    X[0] = x0
    
    dW = np.sqrt(dt) * np.random.randn(n_steps - 1)
    
    for i in range(n_steps - 1):
        u[i] = u_func(t[i], X[i])
        X[i+1] = X[i] + u[i] * dt + sigma * dW[i]
    
    u[-1] = u_func(t[-1], X[-1])
    
    return t, X, u

# Define different control strategies
sigma = 0.5
T, dt = 5.0, 0.01

# Strategy 1: Zero control (do nothing)
u_zero = lambda t, x: 0.0

# Strategy 2: Constant positive drift
u_const = lambda t, x: 0.5

# Strategy 3: Proportional feedback (stabilizing)
u_feedback = lambda t, x: -0.5 * x

# Simulate
t, X_zero, u_z = simulate_controlled_bm(u_zero, sigma, T, dt)
t, X_const, u_c = simulate_controlled_bm(u_const, sigma, T, dt)
t, X_feedback, u_f = simulate_controlled_bm(u_feedback, sigma, T, dt)

# Plot trajectories
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(t, X_zero, label='Zero control', linewidth=2)
axes[0].plot(t, X_const, label='Constant drift', linewidth=2)
axes[0].plot(t, X_feedback, label='Feedback control', linewidth=2)
axes[0].axhline(0, color='black', linestyle='--', linewidth=1, alpha=0.5)
axes[0].set_xlabel('Time $t$', fontsize=12)
axes[0].set_ylabel('State $X_t$', fontsize=12)
axes[0].set_title('State Trajectories', fontsize=13)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

axes[1].plot(t, u_z, label='Zero control', linewidth=2)
axes[1].plot(t, u_c, label='Constant drift', linewidth=2)
axes[1].plot(t, u_f, label='Feedback control', linewidth=2)
axes[1].axhline(0, color='black', linestyle='--', linewidth=1, alpha=0.5)
axes[1].set_xlabel('Time $t$', fontsize=12)
axes[1].set_ylabel('Control $u_t$', fontsize=12)
axes[1].set_title('Control Signals', fontsize=13)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Control Strategies Comparison:")
print(f"Zero control: Final state = {X_zero[-1]:.3f}")
print(f"Constant drift: Final state = {X_const[-1]:.3f}")
print(f"Feedback control: Final state = {X_feedback[-1]:.3f}")

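**Observation:**
- Zero control: Driftless Brownian motion scaled by $\sigma$, which wanders with no restoring force
- Constant drift: Systematic upward trend $0.5\,t$ with the same fluctuations superimposed
- Feedback control: Stays near the origin (mean-reverting behavior)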

The feedback control $u_t = -0.5 X_t$ creates a restoring force toward zero.
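With $u_t = -k X_t$ the closed-loop equation is $dX_t = -k X_t \, dt + \sigma \, dW_t$, an Ornstein-Uhlenbeck process. Starting from $X_0 = 0$ its variance is $\frac{\sigma^2}{2k}\left(1 - e^{-2kt}\right)$, which stays bounded, whereas the uncontrolled variance $\sigma^2 t$ keeps growing. The sketch below checks this by Monte Carlo; the path count, the seed, and the `_ou` names are assumptions.

```python
import numpy as np

k_ou, sigma_ou = 0.5, 0.5            # feedback gain and noise level used above
T_ou, dt_ou = 5.0, 0.01
n_steps_ou = int(round(T_ou / dt_ou))
rng_ou = np.random.default_rng(0)

X_ou = np.zeros(5000)                # 5000 independent paths, all starting at 0
for _ in range(n_steps_ou):
    dW_ou = np.sqrt(dt_ou) * rng_ou.standard_normal(X_ou.shape)
    X_ou = X_ou + (-k_ou * X_ou) * dt_ou + sigma_ou * dW_ou   # Euler-Maruyama step

var_mc = X_ou.var()
var_exact = sigma_ou**2 / (2 * k_ou) * (1 - np.exp(-2 * k_ou * T_ou))
var_free = sigma_ou**2 * T_ou        # variance at T with zero control

print(f"Var[X_T] with feedback: {var_mc:.4f} (closed form {var_exact:.4f})")
print(f"Var[X_T] with zero control (closed form): {var_free:.4f}")
```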

## 3. Cost Functionals

### 3.1 Definition

To formulate an optimal control problem, we need a **cost functional** that quantifies performance.

The standard form is:

$$
J(u) = \mathbb{E}\left[ \int_0^T L(X_t, u_t) \, dt + g(X_T) \right]
$$

where:
- $L(x, u)$ is the **running cost** (instantaneous cost at each time)
- $g(x)$ is the **terminal cost** (cost at final time $T$)
- The expectation is over all randomness (Brownian motion)

### 3.2 Interpretation

- **Running cost** $L(x, u)$: Penalizes undesirable states and expensive controls
  - Example: $L(x, u) = \frac{1}{2}(q x^2 + r u^2)$ penalizes large states and large controls
  
- **Terminal cost** $g(x)$: Penalizes ending far from target
  - Example: $g(x) = \frac{1}{2} q_T (x - x_{target})^2$

### 3.3 Quadratic Cost (LQR)

The most important example is the quadratic cost of the **Linear-Quadratic Regulator (LQR)**, which pairs it with dynamics that are linear in $x$ and $u$:

$$
L(x, u) = \frac{1}{2}(q x^2 + r u^2), \quad g(x) = \frac{1}{2} q_T x^2
$$

where $q, r, q_T > 0$ are weights.

**Interpretation:**
- $q$: Penalty on state deviation (want $x$ small)
- $r$: Penalty on control effort (want $u$ small)
- $q_T$: Terminal penalty (want $x_T$ small)

**Tradeoff:** Large $r$ means expensive control → accept larger state deviations
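The tradeoff can be made concrete for $dX_t = u_t \, dt + \sigma \, dW_t$ under linear feedback $u_t = -k X_t$. By Itô's formula the second moment $m(t) = \mathbb{E}[X_t^2]$ solves $m' = -2k m + \sigma^2$, so the expected cost is $J = \frac{1}{2}(q + r k^2) \int_0^T m(t) \, dt + \frac{1}{2} q_T \, m(T)$ in closed form. The sketch below scans a grid of gains for several values of $r$. The function name, the gain grid, and the parameter values are assumptions, and restricting to constant gains is a simplification (the true optimum may use a time-varying gain).

```python
import numpy as np

def expected_cost_linear_feedback(k, q, r, q_T, sigma, T, x0=0.0):
    """Exact expected quadratic cost for dX = u dt + sigma dW under u = -k X, k > 0."""
    m_inf = sigma**2 / (2 * k)                 # long-run second moment
    decay = np.exp(-2 * k * T)
    m_T = m_inf + (x0**2 - m_inf) * decay      # E[X_T^2]
    int_m = m_inf * T + (x0**2 - m_inf) * (1 - decay) / (2 * k)   # integral of E[X_t^2]
    return 0.5 * (q + r * k**2) * int_m + 0.5 * q_T * m_T

gains = np.linspace(0.05, 10.0, 400)           # candidate constant gains (assumed grid)
best_gain = {}
for r_val in (0.1, 1.0, 10.0):
    costs = expected_cost_linear_feedback(gains, q=1.0, r=r_val, q_T=10.0,
                                          sigma=0.5, T=5.0)
    best_gain[r_val] = gains[np.argmin(costs)]
    print(f"r = {r_val:5.1f}: best constant gain k = {best_gain[r_val]:.2f}, "
          f"J = {costs.min():.3f}")
```

As $r$ grows, the cost-minimizing gain on this grid shrinks: control becomes expensive, so the controller intervenes less and tolerates larger state deviations.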

In [None]:
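def compute_cost(t, X, u, q, r, q_T, dt):
    """
    Compute the quadratic cost of a single sample path:
    ∫(1/2)(qx² + ru²)dt + (1/2)q_T x_T²

    J(u) is the expectation of this quantity over paths. The argument t is
    unused and kept only so that existing calls keep working.
    """
    # NumPy 2.0 added np.trapezoid and deprecated np.trapz; use whichever exists
    trapezoid = getattr(np, "trapezoid", None) or np.trapz

    # Running cost (trapezoidal rule)
    running_cost = 0.5 * (q * X**2 + r * u**2)
    integral = trapezoid(running_cost, dx=dt)

    # Terminal cost
    terminal_cost = 0.5 * q_T * X[-1]**2

    return integral + terminal_cost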

# Compute costs for different strategies
q, r, q_T = 1.0, 1.0, 10.0

cost_zero = compute_cost(t, X_zero, u_z, q, r, q_T, dt)
cost_const = compute_cost(t, X_const, u_c, q, r, q_T, dt)
cost_feedback = compute_cost(t, X_feedback, u_f, q, r, q_T, dt)

print("Cost Comparison (q=1, r=1, q_T=10):")
print(f"Zero control:     J = {cost_zero:.4f}")
print(f"Constant drift:   J = {cost_const:.4f}")
print(f"Feedback control: J = {cost_feedback:.4f}")
print(f"\nBest strategy: {'Feedback' if cost_feedback < min(cost_zero, cost_const) else 'Other'}")

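**Observation:** On this sample path (all three strategies share the same seed, hence the same Brownian increments), the feedback control achieves lower cost by balancing state deviation and control effort. The printed values are single-path costs, not the expectation $J(u)$; estimating $J(u)$ requires averaging over many paths.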
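Because $J(u)$ is an expectation, a single path gives only one noisy sample of it. A minimal Monte Carlo sketch, assuming $X_0 = 0$ and the same three strategies and weights as above, averages the pathwise cost over many independent paths. The function `estimate_expected_cost`, the path count, and the seed are assumptions made for illustration.

```python
import numpy as np

def estimate_expected_cost(policy, q, r, q_T, sigma, T, dt, n_paths=20_000, seed=0):
    """Monte Carlo estimate of J(u) for dX = u dt + sigma dW with X_0 = 0.

    policy(t, x) must accept an array x of shape (n_paths,).
    Returns (mean cost, standard error of the mean).
    """
    rng = np.random.default_rng(seed)
    n_steps = int(round(T / dt))
    X = np.zeros(n_paths)
    cost = np.zeros(n_paths)
    for n in range(n_steps):
        u = policy(n * dt, X)
        cost += 0.5 * (q * X**2 + r * u**2) * dt      # left-endpoint rule for running cost
        X = X + u * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)
    cost += 0.5 * q_T * X**2                          # terminal cost
    return cost.mean(), cost.std(ddof=1) / np.sqrt(n_paths)

policies = {
    "Zero control": lambda t, x: np.zeros_like(x),
    "Constant drift": lambda t, x: np.full_like(x, 0.5),
    "Feedback control": lambda t, x: -0.5 * x,
}

J_mc = {}
for name, pol in policies.items():
    J_mc[name], se = estimate_expected_cost(pol, q=1.0, r=1.0, q_T=10.0,
                                            sigma=0.5, T=5.0, dt=0.01)
    print(f"{name:17s} J ≈ {J_mc[name]:.3f} ± {se:.3f}")
```

For zero control the exact value is available as a check: $\mathbb{E}[X_t^2] = \sigma^2 t$ gives $J = \frac{1}{2} q \sigma^2 \frac{T^2}{2} + \frac{1}{2} q_T \sigma^2 T = 7.8125$ for these parameters.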

## 4. The Optimal Control Problem

### 4.1 Problem Statement

Given:
- Controlled SDE: $dX_t = b(X_t, u_t) dt + \sigma(X_t) dW_t$
- Cost functional: $J(u) = \mathbb{E}\left[\int_0^T L(X_t, u_t) dt + g(X_T)\right]$

**Find:** The optimal control $u^* \in \mathcal{U}_{ad}$ that minimizes the cost:

$$
u^* = \arg\min_{u \in \mathcal{U}_{ad}} J(u)
$$

### 4.2 Value Function

The **value function** is the minimum achievable cost starting from state $x$ at time $t$:

$$
V(t, x) = \inf_{u \in \mathcal{U}_{ad}} \mathbb{E}_{t,x}\left[ \int_t^T L(X_s, u_s) ds + g(X_T) \right]
$$

where $\mathbb{E}_{t,x}$ denotes expectation conditioned on $X_t = x$.

**Key Properties:**
1. $V(T, x) = g(x)$ (terminal condition)
2. $V(0, x_0)$ is the optimal cost starting from $x_0$
3. The optimal control can be extracted from $V$ (feedback form)

### 4.3 Intuition

Think of $V(t, x)$ as:
- **"Cost-to-go"**: How much it will cost (on average) to run optimally from $(t, x)$ to the end
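- **Depends on both time and state**: Worse states typically have a higher cost-to-go, and the remaining time $T - t$ determines how much running cost can still accrue
- **Reduces to the terminal cost at $T$**: As $t \to T$ the running-cost integral vanishes and only $g(X_T)$ remains, so $V(T, x) = g(x)$; in general $V$ need not be monotone in $t$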
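To make "cost-to-go" concrete, the sketch below evaluates the expected cost from $(t, x)$ of one fixed policy, $u_s = -k X_s$, for $dX_s = u_s \, ds + \sigma \, dW_s$ with the quadratic cost above. It uses the second moment $m(s) = \mathbb{E}_{t,x}[X_s^2]$, which solves $m' = -2km + \sigma^2$ with $m(t) = x^2$. Since the policy is fixed rather than optimal, the result is only an upper bound on $V(t, x)$. The function name and parameter values are assumptions.

```python
import numpy as np

def policy_cost_to_go(t, x, k, q, r, q_T, sigma, T):
    """Expected cost from (t, x) for dX = u ds + sigma dW under the fixed policy u = -k X."""
    tau = T - t                          # remaining time
    m_inf = sigma**2 / (2 * k)           # long-run second moment
    decay = np.exp(-2 * k * tau)
    m_T = m_inf + (x**2 - m_inf) * decay
    int_m = m_inf * tau + (x**2 - m_inf) * (1 - decay) / (2 * k)
    return 0.5 * (q + r * k**2) * int_m + 0.5 * q_T * m_T

params = dict(k=0.5, q=1.0, r=1.0, q_T=10.0, sigma=0.5, T=5.0)
x_values = (0.0, 1.0, 2.0)
for t_val in (0.0, 2.5, 5.0):
    row = [policy_cost_to_go(t_val, x_val, **params) for x_val in x_values]
    print(f"t = {t_val}: " + ", ".join(f"x={x_val}: {c:.3f}"
                                       for x_val, c in zip(x_values, row)))
```

At $t = T$ the function returns exactly $g(x) = \frac{1}{2} q_T x^2$, matching the terminal condition, and at any fixed $t$ the cost grows with $|x|$.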

## 5. The Dynamic Programming Principle

### 5.1 Statement

The **Dynamic Programming Principle (DPP)** is the cornerstone of stochastic control theory.

**Informal Statement:** An optimal policy has the property that, no matter what the initial state and decision, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

**Formal Statement:** For any $0 \leq t \leq s \leq T$:

$$
V(t, x) = \inf_{u \in \mathcal{U}_{ad}} \mathbb{E}_{t,x}\left[ \int_t^s L(X_r, u_r) dr + V(s, X_s) \right]
$$

### 5.2 Interpretation

The DPP says:
1. **Break the problem into pieces**: Cost from $t$ to $T$ = Cost from $t$ to $s$ + Cost from $s$ to $T$
2. **Optimize both pieces**: The control must be optimal on both $[t, s]$ and $[s, T]$
3. **Value function appears recursively**: $V(s, X_s)$ is the cost-to-go from intermediate state

This is analogous to Bellman's principle in discrete-time dynamic programming.
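The discrete-time analogue can be sketched on a small grid: a controlled random walk $X_{n+1} = X_n + u_n + \xi_n$ with $\xi_n = \pm 1$ equally likely, solved by backward induction. The grid size, horizon, control set, weights, and the clipping of states at the grid boundary are all assumptions chosen to keep the example small; this is not a discretization of the SDE above.

```python
import numpy as np

# Illustrative discrete problem (all values are assumptions for this sketch)
M, N = 10, 15                        # states -M..M, horizon of N steps
states = np.arange(-M, M + 1)
controls = np.array([-1, 0, 1])
q_d, r_d, qT_d = 1.0, 1.0, 10.0

def step_index(u, xi):
    """Grid indices of the next state x + u + xi, clipped to stay on the grid."""
    return np.clip(states + u + xi, -M, M) + M

def expected_next(V_next, u):
    """E[V_next(X_{n+1}) | X_n = x, u] for noise xi = +1 or -1 with probability 1/2."""
    return 0.5 * (V_next[step_index(u, -1)] + V_next[step_index(u, 1)])

def running(u):
    """One-step running cost for every grid state under control u."""
    return 0.5 * (q_d * states**2 + r_d * u**2)

# Backward induction: V_N = g, then V_n = min_u { running cost + E[V_{n+1}] }
V_dp = np.zeros((N + 1, states.size))
V_dp[N] = 0.5 * qT_d * states**2
policy_dp = np.zeros((N, states.size), dtype=int)
for n in range(N - 1, -1, -1):
    Q = np.stack([running(u) + expected_next(V_dp[n + 1], u) for u in controls])
    V_dp[n] = Q.min(axis=0)
    policy_dp[n] = controls[Q.argmin(axis=0)]

# Cost of the "do nothing" policy, evaluated with the same recursion but no minimization
V_zero = np.zeros_like(V_dp)
V_zero[N] = V_dp[N]
for n in range(N - 1, -1, -1):
    V_zero[n] = running(0) + expected_next(V_zero[n + 1], 0)

print(f"Cost-to-go from x=0 at n=0: optimal {V_dp[0, M]:.3f}, zero control {V_zero[0, M]:.3f}")
```

Each backward step is the DPP over a single time step: `V_dp[n + 1]` plays the role of $V(s, X_s)$, and the minimization over `controls` plays the role of the infimum over $u$.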

### 5.3 Infinitesimal Version

Taking $s = t + h$ for small $h > 0$:

$$
V(t, x) = \inf_{u} \mathbb{E}_{t,x}\left[ \int_t^{t+h} L(X_r, u_r) dr + V(t+h, X_{t+h}) \right]
$$

Expanding and taking $h \to 0$ leads to the **Hamilton-Jacobi-Bellman equation** (next notebook).
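A formal sketch of that expansion, assuming $V$ is once continuously differentiable in $t$ and twice in $x$, and holding the control at a constant value $u$ on $[t, t+h]$: Itô's formula gives

$$
\mathbb{E}_{t,x}\left[ V(t+h, X_{t+h}) \right] = V(t,x) + h \left( \partial_t V + b(x,u) \cdot \nabla_x V + \tfrac{1}{2} \operatorname{tr}\left( \sigma(x) \sigma(x)^\top \nabla_x^2 V \right) \right)(t,x) + o(h)
$$

and the running cost contributes $\mathbb{E}_{t,x}\left[ \int_t^{t+h} L(X_r, u_r) \, dr \right] = h \, L(x,u) + o(h)$. Substituting both into the identity above, cancelling $V(t,x)$, dividing by $h$, and letting $h \to 0$ yields, formally,

$$
0 = \partial_t V(t,x) + \inf_{u} \left\{ L(x,u) + b(x,u) \cdot \nabla_x V(t,x) + \tfrac{1}{2} \operatorname{tr}\left( \sigma(x) \sigma(x)^\top \nabla_x^2 V(t,x) \right) \right\}
$$

The smoothness assumption and the interchange of limit and infimum are taken for granted in this sketch.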

### 5.4 Why DPP Matters

1. **Reduces an infinite-dimensional problem**: Instead of optimizing over all admissible control processes $u \in \mathcal{U}_{ad}$, we solve a PDE for $V(t,x)$
2. **Enables feedback synthesis**: The optimal control is $u^*(t,x) = \arg\min_u \{\text{Hamiltonian}\}$, evaluated pointwise in $(t,x)$
3. **Provides verification**: If we guess a candidate $V$, we can check that it is the value function and that the associated feedback control is optimal (verification theorem)

## 6. Summary and Next Steps

### Key Concepts Introduced

1. **Controlled SDEs**: $dX_t = b(X_t, u_t) dt + \sigma(X_t) dW_t$
   - Drift $b$ is influenced by control $u_t$
   - Diffusion $\sigma$ represents uncontrollable randomness

2. **Cost Functional**: $J(u) = \mathbb{E}\left[\int_0^T L(X_t, u_t) dt + g(X_T)\right]$
   - Running cost $L$ penalizes states and controls
   - Terminal cost $g$ penalizes final state

3. **Value Function**: $V(t,x) = \inf_{u \in \mathcal{U}_{ad}} \mathbb{E}_{t,x}\left[\int_t^T L(X_s, u_s) ds + g(X_T)\right]$
   - Minimum achievable cost from $(t,x)$
   - Central object of study

4. **Dynamic Programming Principle**:
   - Recursive structure of optimal control
   - Foundation for deriving HJB equation

### What We've Learned

- Different control strategies lead to different costs
- Feedback controls can stabilize stochastic systems
- The value function encodes optimal cost-to-go
- DPP provides a recursive characterization

### Next Notebook

In **Notebook 2**, we will:
1. Derive the Hamilton-Jacobi-Bellman equation from the DPP
2. Understand it as a nonlinear PDE
3. Interpret the Hamiltonian and optimality conditions
4. Discuss viscosity solutions

This will transform the infinite-dimensional optimization problem into a PDE that we can solve numerically.

## References

1. Fleming, W. H., & Soner, H. M. (2006). *Controlled Markov Processes and Viscosity Solutions*. Springer.
2. Yong, J., & Zhou, X. Y. (1999). *Stochastic Controls: Hamiltonian Systems and HJB Equations*. Springer.
3. Pham, H. (2009). *Continuous-time Stochastic Control and Optimization with Financial Applications*. Springer.