# Week 2 - Notes



These are some concise notes representing the most important ideas discussed.

--------

## Part 1: MDP for Discrete-Time BS Model

### MDP Formulation

- We aim to reformulate the discrete-time BSM as an MDP, where the *system* being controlled is a hedge portfolio and the *control* is a stock position.

- We solve the problem by a **sequential maximization of rewards** (the negative of the one-step variance of the hedge portfolio, multiplied by the risk aversion $\lambda$, plus a drift term).

- *In principle*, we can consider either discrete state or continuous state problems. Continuous state formulation is practically irrelevant.


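- **State Variable**: instead of the stock price $S_t$, we shall use $X_t$, a transformed function of $S_t$ chosen so that $dX_t$ is a standard Brownian Motion scaled by the volatility $\sigma$, i.e. $dX_t = \sigma dW_t$. For lognormal dynamics this is achieved by $X_t = -(\mu - \frac{\sigma^2}{2})t + \log S_t$.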

- The actual hedging decisions $a_t(X_t)$ are determined by a time-dependent policy $\pi(t, X_t)$. We consider such a policy deterministic.

- A deterministic policy $\pi$ maps the state $X_t$ and time $t$ into an action $a_t = \pi(t, X_t)$.

- **Value function** $V_t$ is defined as the negative of the option price. The equations are covered in more detail in the corresponding notebook.


- **Bellman Equation** for this model is
$$V_t^\pi(X_t) =  \mathbb{E}_t^\pi [R(X_t, a_t, X_{t+1}) + \gamma V_{t+1}^\pi(X_{t+1})] $$

- Here $R(X_t, a_t, X_{t+1})$ is a one-step time-dependent random reward. 
- The expected reward in the present MDP is quadratic in action $a_t$.

**N.B.** when $\lambda \to 0$, the expected reward becomes linear in $a_t$, so it does NOT have a maximum, i.e. there is no risk aversion.
In this framework, quadratic risk is incorporated into a standard (risk-neutral) MDP formulation.
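
The quadratic dependence on $a_t$ can be checked numerically. The sketch below is only an illustration: it assumes the portfolio relation $\Pi_t = \gamma(\Pi_{t+1} - a_t \Delta S_t)$ with $\Delta S_t = S_{t+1} - e^{r\Delta t} S_t$ (not spelled out in these notes), a one-step put payoff, and made-up parameter values. The names `expected_reward` and `curvature` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative one-step setup; every number here is an assumption.
S0, K, mu, r, sigma, dt = 100.0, 100.0, 0.05, 0.03, 0.2, 1.0 / 12
gamma = np.exp(-r * dt)  # one-step discount factor

# Simulate next-step prices under lognormal dynamics.
z = rng.standard_normal(50_000)
S1 = S0 * np.exp((mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z)

dS = S1 - S0 / gamma               # Delta S_t = S_{t+1} - e^{r dt} S_t
Pi_next = np.maximum(K - S1, 0.0)  # next-step portfolio value: put payoff


def expected_reward(a, lam):
    """Sample estimate of E[gamma * a * dS] - lam * Var[Pi_t]."""
    Pi_t = gamma * (Pi_next - a * dS)  # assumed portfolio relation
    return gamma * a * dS.mean() - lam * Pi_t.var()


actions = np.linspace(-1.0, 0.0, 5)  # equally spaced grid of stock positions
curvature = {}
for lam in (0.0, 0.01):
    rewards = np.array([expected_reward(a, lam) for a in actions])
    # Second differences: constant and negative for a concave quadratic,
    # zero for a straight line.
    curvature[lam] = np.diff(rewards, n=2)
    print(f"lambda={lam}: second differences = {curvature[lam]}")
```

With $\lambda > 0$ the second differences are constant and negative (a concave quadratic with a maximum), while with $\lambda = 0$ they vanish, which is the linear case described in the N.B. above.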



### Action-Value Function

- As usual, the optimal policy $\pi_t^*$ is the policy that maximizes the value function $V_t^\pi$.
- Needless to say, the optimal value fn also satisfies the Bellman Optimality Equation.

- If the system dynamics are known, the equation can be solved using Dynamic Programming.
- If the system dynamics are unknown, the optimal policy should be computed using *samples*.


- The **Action-Value Fn** is defined by an expectation of the same expression as in the definition of the value fn, but conditioned on both the current state $X_t$ and the initial action $a=a_t$, while following policy $\pi$ afterwards.

- **Bellman Eq** for Q-fn: $$Q_t^\pi(x,a) =  \mathbb{E}_t[R(X_t, a_t, X_{t+1}) \mid x,a] + \gamma \mathbb{E}_t^\pi [V_{t+1}^\pi(X_{t+1}) \mid x] $$


- **Optimal Value fn** and **Optimal Q-fn** are related as $V_t^*(x)=\max_a Q_t^*(x,a)$

- A *greedy* policy always maximizes the current-time Q-fn.
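
As a toy illustration of the relation $V_t^*(x)=\max_a Q_t^*(x,a)$ and of a greedy policy, here is a minimal sketch on a small discrete table. The table `Q_star`, the action grid and the helper `greedy_action_index` are hypothetical and carry no financial meaning.

```python
# Hypothetical optimal Q-values at one time step:
# rows are states, columns are actions.
Q_star = [
    [-5.0, -3.2, -4.1],
    [-2.7, -2.9, -1.8],
    [-0.6, -1.4, -2.2],
]
actions = [-1.0, -0.5, 0.0]  # illustrative stock positions


def greedy_action_index(q_row):
    """Index of the action with the largest Q-value in one state."""
    return max(range(len(q_row)), key=lambda j: q_row[j])


# Optimal value fn: maximum of the optimal Q-fn over actions, state by state.
V_star = [max(row) for row in Q_star]

# Greedy policy: in each state, pick the action that attains that maximum.
greedy_policy = [actions[greedy_action_index(row)] for row in Q_star]

print(V_star)         # [-3.2, -1.8, -0.6]
print(greedy_policy)  # [-0.5, 0.0, -1.0]
```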


### Optimal Action from Q-Fn


- The BS results can be recovered from the DP formulation in the limit $\lambda \to 0 $ and $\Delta t \to 0 $.

- In the **DP** model, hedging comes ahead of pricing. In the **BS** model, it is the other way around.


- The quadratic hedging of the **discrete-time BSM** model only looks at the risk of a hedge portfolio. Here, however, the expected reward has both a drift part and a variance part, similar to the Markowitz risk-adjusted portfolio return analysis.

- *Hence,* the optimal hedge in an **MDP** model differs from the optimal hedge in the **risk minimization** approach. The former's objective fn is based on a risk-adjusted return, while the latter's is based purely on risk.

- For a pure risk-focused quadratic hedge, we can set $\mu = r$ or $\lambda \to \infty$ in the $a_t^*(X_t)$ eq (48). In general, this eq gives hedges that can be applied to both hedging and investment with options.
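
The effect of $\lambda$ on the hedge can be sketched with a sample-based version of the optimal hedge formula from the Basis Functions section in Part 2. This is an illustration under assumptions: the drift term is taken to be scaled by $\frac{1}{2\gamma\lambda}$ (which is what differentiating a reward of the form $\gamma a_t \Delta S_t - \lambda Var_t[\Pi_t]$ gives), all paths start from a single state, and the parameter values and the function name `optimal_hedge` are hypothetical.

```python
import numpy as np


def optimal_hedge(dS, Pi_next, gamma, lam):
    """Sample estimate of the optimal hedge for one state.

    The 1 / (2 * gamma * lam) scaling of the drift term is an assumption.
    """
    dS_hat = dS - dS.mean()
    Pi_hat = Pi_next - Pi_next.mean()
    numerator = np.mean(dS_hat * Pi_hat + dS / (2.0 * gamma * lam))
    return numerator / np.mean(dS_hat**2)


rng = np.random.default_rng(1)
S0, K, mu, r, sigma, dt = 100.0, 100.0, 0.05, 0.03, 0.2, 1.0 / 12  # illustrative
gamma = np.exp(-r * dt)

z = rng.standard_normal(100_000)
S1 = S0 * np.exp((mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z)
dS = S1 - S0 / gamma               # Delta S_t = S_{t+1} - e^{r dt} S_t
Pi_next = np.maximum(K - S1, 0.0)  # put payoff one step ahead

a_risk_adjusted = optimal_hedge(dS, Pi_next, gamma, lam=0.01)
a_pure_risk = optimal_hedge(dS, Pi_next, gamma, lam=1e12)  # lambda -> infinity

# Plain minimum-variance hedge: Cov(dS, Pi_next) / Var(dS).
a_min_var = np.mean((dS - dS.mean()) * (Pi_next - Pi_next.mean())) / np.var(dS)

print(a_risk_adjusted, a_pure_risk, a_min_var)
```

For very large $\lambda$ the drift term drops out and the result coincides with the plain minimum-variance hedge. Setting $\mu = r$ has the same effect in expectation, since then $\mathbb{E}_t[\Delta S_t] = 0$.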

### Backward Recursion for Q-Star

- Since we have the analytical formula for optimal action $a_t^*$, we can do the backward recursion from $T-1$ to $0$ directly.

- **N.B.** as the backward recursion is applied directly to the optimal Q-fn $Q_t(S_t, a_t^*)$, neither continuous nor discrete action space representation is required in our setting, as the action in this equation is always just one optimal action.


- The *ask* price is the negative of the Q-fn: $$C_t^{ask}(S_t) = - Q_t(S_t, a_t^*)$$


- In the **DP** formulation, both optimal price and hedge are parts of the same value $Q_t^*(X_t,a_t)$. In **BS** model, we have 2 separate formulas for the price and the hedge.

- In **DP**, hedging comes ahead of pricing, while in **BS** it is the other way around.


- Vanishing optimization:
    - With $\lambda>0$ and $\Delta t>0$, the **DP** setting has a quadratic objective fn, and the link between price and hedge is explicit.
    - If $\lambda \to 0$ while $\Delta t>0$, there is no quadratic optimization in the DP sense anymore.
    - If both $\lambda=0$ and $\Delta t=0$, then risk is lost, i.e. there is nothing left to optimize.


**N.B.** having both $\lambda>0$ and $\Delta t>0$, the DP method gives a consistent hedging and pricing scheme that takes into account residual risk that persists in options under discrete hedging.
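
For reference, the backward recursion can be sketched in equation form. Two ingredients below are assumptions taken from the usual setup of this model rather than from these notes: the portfolio relation $\Pi_t = \gamma(\Pi_{t+1} - a_t \Delta S_t)$, under which the one-step reward reads $R_t = \gamma a_t \Delta S_t - \lambda Var_t[\Pi_t]$, and the terminal condition.

$$Q_t^*(X_t, a_t^*) = \gamma \mathbb{E}_t[Q_{t+1}^*(X_{t+1}, a_{t+1}^*) + a_t^* \Delta S_t] - \lambda Var_t[\Pi_t], \quad t = T-1, ..., 0$$

$$Q_T^*(X_T, a_T = 0) = -\Pi_T(X_T) - \lambda Var[\Pi_T(X_T)]$$

The first line is the Bellman optimality equation with the reward substituted in and the maximum over $a_{t+1}$ replaced by the known optimal action. The second line starts the recursion at maturity, where no further hedging takes place.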

-------
## Part 2: Monte Carlo Solution

### Basis Functions

- The **optimal hedge**: $$a_t^*(X_t)= \frac{\mathbb{E}_t [\Delta \hat{S}_t \hat{\Pi}_{t+1} + \frac{1}{2\gamma\lambda}\Delta S_t]}{\mathbb{E}_t [(\Delta \hat{S}_t)^2] }$$

- For the discrete state formulation, there is a finite set of nodes $\{X_n\}_{n=1}^M$, with values $Q_n$ of the optimal Q-function at these nodes. $$Q(X) = \sum_{n=1}^M Q_n \delta_{X, X_n}$$

- We shall replace the Kronecker symbol with a "one-hot" basis function $\Phi_n(X)$, which equals 1 when $X=X_n$ and 0 elsewhere.


- For our convenience now, we shall be using **B-Splines** as Basis Functions. B-Splines are non-negative, integrate to one, and $B_{i,n}$ is only non-zero on the interval $[x_i, x_{i+n+1}]$

**N.B.** A cubic B-spline is a piecewise polynomial of degree three with local support.
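
The local support and non-negativity of B-splines can be seen from the Cox-de Boor recursion. The sketch below is a plain-Python illustration on a uniform knot vector chosen for convenience. The function name `bspline_basis` and the knots are assumptions, not the notebook's implementation.

```python
def bspline_basis(i, n, knots, x):
    """Value at x of the i-th B-spline of degree n (Cox-de Boor recursion)."""
    if n == 0:
        # Degree 0: indicator of the half-open knot interval.
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = right = 0.0
    if knots[i + n] > knots[i]:
        left = ((x - knots[i]) / (knots[i + n] - knots[i])
                * bspline_basis(i, n - 1, knots, x))
    if knots[i + n + 1] > knots[i + 1]:
        right = ((knots[i + n + 1] - x) / (knots[i + n + 1] - knots[i + 1])
                 * bspline_basis(i + 1, n - 1, knots, x))
    return left + right


degree = 3
knots = [float(k) for k in range(9)]  # uniform knots 0, 1, ..., 8 (illustrative)
num_basis = len(knots) - degree - 1   # 5 cubic B-splines

x = 4.3
values = [bspline_basis(i, degree, knots, x) for i in range(num_basis)]
print(values)
print(sum(values))
```

Here `values[0]` is zero because $x = 4.3$ lies outside the support $[x_0, x_4]$ of the first spline. On the interior of the knot range, as at this point, the remaining values also add up to one.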

### Optimal Hedge with Monte Carlo

- **Idea:** use all MC paths simultaneously to learn the optimal actions. Learning the optimal actions for all states simultaneously means learning a policy, which is our objective.

- Our optimal equations will be functions of the basis:
    - Optimal Action (Hedge) $$a_t^*(X_t) = \sum_n^M \phi_{nt} \Phi_n(X_t) $$
    - Optimal Q-Fn: $$Q_t^*(X_t,a_t^*) = \sum_n^M \omega_{nt} \Phi_n(X_t) $$
    
    
- The coefficients $\phi_{nt}$ and $\omega_{nt}$ are computed recursively backward in time for $t=T-1, ..., 0$.

- The full explanation of how to obtain the coefficients $\phi_t^*$ can be found in the notebook.


- The **DP** solution expands both the optimal action and optimal Q-fn in the same set of basis functions.

**N.B.** The optimal policy in the DP solution is computed by analyzing all MC paths simultaneously.
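
A minimal one-step sketch of this idea follows. It is not the notebook's code, and several pieces are assumptions made for brevity: piecewise-linear "hat" functions (degree-1 B-splines) stand in for the cubic B-splines, the states at time $t$ are drawn from an arbitrary spread, there is a single step to maturity with a put payoff, the drift term is scaled by $\frac{1}{2\gamma\lambda}$, and a small ridge term is added for numerical stability. The names `hat_basis` and `hedge` are hypothetical.

```python
import numpy as np


def hat_basis(x, centers):
    """Piecewise-linear 'hat' functions (degree-1 B-splines) on a uniform grid."""
    h = centers[1] - centers[0]
    return np.maximum(0.0, 1.0 - np.abs(x[:, None] - centers[None, :]) / h)


rng = np.random.default_rng(2)
n_paths = 50_000
K, mu, r, sigma, dt, lam = 100.0, 0.05, 0.03, 0.2, 1.0 / 12, 0.1  # illustrative
gamma = np.exp(-r * dt)

# States at time t = T - 1 (the spread is an assumption), then one step to maturity.
S_t = 100.0 * np.exp(0.1 * rng.standard_normal(n_paths))
z = rng.standard_normal(n_paths)
S_T = S_t * np.exp((mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z)
X_t = np.log(S_t)  # the drift shift in X_t is constant at fixed t, so it is dropped

dS = S_T - S_t / gamma
dS_hat = dS - dS.mean()
Pi_next = np.maximum(K - S_T, 0.0)  # put payoff at maturity
Pi_hat = Pi_next - Pi_next.mean()

centers = np.linspace(X_t.min(), X_t.max(), 12)
Phi = hat_basis(X_t, centers)       # shape (n_paths, 12)

# Normal equations A phi = B, built from every path at once.
A = Phi.T @ (Phi * (dS_hat**2)[:, None]) + 1e-3 * np.eye(len(centers))
B = Phi.T @ (Pi_hat * dS_hat + dS / (2.0 * gamma * lam))
phi = np.linalg.solve(A, B)


def hedge(S):
    """Stock position at spot price S, read off the fitted basis expansion."""
    return float((hat_basis(np.array([np.log(S)]), centers) @ phi)[0])


print(hedge(85.0), hedge(100.0), hedge(115.0))
```

A single linear solve yields the coefficients for all basis functions, so the hedge is obtained for every state at once. For a put, the fitted position moves from close to $-1$ deep in the money toward $0$ out of the money.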

### Optimal Q-fn with Monte Carlo

- Once the coefficients $\phi_t^*$ of the optimal action $a_t^*$ are found, we shift focus to finding the coefficients $\omega_{nt}$.

- The **Bellman optimality** equation for the optimal action $a_t^*$ can be interpreted as a regression of the form: $$R_t(X_t,a_t^*,X_{t+1}) + \gamma \max_{a_{t+1} \in\mathcal{A}} Q_{t+1}^* (X_{t+1}, a_{t+1}) = Q_t^*(X_t, a_t^*) + \mathcal{E}_t$$

- $\mathcal{E}_t$ is a random noise at time $t$ with zero mean.

- **Reward** is then computed from simulated paths: $$R_t = \gamma\Pi_{t+1} - \Pi_{t} - \lambda \mathrm{Var}_t[\Pi_{t}]$$


- The coefficients $\omega_{nt}$ can then be found by solving a **least-squares optimization** problem.



- For this regression, another pair of quantities is introduced: a matrix $C_t$ and a vector $D_t$.
- The full summary of the **MC Backward Recursion** is in the notebook.


- **Note** 
  - The coefficients of the expansion of the Q-fn in basis functions are obtained in the DP solution from the Bellman equation, interpreted as a regression problem that is solved by least-squares minimization.
  - The DP solution computes rewards as a part of hedge optimization.
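
The regression step for $\omega_{nt}$ can be sketched as ordinary least squares in the basis expansion. Everything below is an illustrative assumption: the states, the hat-function basis (degree-1 B-splines standing in for cubic ones), and especially the targets, which are synthetic stand-ins for $R_t + \gamma \max_{a_{t+1}} Q_{t+1}^*$ rather than values computed from simulated rewards. The matrix `C` and vector `D` are built in the spirit of $C_t$ and $D_t$, though the notebook's exact definitions may differ.

```python
import numpy as np


def hat_basis(x, centers):
    """Piecewise-linear 'hat' functions (degree-1 B-splines) on a uniform grid."""
    h = centers[1] - centers[0]
    return np.maximum(0.0, 1.0 - np.abs(x[:, None] - centers[None, :]) / h)


rng = np.random.default_rng(3)
n_paths = 5_000
X_t = rng.uniform(4.3, 4.9, n_paths)  # hypothetical states at time t
centers = np.linspace(4.3, 4.9, 8)
Phi = hat_basis(X_t, centers)         # shape (n_paths, 8)

# Synthetic regression targets: a smooth function of the state plus
# zero-mean noise, standing in for R_t + gamma * max_a Q_{t+1}.
true_Q = -5.0 * (X_t - 4.6) ** 2 - 1.0
targets = true_Q + 0.1 * rng.standard_normal(n_paths)

# Normal equations C omega = D of the least-squares problem.
C = Phi.T @ Phi
D = Phi.T @ targets
omega = np.linalg.solve(C, D)

# Cross-check against NumPy's least-squares solver.
omega_lstsq, *_ = np.linalg.lstsq(Phi, targets, rcond=None)

# Fitted Q-fn at one state: the basis expansion evaluated at x = 4.6.
Q_fit = float((hat_basis(np.array([4.6]), centers) @ omega)[0])
print(Q_fit)
```

Solving the normal equations and calling a generic least-squares routine give the same coefficients. The explicit form simply makes the matrix and vector of the regression visible.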