### Merton's Portfolio problem as an MDP
##### Continuous time
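- The State $(t,W_t), t \in [0,T]$, where $W_t$ is the wealth at time $t$
- The Action $[\pi_t , c_t]$: $\pi_t$ is the fraction of wealth held in the risky asset and $c_t \in [0,W_t]$ is the consumption rate
- The Reward per unit time $U(c_t) = \frac{c_t^{1-\gamma}}{1-\gamma}$ for $t <T$, and the terminal (bequest) reward $\frac{B(T) W_T^{1-\gamma}}{1-\gamma}$ at $t = T$
- Discount rate: $\rho$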
- The Return: accumulated discounted Reward
$$V(t,W_t) = \mathbb{E}[\int_{t}^T \frac{e^{-\rho s}c_s^{1-\gamma}}{1-\gamma} ds +\frac{e^{-\rho T} B(T) W_T^{1-\gamma}}{1-\gamma}|W_t]$$
- Objective: Find the Policy $(t,W_t) \rightarrow [\pi_t , c_t]$ that maximizes the Expected Return
$$V^{\ast}(t,W_t) = \max_{\pi_t, c_t}\mathbb{E}[\int_{t}^T \frac{e^{-\rho s}  c_s^{1-\gamma}}{1-\gamma} ds +\frac{e^{-\rho T} B(T) W_T^{1-\gamma}}{1-\gamma}|W_t]$$
Bellman Equation (for any $t_1$ with $t < t_1 < T$):
$$V^{\ast}(t,W_t) = \max_{\pi_t, c_t}\mathbb{E}[\int_{t}^{t_1} \frac{e^{-\rho s} c_s^{1-\gamma}}{1-\gamma} ds +V^{\ast}(t_1,W_{t_1})]$$
i.e., letting $t_1 \rightarrow t$,
$$0 = \max_{\pi_t, c_t}\mathbb{E}[dV^{\ast}(t,W_t) + \frac{e^{-\rho t} c_t^{1-\gamma}}{1-\gamma} dt]$$
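i.e., expanding $dV^{\ast}$ with Itô's lemma under the wealth dynamics $dW_t = ((\pi_t(\mu - r) + r)W_t - c_t)dt + \pi_t \sigma W_t dz_t$ (risky-asset drift $\mu$, volatility $\sigma$, risk-free rate $r$),
$$\max_{\pi_t, c_t}\Phi(t,W_t; \pi_t, c_t) = 0$$
where
$$\Phi(t,W_t; \pi_t, c_t) = \frac{\partial V^{\ast}}{\partial t} + \frac{\partial V^{\ast}}{\partial W_t}((\pi_t(\mu - r) + r)W_t - c_t) + \frac{\partial^2 V^{\ast}}{\partial W_t^2}\frac{\pi_t^2 \sigma^2 W_t^2}{2} + \frac{e^{-\rho t} c_t^{1-\gamma}}{1-\gamma}$$
Setting the two partial derivatives $\frac{\partial \Phi}{\partial \pi_t}$ and $\frac{\partial \Phi}{\partial c_t}$ to zero and substituting back into $\Phi = 0$ gives, with the bequest weight written as $B(T) = \epsilon^{\gamma}$: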
$$\pi_t^{\ast} = \frac{\mu-r}{\sigma^2 \gamma}$$
$$c_t^{\ast} = \frac{\nu W_t}{1+ (\nu \epsilon -1) e^{-\nu(T-t)}}, \nu \neq 0$$
$$c_t^{\ast} = \frac{W_t}{T-t+\epsilon}, \nu = 0$$
$$\nu = \frac{\rho - (1-\gamma)(\frac{(\mu-r)^2}{2\sigma^2 \gamma} + r)}{\gamma}$$
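
As a quick numerical check of the closed-form policy, here is a minimal Python sketch. It assumes $\nu$ is computed with $r$ inside the $(1-\gamma)$ factor, as in Merton's solution. The function names and the parameter values in the usage lines are illustrative assumptions, not part of the notes.

```python
import math

def merton_nu(mu, r, sigma, gamma, rho):
    """nu = (rho - (1 - gamma) * ((mu - r)^2 / (2 sigma^2 gamma) + r)) / gamma."""
    excess = (mu - r) ** 2 / (2.0 * sigma ** 2 * gamma)
    return (rho - (1.0 - gamma) * (excess + r)) / gamma

def optimal_allocation(mu, r, sigma, gamma):
    """Constant optimal fraction of wealth in the risky asset."""
    return (mu - r) / (sigma ** 2 * gamma)

def optimal_consumption(t, wealth, T, nu, epsilon):
    """Optimal consumption rate c*_t for wealth W_t at time t."""
    tau = T - t  # time remaining
    if abs(nu) < 1e-12:
        # Limit of the general formula as nu -> 0.
        return wealth / (tau + epsilon)
    # 1 + (nu*eps - 1) * exp(-nu*tau), rewritten with expm1 to limit
    # cancellation error when nu * tau is small.
    denom = -math.expm1(-nu * tau) + nu * epsilon * math.exp(-nu * tau)
    return nu * wealth / denom

# Hypothetical parameters, chosen only for illustration.
mu, r, sigma, gamma, rho, T, eps = 0.08, 0.03, 0.2, 2.0, 0.05, 10.0, 1e-3
nu = merton_nu(mu, r, sigma, gamma, rho)
print("pi* =", optimal_allocation(mu, r, sigma, gamma))
print("nu  =", nu)
print("c*(t=0, W=1) =", optimal_consumption(0.0, 1.0, T, nu, eps))
```

Note that at $t = T$ the formula reduces to $c_T^{\ast} = W_T / \epsilon$, so a small $\epsilon$ means almost all remaining wealth is consumed at the horizon.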

##### Discrete time finite state space
- The State $(t,\bar{W}_t)$
    - Discrete time, $t \in \{0,1,2,...,T\}$
    - Discrete wealth, $\bar{W}_t = \mathrm{round}(W_t, 0.001)$ with $W_0 = 1$, where $W_t$ is the real wealth, so that
$$\bar{W}_t \in \{0,0.001,0.002,...,10\}$$
- The Action $[\pi_t , c_t], \pi_t \in \{0,0.1,0.2,...,1\}, c_t \in \{0,0.001,0.002,...,\bar{W}_t\}$ (note that $\pi_t$ is constrained to $[0,1]$ here, i.e. no short selling and no leverage)
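$$\bar{W}_{t+1} = \mathrm{round}(\bar{W}_t + \bar{W}_t[(1-\pi_t)R^{rf} + \pi_t R^{s}_t] - c_t, 0.001)$$
where the per-period returns $R^{rf}$ and $R^{s}_t$ are applied to current wealth and take discrete values $n \times 0.01$, $n \in \mathbb{Z}$ (the stock return $R^{s}_t$ may be negative)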
- The Reward per time step $U(c_t) = \frac{c_t^{1-\gamma}}{1-\gamma}$ for $t <T$, and the terminal (bequest) reward $\frac{B(T) W_T^{1-\gamma}}{1-\gamma}$ at $t = T$
- Discount rate: $\rho$ (one-period discount factor $\frac{1}{1+\rho}$)
- The Return: accumulated discounted Reward
$$V(t,W_t) = \mathbb{E}[\sum_{s =t}^{T-1}\frac{c_s^{1-\gamma}}{(1-\gamma)(1+\rho)^s} +\frac{B(T) W_T^{1-\gamma}}{(1-\gamma)(1+\rho)^T}|W_t]$$
- Objective: Find the Policy $(t,W_t) \rightarrow [\pi_t , c_t]$ that maximizes the Expected Return
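
The discrete formulation above can be solved by backward induction over $t = T, T-1, ..., 0$. The following is only a coarse sketch under assumptions that are not in the notes: $\gamma = 0.5$ (so that $U(0)$ is finite), $B(T) = 1$, a two-outcome stock return, a wealth grid of step 0.05 capped at 2 (far coarser than the 0.001 grid, to keep the run time short), and the transition $\bar{W}_{t+1} = \mathrm{round}(\bar{W}_t(1 + (1-\pi_t)R^{rf} + \pi_t R^{s}_t) - c_t)$. Actions that could lead to negative wealth are treated as infeasible, and wealth above the cap is clipped to the top grid point.

```python
import numpy as np

# Illustrative assumptions (none of these numbers come from the notes).
GAMMA, RHO, T, B_T = 0.5, 0.05, 3, 1.0   # risk aversion, discount rate, horizon, bequest weight
STEP, W_MAX = 0.05, 2.0                  # coarse wealth grid
R_RF = 0.02                              # risk-free return per period
R_S = np.array([-0.10, 0.20])            # possible stock returns per period
P_S = np.array([0.5, 0.5])               # probabilities of those outcomes
PIS = np.linspace(0.0, 1.0, 11)          # allocation grid {0, 0.1, ..., 1}

N = int(round(W_MAX / STEP)) + 1
wealth = STEP * np.arange(N)
# Gross portfolio return for every (allocation, stock outcome) pair.
GROSS = 1.0 + (1.0 - PIS)[:, None] * R_RF + PIS[:, None] * R_S

def utility(x):
    """CRRA utility x^(1-gamma) / (1-gamma); finite at x = 0 because gamma < 1."""
    return np.asarray(x, dtype=float) ** (1.0 - GAMMA) / (1.0 - GAMMA)

def backward_induction():
    """Return (V, pi_star, c_star) indexed by [time, wealth grid index]."""
    V = np.zeros((T + 1, N))
    pi_star = np.zeros((T, N))
    c_star = np.zeros((T, N))
    V[T] = B_T * utility(wealth) / (1.0 + RHO) ** T   # terminal (bequest) reward
    for t in range(T - 1, -1, -1):
        for i in range(N):
            best = -np.inf
            for k in range(i + 1):                    # consumption c <= current wealth
                c = wealth[k]
                w_next = wealth[i] * GROSS - c        # shape (len(PIS), len(R_S))
                feasible = w_next.min(axis=1) >= -1e-12
                j = np.clip(np.rint(w_next / STEP).astype(int), 0, N - 1)
                q = utility(c) / (1.0 + RHO) ** t + V[t + 1][j] @ P_S
                q[~feasible] = -np.inf                # rule out negative wealth
                a = int(np.argmax(q))
                if q[a] > best:
                    best, pi_star[t, i], c_star[t, i] = q[a], PIS[a], c
            V[t, i] = best
    return V, pi_star, c_star

V, pi_star, c_star = backward_induction()
i0 = int(round(1.0 / STEP))                           # grid index of W_0 = 1
print("V(0, 1) =", V[0, i0], " pi* =", pi_star[0, i0], " c* =", c_star[0, i0])
```

The triple loop over time, wealth, and consumption already shows why the full 0.001 grid is expensive: the work grows roughly with the square of the number of wealth grid points.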

- Try recovering the closed-form solution with a DP algorithm that you implemented previously<br>
<font color='blue'> **I am saving this question until I implement RL. I feel it makes more sense to pose the problem on an infinite discrete state space (vs. a finite one), because a simple discretization of the continuous problem is computationally too expensive.** </font>

### Implement this MDP model in code 
##### Discrete time infinite state space
Ref: https://github.com/ranvirranaiitb/cip_mdp