### MS&E 346 Assignment 5
#### January 25

#### Question
Model Merton's Portfolio problem as an MDP (write the model in LaTeX)  
Implement this MDP model in code  
Try recovering the closed-form solution with a DP algorithm that you implemented previously  
#### Answer:

The problem can be described as follows: an investor can allocate any amount of wealth between a risky asset and a risk-free asset in continuous time ${0 \leq t \leq T}$, where the risky asset has a known mean and variance of returns and the risk-free asset pays a known rate. The goal is to choose the allocation and the consumption that maximize the lifetime aggregate utility of consumption.

The problem can then be cast as finding the optimal discounted value function. The budget constraint determines how wealth ${W_t}$ evolves, and at any time ${t}$ we choose the allocation and consumption to maximize $${E\left[\int_{t}^{T}\frac{e^{-\rho(s-t)}c_s^{1-\gamma}}{1-\gamma}\,ds + \frac{e^{-\rho(T - t)}\epsilon^{\gamma}W_T^{1-\gamma}}{1-\gamma}\right]}$$ where the bequest function is assumed to be ${\epsilon^\gamma}$.
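
For reference, the wealth dynamics usually stated for the single-risky-asset version of this problem are sketched below. This is an assumption, since the notes do not write the process out: ${z_t}$ denotes a standard Brownian motion, ${\pi_t}$ the fraction of wealth held in the risky asset, ${c_t}$ the consumption rate, and ${\mu}$, ${\sigma}$, ${r}$ the risky mean return, risky volatility, and risk-free rate.

$${dW_t = \left((\pi_t(\mu - r) + r)W_t - c_t\right)dt + \pi_t \sigma W_t \, dz_t}$$

The drift collects the blended portfolio return net of consumption, and the diffusion term scales with the wealth held in the risky asset.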

Think of this as a continuous-time stochastic control problem in which the state is ${(t, W_t)}$, the action is ${[\pi_t, c_t]}$, and the reward per unit time is ${U(c_t)}$. In what follows we discretize the problem, so it becomes a finite-horizon problem: choose ${[\pi_t, c_t]}$ at each of finitely many time steps to maximize the expected return.
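
To make the discretized objective concrete, here is a minimal sketch of the CRRA utility and a Riemann-sum approximation of the objective along a single sample path. The function names `crra_utility` and `discretized_objective`, and the step size `dt`, are hypothetical and not part of the original notebook; the log-utility branch for `gamma == 1` is the standard limiting case and is also an assumption here.

```python
import math
from typing import Sequence


def crra_utility(c: float, gamma: float) -> float:
    """CRRA utility U(c) = c^(1 - gamma) / (1 - gamma), with log utility at gamma == 1."""
    if c <= 0:
        raise ValueError("consumption (or wealth) must be positive")
    if gamma == 1.0:
        return math.log(c)
    return c ** (1.0 - gamma) / (1.0 - gamma)


def discretized_objective(
    consumption: Sequence[float],
    final_wealth: float,
    rho: float,
    gamma: float,
    epsilon: float,
    dt: float,
) -> float:
    """Approximate the discounted utility integral plus bequest for one path.

    consumption[k] is the consumption rate held over [k * dt, (k + 1) * dt).
    """
    # left-endpoint Riemann sum of the discounted running reward U(c_s)
    running = sum(
        math.exp(-rho * k * dt) * crra_utility(c, gamma) * dt
        for k, c in enumerate(consumption)
    )
    # bequest term: discounted epsilon^gamma * U(W_T)
    horizon = len(consumption) * dt
    bequest = (
        math.exp(-rho * horizon) * epsilon ** gamma * crra_utility(final_wealth, gamma)
    )
    return running + bequest
```

Averaging `discretized_objective` over many simulated paths gives a Monte Carlo estimate of the expectation that the policy is meant to maximize.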

In [2]:
# for simplicity, we state and solve the problem for 1 risky asset
# where the random variable dS_t is written in the formula shown above
# pi denotes the fraction of wealth allocated to the risky asset
import numpy as np
import math
from typing import NamedTuple, Sequence, Tuple

Think of this as a continuous-time stochastic **Control** problem  
* The State is ${(t, W_t)}$
* The Action is ${[\pi_t, c_t]}$
* The Reward per unit time is ${U(c_t)}$
* The Return is the usual accumulated discounted Reward  

The task is to find the optimal policy, i.e. the one that maximizes the expected Return.
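
As a benchmark for the DP solution, the following sketch evaluates the well-known closed-form policy for the continuous-time CRRA case with one risky asset. The function names are hypothetical; `gamma` here is the CRRA risk-aversion coefficient, and `rho` and `epsilon` are the discount rate and bequest parameter from the objective. The formulas assume the standard geometric-Brownian-motion setup, which the discretized toy model only approximates.

```python
import math


def merton_risky_fraction(mu: float, r: float, sigma: float, gamma: float) -> float:
    """Optimal constant fraction of wealth in the risky asset: (mu - r) / (sigma^2 * gamma)."""
    if sigma <= 0 or gamma <= 0:
        raise ValueError("sigma and gamma must be positive")
    return (mu - r) / (sigma ** 2 * gamma)


def merton_consumption_fraction(
    t: float,
    T: float,
    mu: float,
    r: float,
    sigma: float,
    gamma: float,
    rho: float,
    epsilon: float,
) -> float:
    """Optimal consumption rate as a fraction of current wealth, c*(t, W) / W."""
    nu = (rho - (1.0 - gamma) * ((mu - r) ** 2 / (2.0 * sigma ** 2 * gamma) + r)) / gamma
    if abs(nu) < 1e-12:
        # limiting case nu -> 0
        return 1.0 / (T - t + epsilon)
    return nu / (1.0 + (nu * epsilon - 1.0) * math.exp(-nu * (T - t)))


if __name__ == "__main__":
    # illustrative parameters only, not taken from the assignment
    print(merton_risky_fraction(mu=0.08, r=0.02, sigma=0.2, gamma=2.0))
    print(merton_consumption_fraction(0.0, 1.0, 0.08, 0.02, 0.2, 2.0, 0.05, 0.5))
```

Note that the risky fraction does not depend on time or wealth, so a correct DP solution on a fine enough grid should pick roughly the same allocation in every state.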

In [3]:
class optimal_allocation_consumption:
    """
    This class discretizes the state space and
    restricts the policy space to a finite set.
    In this toy example, the W/c values are discretized to 10 levels (0 - 9),
    and the policy can only invest a multiple of 10% of wealth in the risky asset.
    """
    def calculate_next_w(self, curt_t, curt_W, action):
        (pi, c) = action
        # sample the one-period return of the risky asset
        next_return_risk = np.random.normal(self.mu, self.sigma, size=None)
        # wealth grows at the blended portfolio return, less the consumed fraction c
        next_W = curt_W + ((1 - pi) * self.r + pi * next_return_risk) * curt_W - c * curt_W
        # to fit the discretized state space, round next_W up to an integer
        next_W = math.ceil(next_W)
        next_t = curt_t + 1
        return (next_t, next_W)
    
    def state_reward_gen(self, state, action, num_samples: int):
        t, W = state
        next_states = [(
            t + 1,
            action[0] * (1. + rr) + (W - action[0]) * (1. + self.r)
        ) for rr in self.risky_returns(num_samples)]
        return [(
            x,
            self.gamma * self.utility(x[1]) if t == self.time_steps - 1 else 0.
        ) for x in next_states]
    
    # draw `size` i.i.d. one-period returns of the risky asset from N(mu, sigma^2)
    def risky_returns(self, size):
        return np.random.normal(loc=self.mu, scale=self.sigma, size=size)
    
    def __init__(self, T, gamma, mu, r, sigma):
        self.num_of_money_states = 10
        self.num_of_action_states = 11 # (0%, 10%, ..., 100%)
        self.num_of_states = T * self.num_of_money_states
        self.T = T
        self.gamma = gamma
        
        # the mean and standard deviation of the risky asset's return
        self.mu = mu
        self.sigma = sigma
        
        # the Treasury bond rate (risk-free rate)
        self.r = r
        
        # initialize the transition and policy matrices
        self.transition = np.zeros((self.num_of_states, self.num_of_states))
        # np.zeros takes the shape as a single tuple
        self.policy_matrix = np.zeros((self.num_of_states, self.num_of_action_states))
        # admissible risky-asset fractions: 0%, 10%, ..., 100%
        self.policy_set = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
        