#  Mathematics of Reinforcement Learning: Homework Sheet 4

## Exercise 1
The theoretical exercise 1 can be found in the PDF file. 

## Exercise 2
In this exercise we use the Bellman optimality equation to compute the optimal policy in the optimal investment problem.

Make sure you have all the necessary packages (numpy, matplotlib) installed. You can use `conda install <package>` or `pip install <package>`.

In [1]:
import matplotlib.pyplot as plt
import numpy as np

In [2]:
#We define the start state, which will be used in some test functions.
start_state = (0., 8., 100.)

#market parameters
T = 3     #Time horizon
u = 1.0   #CRR parameter, up factor change, slightly different from the other exercises!
d = -.5   #CRR parameter, down factor change
q = .5    #CRR parameter, probability of up change

**Task 1:** Implement a function `get_admissible_actions(p, w)` that takes the current asset price `p` and the current wealth `w` as floats and returns a list of all admissible actions, i.e. the amounts of money that can be invested in the asset, as float values.

*Reminder:* Only integer numbers of the financial asset can be bought. This number must be non-negative, and the total purchase price can't exceed the current wealth.

In [3]:
def get_admissible_actions(p, w):
    
    max_num_assets = int(np.floor(w/p))
    
    return [p * i for i in range(max_num_assets + 1)]

assert get_admissible_actions(10,20) == [0,10,20]

**Task 2:** Implement a function `get_all_states()` that returns a list of all possible states in the Optimal Investment Markov Decision Model up to the timepoint `T` in the form of tuples. The structure of the tuple should be `(timestep, asset_price, wealth)`.
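
To make the state structure concrete, here is a minimal sketch of a single transition. The helper name `successors` is hypothetical and not part of the sheet; the update rule (price multiplied by `1 + R`, wealth changed by `R * a` for a return `R` in `{u, d}`) is assumed from the CRR parameters defined above.

```python
# CRR parameters as defined on the sheet.
u, d = 1.0, -0.5

def successors(state, a):
    """Return the (up, down) successor states when the amount a is invested.

    Hypothetical helper for illustration only.
    """
    t, p, w = state
    return [(t + 1, p * (1 + R), w + R * a) for R in (u, d)]

print(successors((0, 8.0, 100.0), 48.0))
# [(1, 16.0, 148.0), (1, 4.0, 76.0)]
```

Each state at time `t < T` therefore has two successors per admissible action, which is why enumerating all states naively produces many duplicates.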

In [4]:
def get_all_states():
    states = [start_state]
    
    for state in states:
        if state[0] < T:
            for a in get_admissible_actions(state[1],state[2]):
                for R in [d,u]:
                    
                    new_state = ( state[0] + 1,
                                  state[1] * (1+R),
                                  state[2] + R * a
                                )
                    states.append(new_state)
    return states

states = get_all_states()
print(len(states))
# remove duplicate states
states = list(set(states))
print(f'Number of calculated states: {len(states)}. This number should be equal to 482')
assert len(states) == 482

30545
Number of calculated states: 482. This number should be equal to 482


**Task 3:** Implement a function `bellman_optimality_equation(states)` that takes as input the list of all possible states and returns a dictionary keyed by state (in form of tuples). The value for each key is itself a dictionary with string keys `"opt_a"` and `"V"`, holding the optimal action and the value of that state, computed with the Bellman optimality equation given in Exercise 1.

*Hints:*
1. Start by computing the value function of all terminal wealths (the optimal action can be set to `None`).
2. Working backwards in time, starting at time `t = T-1`, iterate over all states with timepoint `t`, then iterate over all admissible actions and compute the value following that action. The optimal action is the action with the highest expected value in the next state. 
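
The equation from Exercise 1 is not reproduced in this notebook. As a sketch, assuming logarithmic utility of terminal wealth and the two-point return distribution from the market parameters, the backward recursion described in the hints would read as follows, where $A(p, w)$ is an assumed notation for the set of admissible actions:

$$V_T(p, w) = \log(w), \qquad V_t(p, w) = \max_{a \in A(p, w)} \Big[ q\, V_{t+1}\big(p(1+u),\, w + u a\big) + (1-q)\, V_{t+1}\big(p(1+d),\, w + d a\big) \Big], \quad t = T-1, \dots, 0.$$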

In [5]:
def bellman_optimality_equation(states):
    
    bellman_dict = {}
    
    for state in states:
        if state[0] == T:
            bellman_dict[state] = {"opt_a" : None, "V" : np.log(state[2])}
    
    for t in range(T-1,-1,-1):
        for state in states:
            if state[0] == t:
                opt_a = None
                V = -np.inf
                
                for a in get_admissible_actions(state[1], state[2]):
                    next_state_1 = (state[0] + 1,
                                  state[1] * (1 + u),
                                  state[2] + u * a
                                )
                    next_state_2 = ( state[0] + 1,
                                  state[1] * (1 + d),
                                  state[2] + d * a
                                )
                    value = q * bellman_dict[next_state_1]["V"] + \
                            (1-q) * bellman_dict[next_state_2]["V"]
                        
                    if value > V:
                        V = value
                        opt_a = a
                

                # store the maximizer found in the loop over admissible actions
                bellman_dict[state] = {"opt_a" : opt_a, "V" : V}
    return bellman_dict
    
bellman_dict = bellman_optimality_equation(states)

#Test function
assert bellman_dict[start_state]['opt_a'] == 48.0
assert np.isclose(bellman_dict[start_state]['V'], 4.781250991199082)

**Task 4:** Visualise the optimal strategy. To do this, create a Matplotlib scatter plot with the current wealth on the x-axis and the amount of money invested in the financial asset on the y-axis. First, plot all possible wealth-action pairs. Next, highlight the optimal wealth-action pairs.

Try to reproduce Figure 1.5 from the script. The formula for the black line in Figure 1.5 is given by
    $$\text{black line}(w) = \left(-\frac{q}{d}-\frac{1-q}{u}\right) \cdot w.$$
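
As a quick sanity check of this slope, the sketch below evaluates it for the sheet's parameters and compares it with a brute-force maximisation of the one-step expected log-wealth over a fine grid of invested amounts. The log utility, the helper `expected_log_wealth`, and the grid are assumptions made for illustration; the check ignores the integer constraint on the number of assets, so the optimal actions in the plot can only be expected to lie near the line.

```python
import numpy as np

# Market parameters copied from the sheet.
u, d, q = 1.0, -0.5, 0.5

# Slope of the black line: fraction of wealth invested in the asset.
opt_frac = -q / d - (1 - q) / u

def expected_log_wealth(w, a):
    """One-step expected log-wealth when investing the amount a out of wealth w."""
    return q * np.log(w + u * a) + (1 - q) * np.log(w + d * a)

# Hypothetical check: brute-force the one-step problem on a grid of amounts.
w = 100.0
grid = np.linspace(0.0, w, 10001)  # step size 0.01
best_a = grid[np.argmax(expected_log_wealth(w, grid))]

print(opt_frac)    # 0.5
print(best_a / w)  # close to opt_frac
```

Setting the derivative of `expected_log_wealth` with respect to `a` to zero gives `q*u*(w + d*a) + (1-q)*d*(w + u*a) = 0`, which rearranges to the slope formula above.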

In [None]:
plot_states = []
plot_states_opt = []
min_wealth = + np.inf
max_wealth = - np.inf
for state in states:
    # terminal states have no action, so only states before T are plotted
    if state[0] < T:
        # track the wealth range for the black reference line
        min_wealth = min(min_wealth, state[2])
        max_wealth = max(max_wealth, state[2])
        for action in get_admissible_actions(p = state[1], w = state[2]):
            plot_states.append((state[2],action))
            if action == bellman_dict[state]["opt_a"]:
                plot_states_opt.append((state[2],action))

# plot the results
fig, ax = plt.subplots()
ax.scatter(*zip(*plot_states), c='#0065bd')
ax.scatter(*zip(*plot_states_opt), c='#F7811E')
opt_frac = -q/d-(1-q)/u
ax.plot([min_wealth, max_wealth], [opt_frac*min_wealth, opt_frac*max_wealth], c='black')
plt.xlabel("Total Wealth")
plt.ylabel("Wealth in Financial Asset")
plt.savefig("opt_policy")
plt.show()
