In [1]:
%load_ext autoreload
%autoreload 2
import sys
sys.path.insert(0,'../../modules')

In [2]:
import numpy as np
from maze_problem import Maze
import policy_iteration

# Value Iteration
An alternative to policy iteration is value iteration. <br>
Whereas policy iteration alternates between evaluating the value function of the current policy and improving the policy against that value function, value iteration updates only the value function. It does this by using the condition that the value function of the optimal policy must satisfy. <br>
The optimal value function is such that:
$$U^\pi(s)=\max_a\bigg(R(s,a) + \lambda \sum_{s'} T(s'|s,a)U^\pi(s') \bigg)$$
If we have a current estimate of $U^\pi$, we can improve it by applying the above repeatedly as an update rule.
The Bellman update is: <br>
$$U_{m+1}^\pi(s)=\max_a\bigg(R(s,a) + \lambda \sum_{s'} T(s'|s,a)U_{m}^\pi(s') \bigg)$$
Because this update is what is known as a contraction mapping (for $\lambda<1$), repeated application converges to a single fixed point, and at that fixed point the optimality equation above holds by definition.
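The convergence claim can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a hypothetical 2-state, 2-action MDP (not the maze), stores transitions as `T[a][s, s2]` for $T(s'|s,a)$ (which may differ from the layout used by `Maze`), and the names `bellman_backup`, `U_a` and `U_b` are made up for the example. It applies the update from two different starting estimates and records how far apart they are at each step.

```python
import numpy as np

# Hypothetical toy MDP (an assumption for illustration, not the maze).
# T[a][s, s2] is the probability of moving from s to s2 under action a.
T = np.array([[[0.9, 0.1], [0.1, 0.9]],   # action 0: mostly stay put
              [[0.2, 0.8], [0.8, 0.2]]])  # action 1: mostly switch state
R = np.array([[0.0, 1.0],                 # R[a, s]: reward for action a in state s
              [0.0, 1.0]])
discount = 0.9

def bellman_backup(U):
    """Apply the Bellman update once: max over actions of R + discount * T U."""
    return np.max(R + discount * (T @ U), axis=0)

# Two deliberately different starting estimates
U_a = np.zeros(2)
U_b = np.array([50.0, -20.0])

gaps = []
for _ in range(300):
    gaps.append(np.max(np.abs(U_a - U_b)))  # sup-norm distance between estimates
    U_a, U_b = bellman_backup(U_a), bellman_backup(U_b)

# Ratio of successive gaps (ignoring gaps too small to compare reliably)
ratios = [g1 / g0 for g0, g1 in zip(gaps, gaps[1:]) if g0 > 1e-6]

print("largest gap ratio:", max(ratios))
print("final estimates:", U_a, U_b)
print("residual of the optimality equation:", np.max(np.abs(bellman_backup(U_a) - U_a)))
```

In this toy problem the gap never shrinks by a factor worse than the discount, both runs end at the same utilities, and those utilities satisfy the optimality equation up to floating point error.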
### Aside: Contraction mappings
A contraction mapping is a function $f$ on a metric space with distance function $d$ that brings any two points closer together: the distance between $f(x)$ and $f(y)$ is at most a constant factor $k$ times the distance between $x$ and $y$:
$$d(f(x),f(y))\le k\, d(x,y)$$
Where $k$ must satisfy $0 \le k < 1$. <br>
E.g: <br>
$$f(x)=\frac{x}{2}$$
The distance between two points (in 1d Euclidean space) is:
$$ |x-y|$$
The distance between the images of those points under $f$ is:
$$ \bigg|\frac{x}{2}-\frac{y}{2}\bigg|$$
Which becomes:
$$ \bigg|\frac{x-y}{2}\bigg|$$
So:
$$ \bigg|\frac{x-y}{2}\bigg| \le k|x-y|$$
for any $k \ge 0.5$, and $f$ is a contraction because $k=0.5$ is below 1.<br>
The smallest possible value of $k$ for which the above holds is called the Lipschitz constant of $f$; here it is 0.5. <br>
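The example above can be checked directly. This is a minimal sketch; the helper names `f` and `d` and the random test points are assumptions made for illustration only.

```python
import random

def f(x):
    # the example mapping f(x) = x / 2
    return x / 2

def d(x, y):
    # distance in 1d Euclidean space
    return abs(x - y)

random.seed(0)
pairs = [(random.uniform(-10, 10), random.uniform(-10, 10)) for _ in range(5)]

# The contraction inequality holds with k = 0.5 ...
holds_for_half = all(d(f(x), f(y)) <= 0.5 * d(x, y) for x, y in pairs)
# ... but not with a smaller constant such as k = 0.4
holds_for_smaller_k = all(d(f(x), f(y)) <= 0.4 * d(x, y) for x, y in pairs)

# Repeatedly applying f moves any starting point towards the fixed point 0
x = 8.0
for _ in range(10):
    x = f(x)

print(holds_for_half, holds_for_smaller_k, x)
```

The second check fails because 0.5 is the Lipschitz constant of $f$, so no smaller $k$ can work. Ten applications of $f$ take the starting point 8 to $8/2^{10}$, which illustrates the convergence to a single fixed point that value iteration relies on.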
**An example of value iteration:**

In [3]:
world = np.array([['W','W','W','W','W','W','W'],
                  ['W','B','B','B','B','B','W'],
                  ['W','S','F','B','B','F','W'],
                  ['W','B','B','B','B','B','W'],
                  ['W','B','B','B','F','B','W'],
                  ['W','B','W','W','W','G','W'],
                  ['W','B','F','B','B','B','W'],
                  ['W','B','B','F','B','B','W'],
                  ['W','B','B','G','B','B','W'],
                  ['W','W','W','W','W','W','W'],])

prob_correct_step = 0.7
maze = Maze(world,prob_correct_step)

In [4]:
print(maze)

[[W   W   W   W   W   W   W ] 
 [W   B   B   B   B   B   W ] 
 [W   S   F   B   B   F   W ] 
 [W   B   B   B   B   B   W ] 
 [W   B   B   B   F   B   W ] 
 [W   B   W   W   W   G   W ] 
 [W   B   F   B   B   B   W ] 
 [W   B   B   F   B   B   W ] 
 [W   B   B   G   B   B   W ] 
 [W   W   W   W   W   W   W ]] world map
[[                          ] 
 [    0   1   2   3   4     ] 
 [    5   6   7   8   9     ] 
 [    10  11  12  13  14    ] 
 [    15  16  17  18  19    ] 
 [    20              21    ] 
 [    22  23  24  25  26    ] 
 [    27  28  29  30  31    ] 
 [    32  33  34  35  36    ] 
 [                          ]] state map


### Policy Iteration

In [5]:
action_to_matrix = {'L':maze.left_transition_matrix,
                    'R':maze.right_transition_matrix,
                    'U':maze.up_transition_matrix,
                    'D':maze.down_transition_matrix}
reward = maze.get_reward({'S':-1,'F':-30,'B':-1,'G':100})
PI_utility,PI_policy = policy_iteration.run_infinite_policy_iteration(action_to_matrix,reward,0.95,100)

In [6]:
PI_policy_transition_matrix = maze.get_policy_matrix(PI_policy)
maze.make_animation(PI_policy_transition_matrix,100)

### Value iteration

In [7]:
def value_iteration(rewards_array,action_transition_dict,discount,iterations):
    # rewards_array: one row per action, one column per state
    n_states = rewards_array.shape[1]
    action_names = list(action_transition_dict.keys())

    def action_utilities(utility):
        # row a: R(s,a) + discount * expected utility of the next state under action a
        action_utility_matrix = np.zeros((len(action_names),n_states))
        for a,action in enumerate(action_names):
            action_utility_matrix[a]=discount*action_transition_dict[action].T.dot(utility)+rewards_array[a]
        return action_utility_matrix

    current_utility = np.zeros(n_states)
    for i in range(iterations):
        # Bellman update: keep the value of the best action in each state
        current_utility = np.max(action_utilities(current_utility),axis=0)

    # now get policy: the greedy action with respect to the final utility,
    # including the reward and discount so action-dependent rewards are handled
    best_policy = np.argmax(action_utilities(current_utility),axis=0)
    new_policy = [action_names[d] for d in best_policy]
    return current_utility,new_policy
    
# Here the reward depends only on the current state, so repeat it once per action
# to get the (actions x states) array that value_iteration expects.
# (In general the rewards could differ between actions.)
reward_repeat = reward.reshape(1,-1).repeat(len(action_to_matrix),axis=0)
VI_utility,VI_policy = value_iteration(reward_repeat,action_to_matrix,0.95,100)
VI_policy_transition_matrix = maze.get_policy_matrix(VI_policy)
maze.make_animation(VI_policy_transition_matrix,100)

### Linear programming
An alternative way to solve for the optimal value function, and from it the best policy, is linear programming. A linear program has a linear objective function and linear inequality constraints. The Bellman equation:
$$U^\pi(s)=\max_a\bigg(R(s,a) + \lambda \sum_{s'} T(s'|s,a)U^\pi(s') \bigg)$$
can be transformed into:
$$\text{Minimize} \sum_s U^\pi(s)$$
With the constraints:
$$U^\pi(s)\ge\max_a\bigg(R(s,a) + \lambda \sum_{s'} T(s'|s,a)U^\pi(s') \bigg)$$
Minimizing the sum pushes each $U^\pi(s)$ down until it meets the right-hand side, so at the optimum the inequality becomes an equality.
The maximum is not a linear function, but $U^\pi(s)$ is at least the maximum over $a$ exactly when it is at least the value for every $a$. The constraint can therefore be replaced by one linear constraint for every pair of $s$ and $a$:
$$U^\pi(s)\ge R(s,a) + \lambda \sum_{s'} T(s'|s,a)U^\pi(s')$$
(so long as the above is true for all $s$ and $a$)
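
The linear program can be sketched in code as follows. This is an illustration under stated assumptions rather than part of the notebook's modules: it uses SciPy's `scipy.optimize.linprog` (not otherwise used here), a hypothetical 2-state, 2-action MDP instead of the maze, and stores transitions as `T[a][s, s2]` for $T(s'|s,a)$. Each constraint is rearranged into the form `linprog` expects, $(\lambda T_a - I)U \le -R_a$.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy MDP (an assumption for illustration, not the maze).
# T[a][s, s2] is the probability of moving from s to s2 under action a.
T = np.array([[[0.9, 0.1], [0.1, 0.9]],   # action 0: mostly stay put
              [[0.2, 0.8], [0.8, 0.2]]])  # action 1: mostly switch state
R = np.array([[0.0, 1.0],                 # R[a, s]: reward for action a in state s
              [0.0, 1.0]])
discount = 0.9
n_actions, n_states = R.shape

# U(s) >= R(s,a) + discount * sum_s2 T(s2|s,a) U(s2) for every s and a,
# rewritten as (discount * T[a] - I) U <= -R[a] and stacked over actions.
A_ub = np.vstack([discount * T[a] - np.eye(n_states) for a in range(n_actions)])
b_ub = np.concatenate([-R[a] for a in range(n_actions)])

# Objective: minimize the sum of the utilities
c = np.ones(n_states)

# linprog restricts variables to be non-negative by default; utilities can be
# negative in general, so the bounds are removed explicitly.
result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * n_states)
U_lp = result.x

# Greedy policy with respect to the solution (indices into the action axis)
policy_lp = np.argmax(R + discount * (T @ U_lp), axis=0)

print(result.success, U_lp, policy_lp)
```

For this toy problem the solution should satisfy the Bellman equation, which can be checked by comparing `U_lp` with `np.max(R + discount * (T @ U_lp), axis=0)`.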