In [1]:
%load_ext autoreload
%autoreload 2
import sys
sys.path.insert(0,'../../modules')

In [2]:
import numpy as np
from maze_problem import Maze

# Markov Decision Processes
Many problems require making a sequence of decisions (e.g. chess, Go, card games). <br>
One approach to sequential problems is to assume the Markov condition. In a Markov decision process (MDP) we have a set of actions $A$ and a set of states $S$. At each timestep we take an action $A_t$, which moves the current state $S_t$ to a new state $S_{t+1}$ with probability $P(S_{t+1}|S_t,A_t)$. At each timestep a reward $R_t$ is also received, drawn with probability $P(R_t|S_t,A_t)$ given the current state and action. The Markov condition is that the next state depends only on the current state and current action, not on any earlier history.

When we make a finite number of decisions, the utility (which we want to maximize) is the sum of rewards over all timesteps, $\sum_{t=1}^n R_t$. With infinitely many steps this sum can diverge, so a discount factor $0 \le \lambda < 1$ is often included, which weights a reward less the further in the future it arrives: $\sum_{t=1}^\infty \lambda^{t-1} R_t$. Alternatively we can use the average reward instead, but that can be computationally difficult.
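
As a small illustration of the discounted sum, the sketch below evaluates $\sum_t \lambda^{t-1} R_t$ for a short reward sequence. The function name `discounted_return` and the reward values are made up for this example; they are not part of the `maze_problem` module.

```python
def discounted_return(rewards, discount):
    """Return sum over t of discount**(t-1) * R_t, with t counted from 1."""
    # enumerate starts at 0, which matches the exponent t-1 for t = 1, 2, ...
    return sum(discount ** k * r for k, r in enumerate(rewards))

# Hypothetical reward sequence: a reward of 1 at each of four timesteps.
rewards = [1.0, 1.0, 1.0, 1.0]

undiscounted = discounted_return(rewards, 1.0)  # plain sum of rewards
discounted = discounted_return(rewards, 0.5)    # 1 + 0.5 + 0.25 + 0.125

# With a constant reward of 1 forever, the discounted sum is bounded by
# the geometric series limit 1 / (1 - discount).
bound = 1.0 / (1.0 - 0.5)
print(undiscounted, discounted, bound)
```

With a discount of 1 the sum keeps growing as more steps are added, while any discount below 1 keeps it under the geometric-series bound.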

Because we don't know the reward at each timestep in advance, we take an expectation over all possible rewards $R_t^i$, so the above becomes $\sum_{t=1}^\infty \lambda^{t-1} \sum_i P(R_t^i)R_t^i$. Because of the Markov condition we know the probability of $R_t$ given $S_t$ and $A_t$ and can rewrite this as:
$$\sum_{t=1}^\infty \lambda^{t-1} \sum_i P(R_t^i|S_t,A_t)R_t^i$$

While there are non-stationary MDPs, it is useful to assume that $P(S_{t+1}|S_t,A_t)$ and $P(R_t|S_t,A_t)$ are the same for all $t$. In a stationary MDP the transition from one state to another is described by a function $T(s'|s,a)$ that doesn't depend on $t$. The expected reward for executing action $a$ in state $s$ is written $R(s,a)$. So the above is:
$$\sum_{t=1}^\infty \lambda^{t-1} R(S_t,A_t)$$
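
To make the step from the reward distribution to $R(s,a)$ concrete, here is a minimal sketch that computes the expectation $\sum_i P(R^i|s,a)R^i$ for a single state and action. The reward values and probabilities are invented for illustration.

```python
import numpy as np

# Hypothetical reward distribution for one (state, action) pair:
# the possible reward values R^i and their probabilities P(R^i | s, a).
reward_values = np.array([-1.0, 0.0, 2.0])
reward_probs = np.array([0.2, 0.5, 0.3])

# The probabilities must form a distribution.
assert np.isclose(reward_probs.sum(), 1.0)

# R(s, a) is the probability-weighted average of the possible rewards:
# 0.2 * (-1) + 0.5 * 0 + 0.3 * 2
expected_reward = float(reward_probs @ reward_values)
print(expected_reward)
```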

In an MDP we are trying to find a good policy, which tells us which action to take given the previous actions and the current state. With an infinite-horizon stationary MDP we can use policies that do not depend on $t$. We call such a stationary policy $\pi$; it maps a state to an action, $\pi(s)$. Together with the stationary transition function $T(s'|s,a)$, the probability of moving to state $s'$ from state $s$ under action $a$, it determines how the state evolves.

With this in mind the above overall utility function can be written:
$$\sum_{t=1}^\infty \lambda^{t-1} R(S_t,\pi(S_t))$$
Let $U_k^\pi(s)=\sum_{t=1}^k \lambda^{t-1} R(S_t,\pi(S_t))$ with starting state $S_1=s$ (that is, $t$ is counted from 1 **relative** to $s$). <br>
So, $U_1^\pi(s)=R(s,\pi(s))$<br>
Then $U_\infty^\pi(s)$ is the value we want to maximize. <br>
$$
\begin{aligned}
    U_{k+1}^\pi(s)&=\sum_{t=1}^{k+1} \lambda^{t-1} R(S_t,\pi(S_t)) \\
    &= R(s,\pi(s)) + \lambda \sum_{t=2}^{k+1} \lambda^{t-2} R(S_t,\pi(S_t)) \\
    &= R(s,\pi(s)) + \lambda U_k^\pi(s')
\end{aligned}
$$
Here $s'$ is the state that follows $s$. Since $s'$ is random and not known in advance, we take the expectation over it:
$$U_{k+1}^\pi(s)=R(s,\pi(s)) + \lambda \sum_{s'} T(s'|s,\pi(s)) U_k^\pi(s')$$
This formula is intuitive: the value of a state, $U^\pi(s)$, is the immediate reward plus the discounted expected value of the possible next states.
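
A single application of this update can be written as one matrix-vector product. The sketch below assumes the policy has already been folded into a reward vector `R_pi` (entries $R(s,\pi(s))$) and a matrix `T_pi` (entries $T(s'|s,\pi(s))$); these names and the two-state numbers are hypothetical.

```python
import numpy as np

def bellman_backup(U_k, R_pi, T_pi, discount):
    """One update: U_{k+1}(s) = R(s, pi(s)) + discount * sum_s' T(s'|s, pi(s)) * U_k(s')."""
    return R_pi + discount * T_pi @ U_k

# Hypothetical 2-state problem under a fixed policy.
R_pi = np.array([1.0, 0.0])      # immediate reward in each state
T_pi = np.array([[0.5, 0.5],     # from state 0: stay or move to state 1
                 [0.0, 1.0]])    # state 1 always stays in state 1

U_1 = R_pi.copy()                # U_1(s) = R(s, pi(s))
U_2 = bellman_backup(U_1, R_pi, T_pi, discount=0.9)

# By hand: U_2[0] = 1 + 0.9 * (0.5 * 1 + 0.5 * 0) = 1.45
#          U_2[1] = 0 + 0.9 * (1.0 * 0)           = 0.0
print(U_2)
```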

For an infinite horizon, $U^\pi$ satisfies the fixed-point equation below, which can be solved iteratively by repeating the update above until the values stop changing:
$$U^\pi(s)=R(s,\pi(s))+\lambda\sum_{s'}T(s'|s,\pi(s))U^\pi(s')$$
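
The sketch below shows two ways to obtain the infinite-horizon values under stated assumptions: repeating the update until it converges, and solving the equivalent linear system $(I-\lambda T^\pi)U^\pi=R^\pi$ directly. The function names, tolerance, and two-state example are illustrative choices, not taken from the `maze_problem` module.

```python
import numpy as np

def evaluate_policy_iteratively(R_pi, T_pi, discount, tol=1e-10, max_iter=10_000):
    """Repeat U <- R_pi + discount * T_pi @ U until successive values agree within tol."""
    U = np.zeros_like(R_pi, dtype=float)
    for _ in range(max_iter):
        U_next = R_pi + discount * T_pi @ U
        if np.max(np.abs(U_next - U)) < tol:
            return U_next
        U = U_next
    return U

def evaluate_policy_exactly(R_pi, T_pi, discount):
    """Solve (I - discount * T_pi) U = R_pi, a rearrangement of the fixed-point equation."""
    n = len(R_pi)
    return np.linalg.solve(np.eye(n) - discount * T_pi, R_pi)

# Hypothetical 2-state problem under a fixed policy.
R_pi = np.array([1.0, 0.0])
T_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])

U_iter = evaluate_policy_iteratively(R_pi, T_pi, discount=0.9)
U_exact = evaluate_policy_exactly(R_pi, T_pi, discount=0.9)
print(U_iter, U_exact)
```

For small state spaces the direct solve is usually simplest; the iterative version avoids forming and factoring an $n \times n$ system, which matters more as the number of states grows.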

### Evaluating a simple decision:
**W** is a wall <br>
**F** is a fire <br>
**G** is gold <br>
**B** is empty <br>
**S** is start. <br>
Actions are moving in a direction. Moving into a wall leaves you in the same place. <br>
You have a $70\%$ chance of moving in the direction you choose, a $15\%$ chance of slipping to the left, and a $15\%$ chance of slipping to the right. <br>
The state is simply your current position.
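
The movement rule can be sketched as a function that returns the distribution over next positions for one move. This is an independent illustration, not the `Maze` class's implementation: it assumes "left" and "right" are taken relative to the chosen direction, and the helper names (`step`, `transition_distribution`) are hypothetical.

```python
import numpy as np

world = np.array([['W','W','W','W','W'],
                  ['W','B','B','G','W'],
                  ['W','B','F','B','W'],
                  ['W','S','F','B','W'],
                  ['W','B','B','B','W'],
                  ['W','W','W','W','W']])

# (row, column) offsets for each direction.
MOVES = {'U': (-1, 0), 'R': (0, 1), 'D': (1, 0), 'L': (0, -1)}
# Assumption: slipping left/right is relative to the direction you are facing.
LEFT_OF = {'U': 'L', 'L': 'D', 'D': 'R', 'R': 'U'}
RIGHT_OF = {'U': 'R', 'R': 'D', 'D': 'L', 'L': 'U'}

def step(world, pos, direction):
    """Move one cell in `direction`; moving into a wall leaves you in place."""
    dr, dc = MOVES[direction]
    nxt = (pos[0] + dr, pos[1] + dc)
    return pos if world[nxt] == 'W' else nxt

def transition_distribution(world, pos, action):
    """Return {next_position: probability} for taking `action` at `pos`."""
    dist = {}
    outcomes = [(action, 0.70), (LEFT_OF[action], 0.15), (RIGHT_OF[action], 0.15)]
    for direction, prob in outcomes:
        nxt = step(world, pos, direction)
        dist[nxt] = dist.get(nxt, 0.0) + prob
    return dist

start = (3, 1)  # position of 'S' in the world array
dist = transition_distribution(world, start, 'U')
print(dist)
```

From the start cell, choosing up moves up with probability 0.70, slips left into the wall (so stays put) with probability 0.15, and slips right into the fire with probability 0.15.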

In [158]:
world = np.array([['W','W','W','W','W'],
                  ['W','B','B','G','W'],
                  ['W','B','F','B','W'],
                  ['W','S','F','B','W'],
                  ['W','B','B','B','W'],
                  ['W','W','W','W','W']])

maze = Maze(world)
print(maze)

[[W  W  W  W  W ] 
 [W  B  B  G  W ] 
 [W  B  F  B  W ] 
 [W  S  F  B  W ] 
 [W  B  B  B  W ] 
 [W  W  W  W  W ]] world map
[[              ] 
 [   0  1  2    ] 
 [   3  4  5    ] 
 [   6  7  8    ] 
 [   9  10 11   ] 
 [              ]] state map


**A good policy:** avoid fire, go to gold

In [159]:
good_policy = ['R','R','R','U','U','U','U','R','U','R','R','U']
good_policy_transition_matrix = maze.get_policy_matrix(good_policy)
maze.make_animation(good_policy_transition_matrix,20)

[[              ] 
 [   R  R  R    ] 
 [   U  U  U    ] 
 [   U  R  U    ] 
 [   R  R  U    ] 
 [              ]] policy map


**A bad policy:** go to fire.

In [161]:
bad_policy = ['R','D','L','R','D','L','R','U','L','U','U','L']
bad_policy_transition_matrix = maze.get_policy_matrix(bad_policy)
maze.make_animation(bad_policy_transition_matrix,20)

[[              ] 
 [   R  D  L    ] 
 [   R  D  L    ] 
 [   R  U  L    ] 
 [   U  U  L    ] 
 [              ]] policy map
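
The two policies can also be compared numerically with the policy evaluation equation. The sketch below does not use the `Maze` class; it rebuilds the transition model from the rules stated earlier and relies on several assumptions that the text does not specify: gold gives reward $+1$ and fire $-1$, both end the episode, all other cells give reward $0$, the discount is $\lambda=0.9$, and slips are relative to the chosen direction.

```python
import numpy as np

world = np.array([['W','W','W','W','W'],
                  ['W','B','B','G','W'],
                  ['W','B','F','B','W'],
                  ['W','S','F','B','W'],
                  ['W','B','B','B','W'],
                  ['W','W','W','W','W']])

MOVES = {'U': (-1, 0), 'R': (0, 1), 'D': (1, 0), 'L': (0, -1)}
LEFT_OF = {'U': 'L', 'L': 'D', 'D': 'R', 'R': 'U'}   # assumed: relative to heading
RIGHT_OF = {'U': 'R', 'R': 'D', 'D': 'L', 'L': 'U'}
REWARD = {'G': 1.0, 'F': -1.0}                       # assumed reward values

# Number the non-wall cells row by row, matching the state map printed above.
cells = [(r, c) for r in range(world.shape[0])
         for c in range(world.shape[1]) if world[r, c] != 'W']
index = {cell: i for i, cell in enumerate(cells)}

def policy_model(policy):
    """Build the reward vector and transition matrix induced by a policy."""
    n = len(cells)
    R, T = np.zeros(n), np.zeros((n, n))
    for s, (r, c) in enumerate(cells):
        tile = str(world[r, c])
        R[s] = REWARD.get(tile, 0.0)
        if tile in REWARD:       # assumed terminal: no outgoing transitions
            continue
        a = policy[s]
        for d, p in [(a, 0.70), (LEFT_OF[a], 0.15), (RIGHT_OF[a], 0.15)]:
            nxt = (r + MOVES[d][0], c + MOVES[d][1])
            if world[nxt] == 'W':  # bumping into a wall leaves you in place
                nxt = (r, c)
            T[s, index[nxt]] += p
    return R, T

def evaluate(policy, discount=0.9):
    """Solve (I - discount * T) U = R for the policy's state values."""
    R, T = policy_model(policy)
    return np.linalg.solve(np.eye(len(R)) - discount * T, R)

good_policy = ['R','R','R','U','U','U','U','R','U','R','R','U']
bad_policy = ['R','D','L','R','D','L','R','U','L','U','U','L']

start = index[(3, 1)]            # the 'S' cell
U_good, U_bad = evaluate(good_policy), evaluate(bad_policy)
print(U_good[start], U_bad[start])
```

Under these assumptions the good policy has a higher value at the start state than the bad one. The exact numbers depend on the assumed rewards, discount, and terminal behaviour, so only the ordering should be read as meaningful.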
