# Markov Decision Process
Suppose the environment our agent operates in is not deterministic, that is, it is stochastic.
This means that an action the agent takes may not lead to the same resulting state each time it is taken.
For example, consider an AI moving along a line that can choose to move forwards or backwards. When it decides on an action, it has a 90% chance of moving as decided, a 5% chance of moving in the opposite direction, and a 5% chance of staying in place.
This loosely simulates real life, as physical equipment does not always produce a deterministic output.
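
As a minimal sketch of the line-walking example, the snippet below samples the noisy outcome of a single command. The function name `noisy_step`, the action labels and the fixed seed are assumptions made for illustration only.

```python
import random

def noisy_step(position, action, rng):
    """Move along a line: 90% as intended, 5% opposite, 5% stay in place."""
    intended = 1 if action == "forward" else -1
    roll = rng.random()
    if roll < 0.90:
        return position + intended
    if roll < 0.95:
        return position - intended
    return position

rng = random.Random(0)  # fixed seed, assumed here only for repeatability
outcomes = [noisy_step(0, "forward", rng) for _ in range(10_000)]
fraction_intended = outcomes.count(1) / len(outcomes)
print(f"moved forward in {fraction_intended:.1%} of trials")
```

Issuing the same command from the same position yields three different resulting states, which is exactly what a deterministic model cannot express.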

To model this kind of problem, we can use a **Markov Decision Process** (MDP).

## Definition
1. States: $S$ ($S$ is the set of states, while $s$ is a specific state)
2. Actions: $A$ ($A$ is the set of actions, while $a$ is a specific action)
3. Transition Model: $T(s,a)$
    * The probability distribution over the states that the agent may transition to upon taking action $a$ in state $s$
4. Reward function: $R$
    * The reward the agent receives when it reaches a certain state
    * The reward can be defined as $R$ : State → $\mathbb{R}$
    * We can also define it as $R$ : State × Action → $\mathbb{R}$, which is an equivalent formulation
5. Initial State
6. Goal
7. Terminal State
    * No action is taken after reaching this state
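
One possible way to hold these components in code is with plain dictionaries. The two-state problem below (`"cool"`, `"hot"`, and the `toy_` names) is hypothetical and exists only to show the shape of each component; it is not the grid example used in this notebook.

```python
import math

# Hypothetical two-state MDP, used only to illustrate the components.
toy_states = ["cool", "hot"]
toy_actions = {"cool": ["work", "rest"], "hot": ["rest"]}

# Transition model: toy_T[(s, a)] maps each next state to its probability.
toy_T = {
    ("cool", "work"): {"cool": 0.7, "hot": 0.3},
    ("cool", "rest"): {"cool": 1.0},
    ("hot", "rest"): {"cool": 0.6, "hot": 0.4},
}

# Reward function R : State -> real number.
toy_R = {"cool": 1.0, "hot": -1.0}

toy_initial_state = "cool"

def is_valid_model(states, actions, T):
    """Check that every (state, action) pair has a distribution summing to 1."""
    for s in states:
        for a in actions[s]:
            dist = T[(s, a)]
            if any(s2 not in states or p < 0 for s2, p in dist.items()):
                return False
            if not math.isclose(sum(dist.values()), 1.0):
                return False
    return True

print(is_valid_model(toy_states, toy_actions, toy_T))
```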

### Example
```
---------------------------------
|       |       |       |      1|
|       |       |       |       |
|8      |9      |10     |11     |
---------------------------------
|       |       |       |     -1|
|       |       |       |       |
|5      |6      |X      |7      |
---------------------------------
|       |       |       |       |
|       |       |       |       |
|1      |2      |3      |4      |
---------------------------------
```

In the above diagram, the number in the bottom left of each cell is the state number (X is an unreachable state), and the number in the top right is the reward. All unlabeled cells have a reward of -0.4, to incentivize the agent to reach the goal along the shortest path.
The initial state is state 1.
The agent can only move in one of the 4 cardinal directions.

The agent moves in the desired direction with probability 80%; the remaining 20% is split equally, 5% each, among the three other directions and not moving.
We can imagine it as a faulty robot that performs the issued command 80% of the time and makes a random other move otherwise.

Suppose we were in a deterministic world; then the obvious **plan** would be "UURRR".
However, if we gave that instruction to the agent, all five moves would go as intended only with probability $0.8^5\approx 32.8\%$, which is rather poor.
A better solution is to tell the agent which action to take in each state it may find itself in.
This is what we call a **policy**.
It allows the agent to keep heading for the goal even if it ends up in a state that we did not "plan" for.
Thus, we wish to find a policy that maximizes the reward obtained.
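
The arithmetic behind the plan's success rate, and the difference in shape between a plan and a policy, can be sketched as follows. The partial `example_policy` mapping is a hypothetical illustration, not a computed result.

```python
plan = "UURRR"      # fixed sequence of moves, executed blindly
p_intended = 0.8    # chance that a single move goes as commanded

# The plan only works as written if every one of its moves goes as intended.
p_plan_succeeds = p_intended ** len(plan)
print(f"{p_plan_succeeds:.1%}")  # prints 32.8%

# A policy is a lookup from state to action, so it still has an answer
# after a slip. (Hypothetical partial mapping, for illustration only.)
example_policy = {1: "U", 2: "U", 5: "U", 8: "R", 9: "R", 10: "R"}
slipped_into = 2
print(example_policy[slipped_into])
```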

### Utility of Sequence
Suppose our agent traverses a sequence of states $s_0, s_1, \dots$.
A simple way to measure the utility of the sequence is additive: we compute $R(s_0) + R(s_1) + \dots$.
However, since the agent can potentially wander around the problem forever, the utility could be infinite.
This makes it hard to compare two sequences if neither of their utilities converges.

Thus, we use a discounted model instead: $R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \dots$, where $0 \le \gamma < 1$.
Future rewards are weighted less, which ensures convergence as long as the rewards are bounded.
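
A short sketch of the two measures, assuming a constant reward of 1 at every step (an assumption chosen so the limit is easy to check by hand): the additive sum keeps growing with the length of the sequence, while the discounted sum approaches $\frac{1}{1-\gamma}$.

```python
def discounted_utility(rewards, gamma):
    """Sum of gamma^t * R(s_t) over a finite prefix of the sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [1.0] * 1000            # hypothetical reward of 1 at every step
additive = sum(rewards)           # grows without bound as the sequence gets longer
discounted = discounted_utility(rewards, gamma=0.9)  # approaches 1 / (1 - 0.9) = 10

print(additive, round(discounted, 6))
```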

The best policy is the one with the largest expected discounted reward, that is, the utility averaged over all possible sequences the agent can traverse from the starting state, weighted by their probability.

### Inspiration

Suppose we are at state 1, deciding whether to go to state 5 or state 2.
Suppose that we somehow knew the "potential value" of state 2 and state 5.
From the grid, we can see that state 5 is probably "better" than state 2, because a path through it is less likely to stray to the right into the terminal state with reward -1 (the agent, however, does not know this).
Thus, the potential value of state 5 (if we calculated it) would be higher than that of state 2.
Hence, it makes sense for our agent to take the action that gives it a higher probability of entering state 5 than state 2.

We can formalize this intuition as follows.

### Formalization
$U^\pi$: Utility function of the policy

$U^\pi(s)$: Utility of the policy at state $s$, that is, the utility of following the policy when the agent starts at $s$.

$\pi^*(s)$: best policy for state $s$, $\pi^*(s) = \arg\max_\pi U^\pi(s)$

$P(s'|s, \pi(s))$: Probability that we get to $s'$ from $s$ after taking the action chosen by policy $\pi$


$\pi^*(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s'|s, a)U^{\pi^*}(s')$: the best action to take at state $s$ is the one with the highest expected utility across all possible resulting states.


It is important to note that the optimal policy is independent of the starting state, since the policy needs to map every state to an action.

We denote $U(s) = U^{\pi^*}(s)$ as shorthand for the utility of the state using the optimal policy.


$$
\begin{align}
U^\pi(s) &= E\left[\sum _{t=0} ^ \infty \gamma^t R(S_t) \mid S_0 = s\right] & \text{Infinite horizon} \\
&=E\left[R(S_0) + \sum _{t=1} ^ \infty \gamma^t R(S_t) \mid S_0 = s\right]&\\
&=R(s) + E\left[\sum _{t=1} ^ \infty \gamma^t R(S_t) \mid S_0 = s\right]&\\
&=R(s) + \gamma \sum _{s'} P(s' | s, \pi(s)) \left(R(s') + E\left[\sum _{t=2} ^ \infty \gamma^{t-1} R(S_t) \mid S_1 = s'\right]\right)&\\
&=R(s) + \gamma \sum _{s'} P(s' | s, \pi(s)) \left(R(s') + E\left[\sum _{t'=1} ^ \infty \gamma^{t'} R(S_{t'}) \mid S_0 = s'\right]\right)& \text{(the two series are identical)}
\end{align}
$$

Thus, we get the following recurrence

$U^\pi(s) = R(s) + \gamma \sum _{s'} P(s' | s, \pi(s)) U^\pi(s')$

For the optimal policy, this becomes

$U(s) = R(s) + \gamma \sum _{s'} P(s' | s, \pi^*(s)) U(s')$

$= R(s) + \gamma \max _{a \in A(s)} \sum _{s'} P(s' | s,a) U(s')$, as the optimal policy picks the action with the highest expected utility.

This matches our inspiration.

This is also known as the **Bellman Equation**.

$$U(s) = R(s) + \gamma \max _{a \in A(s)} \sum _{s'} P(s' | s, a) U(s')$$

We can write this equation for every state and solve the resulting system for the utility of every state.

However, this is rather difficult because the equations are non-linear (the max operator is non-linear).

## Value Iteration
1. Initialize the utilities to some value, for instance 0.
2. Let $U_i(s)$ be the utility value of state $s$ at iteration $i$.
3. Update every state: $U_{i+1}(s) \leftarrow R(s) + \gamma \max _{a \in A(s)} \sum _{s'} P(s'|s, a) U_i (s')$
4. Repeat step 3 until the change in utility is insignificant.

It can be proven that $\lim_{i \rightarrow \infty} U_i(s) = U(s)$, which is useful since we can start from an arbitrary initial value and converge towards the correct answer by iterating.
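
The steps above, including the "repeat until insignificant change" stopping rule, can be sketched on a small self-contained problem. The two-state toy MDP, the `toy_` names, and the tolerance `tol` are assumptions made for illustration; they are not the grid example.

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-9, max_iters=10_000):
    """Iterate the Bellman update until the largest change falls below `tol`."""
    U = {s: 0.0 for s in states}
    for _ in range(max_iters):
        U_next = {}
        for s in states:
            # Expected utility of each available action under the current U.
            expected = [
                sum(p * U[s2] for s2, p in T[(s, a)].items()) for a in actions[s]
            ]
            U_next[s] = R[s] + gamma * (max(expected) if expected else 0.0)
        delta = max(abs(U_next[s] - U[s]) for s in states)
        U = U_next
        if delta < tol:
            break
    return U

# Hypothetical toy problem: "go" reaches the goal 80% of the time.
toy_states = ["start", "goal"]
toy_actions = {"start": ["go", "wait"], "goal": []}
toy_T = {
    ("start", "go"): {"goal": 0.8, "start": 0.2},
    ("start", "wait"): {"start": 1.0},
}
toy_R = {"start": 0.0, "goal": 1.0}

toy_U = value_iteration(toy_states, toy_actions, toy_T, toy_R)
print(toy_U)
```

For this toy problem the fixed point can be checked by hand: $U(\text{start}) = 0.9\,(0.8 \cdot 1 + 0.2\,U(\text{start}))$, which gives $U(\text{start}) = 0.72 / 0.82$.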

In [63]:
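# Moves available in each state: U(p), D(own), L(eft), R(ight).
# States 7 and 11 are terminal, so they have no moves.
actions = {
    1: "UR",
    2: "LUR",
    3: "LR",
    4: "UL",
    5: "URD",
    6: "ULD",
    7: "",
    8: "RD",
    9: "LDR",
    10: "LR",
    11: "",
}

def next_state(state, action):
    """Return the state reached if `action` goes as intended from `state`."""
    # X is skipped in the numbering, so the offset between vertically
    # adjacent states is either 4 or 3 depending on the state.
    if action == 'U':
        return state + 4 if state <= 2 else state + 3
    if action == 'D':
        return state - 4 if state <= 6 else state - 3
    if action == 'L':
        return state - 1
    if action == 'R':
        return state + 1
    raise ValueError(f"unknown action: {action!r}")

# States reachable from each state by any of its available moves.
next_states = {state: [next_state(state, a) for a in moves] for state, moves in actions.items()}

def transition(state, action):
    """List (probability, new_state) pairs for taking `action` in `state`."""
    intended = next_state(state, action)
    others = len(next_states[state]) - 1
    return [(0.8, s) if s == intended else (0.2 / others, s) for s in next_states[state]]

# Simplified noise model: the intended move succeeds with probability 0.8 and
# the remaining 0.2 is split evenly among the other available moves.
T = {state: {a: transition(state, a) for a in moves} for state, moves in actions.items()}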

In [64]:
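from collections import defaultdict

# Rewards: -1 and +1 at the terminal states; every other state defaults to 0 here.
R = defaultdict(int)
R[7] = -1
R[11] = 1

states = list(range(1, 12))

# Utility estimates, initialized to 0 for every state.
U = defaultdict(int)

def bellman(U, states, actions, R, T, gamma=0.9):
    """Apply one Bellman update to every state and return the new utilities."""
    U_p = defaultdict(int)
    for s in states:
        # Expected utility of each available action under the current estimates.
        arr = [sum(p * U[new_s] for p, new_s in T[s][a]) for a in actions[s]]

        # Terminal states have no actions, so they get no future utility.
        next_u = max(arr) if arr else 0
        U_p[s] = R[s] + gamma * next_u
    return U_p

# A fixed number of sweeps is used here instead of a convergence test.
for _ in range(100):
    U = bellman(U, states, actions, R, T)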

In [65]:
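def best_action(U, actions):
    """For each non-terminal state, pick the move whose intended successor has the highest utility."""
    result = {}
    for state in U:
        # Ranking by the intended successor's utility gives the same ordering
        # as ranking by expected utility under T, because T always gives the
        # intended move the largest share of the probability.
        arr = [(U[next_state(state, a)], a) for a in actions[state]]
        if not arr:
            continue  # terminal state: nothing to choose
        result[state] = max(arr)[1]
    return result

best_action(U, actions)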

{1: 'U', 2: 'U', 3: 'L', 4: 'L', 5: 'U', 6: 'U', 8: 'R', 9: 'R', 10: 'R'}

```
---------------------------------
|       |       |       |      1|
|   >   |   >   |   >   |       |
|8      |9      |10     |11     |
---------------------------------
|       |       |       |     -1|
|   ^   |   ^   |       |       |
|5      |6      |X      |7      |
---------------------------------
|       |       |       |       |
|   ^   |   ^   |   <   |   <   |
|1      |2      |3      |4      |
---------------------------------
```

Hence, we get the policy above, which fits our intuition.

## Policy Iteration

Goal: find $\pi^* : State \rightarrow Action$

First, initialize the policy $\pi_0 : State \rightarrow Action$ with an arbitrary mapping.

We can then evaluate the utility of each state under the current policy $\pi_i$:

$$
U^{\pi_i}(s) = R(s) + \gamma \sum _{s'} P(s' | s, \pi_i(s)) U^{\pi_i}(s')
$$

If we have $n$ states, this gives $n$ linear equations in $n$ unknowns (there is no max, since the action is fixed by the policy).

We can solve them with Gaussian elimination in $O(n^3)$ time.

Then, we update our policy:

$$\pi_{i+1}(s) = \arg\max_{a \in A(s)} \sum _{s'} P(s' | s,a) U^{\pi_i} (s')$$

That is, for each state $s$, we select the action with the highest expected utility.

We repeat the evaluation and update steps until the policy no longer changes:

$\forall s, \lim_{i \rightarrow \infty} \pi_i(s) = \pi^*(s)$
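
A minimal sketch of the evaluate-then-improve loop, assuming a small hand-written Gaussian elimination for the linear solve and a hypothetical two-state toy problem (all `toy_` names and helper names are illustrative assumptions):

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting, O(n^3)."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def evaluate_policy(policy, states, T, R, gamma):
    """Solve (I - gamma * P_pi) U = R for the utilities under a fixed policy."""
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for s in states:
        a = policy.get(s)
        if a is None:  # terminal state: U(s) = R(s)
            continue
        for s2, p in T[(s, a)].items():
            A[index[s]][index[s2]] -= gamma * p
    return dict(zip(states, solve_linear(A, [R[s] for s in states])))

def expected_utility(s, a, T, U):
    return sum(p * U[s2] for s2, p in T[(s, a)].items())

def policy_iteration(states, actions, T, R, gamma=0.9):
    policy = {s: actions[s][0] for s in states if actions[s]}  # arbitrary start
    while True:
        U = evaluate_policy(policy, states, T, R, gamma)
        improved = {
            s: max(actions[s], key=lambda a: expected_utility(s, a, T, U))
            for s in states if actions[s]
        }
        if improved == policy:
            return policy, U
        policy = improved

# Hypothetical toy problem: "go" reaches the goal 80% of the time.
toy_states = ["start", "goal"]
toy_actions = {"start": ["wait", "go"], "goal": []}
toy_T = {
    ("start", "go"): {"goal": 0.8, "start": 0.2},
    ("start", "wait"): {"start": 1.0},
}
toy_R = {"start": 0.0, "goal": 1.0}

toy_policy, toy_U = policy_iteration(toy_states, toy_actions, toy_T, toy_R)
print(toy_policy, toy_U)
```

The initial policy here is "wait", which evaluates to a utility of 0 at the start state; a single improvement step switches it to "go", after which the policy no longer changes.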

---

All the methods so far require iterating over the transition model $T(s,a)$, the distribution over next states given an action.
This may be too large for certain problems.
For example, the coordinates of a robot in real life can take any real value.


## Q Learning
Instead of evaluating every action at every state with the transition model, we learn which action to take at each state from experience.

By the Bellman Equation

$$U(s) = R(s) + \gamma \max _{a \in A(s)} \sum _{s'} P(s' | s, a) U(s')$$

We want to move the max outside, such that our utility function is simply:

$$
\begin{align}
U(s) &= R(s) + \gamma \max _{a \in A(s)} \sum _{s'} P(s' | s, a) U(s') \\
&= \max _{a \in A(s)}  \left( R(s) + \gamma \sum _{s'} P(s' | s, a) U(s') \right)\\
&= \max _{a \in A(s)} Q(s,a)
\end{align}
$$

Hence, getting our (desired) definition of $Q(s,a) = R(s) + \gamma \sum _{s'} P(s' | s, a) U(s') $.

Expanding this definition:

$$
\begin{align}
Q(s,a) &= R(s) + \gamma \sum _{s'} P(s' | s, a) U(s') \\
&= R(s) + \gamma \sum _{s'} P(s' | s, a) \max_{a'} Q(s',a')
\end{align}
$$

Hence, our iterative process is as follows:

1. Initialize $\hat Q_0(s,a)$
2. Choose an action $a$ in the current state $s$ and observe the new state $s'$
3. Update $\hat Q_i(s,a)$

For every pair other than $(s,a)$, the estimate is left unchanged: $\hat Q_i(s', a') = \hat Q_{i-1}(s',a')$

### Choose action(s)
* $\hat a \leftarrow \arg\max_a \hat Q(s,a)$, the "best action" found so far
* With probability $\beta$ we choose $\hat a$; otherwise we choose a random move, or one sampled based on $\hat Q(s,a)$
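
The action-selection rule above can be sketched as follows. The function name `choose_action`, the per-state `q_values` dictionary and the seeded generator are assumptions made for illustration; the fallback shown is the uniformly random move, not the variant that samples based on $\hat Q(s,a)$.

```python
import random

def choose_action(q_values, available, beta, rng=random):
    """Pick the greedy action with probability beta, else a uniformly random one.

    `q_values` maps each action to its current estimate for the current state.
    """
    if not available:
        return None  # terminal state: nothing to choose
    if rng.random() < beta:
        return max(available, key=lambda a: q_values.get(a, 0.0))
    return rng.choice(available)

rng = random.Random(0)  # seeded only to make the sketch repeatable
estimates = {"U": 0.5, "R": 0.2}  # hypothetical estimates for one state
picks = [choose_action(estimates, ["U", "R"], beta=0.8, rng=rng) for _ in range(1000)]
share_greedy = picks.count("U") / len(picks)
print(f"greedy action chosen in {share_greedy:.0%} of draws")
```

Note that the greedy action is chosen somewhat more often than $\beta$, because the random branch can also land on it.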

### Update $\hat Q(s,a)$
This expression is the value we observe if we take action $a$ in state $s$ and land in state $s'$:

$$R(s) + \gamma \max_{a'} \hat Q(s', a') $$

However, this is only a single sample, and over time we want the agent to rely more on what it has already learnt. Thus, our update blends the new sample with the old estimate:

$$\hat Q(s,a) \leftarrow r( R(s) + \gamma \max_{a'} \hat Q(s',a')) + (1-r) \hat Q(s,a)$$

where $r$ is the learning rate.

#### Learning Rate
When $r = 1$, we get back our definition of Q, evaluated on the sampled next state.

When $r = 0$, we stop updating our Q's.

Initially, we want a high $r$ that tapers to 0 over time.

Similar to [simulated annealing](./local_search#simulated-annealing.ipynb), we want to explore a lot in the early phase and less later, so we can use $\frac{1}{t}$, for example.

However, we would like the rate to depend on $(s,a)$, so we define

$N(s,a)$: the number of times action $a$ was taken at state $s$.

We then want a function $r(N(s,a))$ that is decreasing in $N(s,a)$.
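
One possible choice, sketched below, is $r(N) = \frac{1}{1+N}$. The names `visit_counts`, `q_table`, `learning_rate` and `update` are hypothetical. With this particular schedule each entry ends up as the running mean of the samples it has seen.

```python
from collections import defaultdict

visit_counts = defaultdict(int)   # N(s, a): times action a was taken in state s
q_table = defaultdict(float)      # current estimate of Q(s, a)

def learning_rate(n_visits):
    """A decreasing schedule: 1, 1/2, 1/3, ... (one possible choice)."""
    return 1.0 / (1 + n_visits)

def update(s, a, sample):
    """Blend a new sample into Q(s, a) using a per-pair learning rate."""
    r = learning_rate(visit_counts[(s, a)])
    q_table[(s, a)] = r * sample + (1 - r) * q_table[(s, a)]
    visit_counts[(s, a)] += 1

for sample in [1.0, 0.0, 2.0]:    # hypothetical samples for one (s, a) pair
    update("s", "a", sample)

print(q_table[("s", "a")], visit_counts[("s", "a")])
```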

#### Implementation

In [128]:
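from collections import defaultdict
from copy import deepcopy
from random import choice

# Q[state][action]: current estimate of the value of taking `action` in `state`.
Q = defaultdict(lambda: defaultdict(int))

def random_action(actions, state):
    """Return a uniformly random available move, or None in a terminal state."""
    arr = actions[state]
    if not arr:
        return None
    return choice(arr)

def q_learning(Q, states, R, action_func, r=0.9, gamma=0.9):
    """Update one (state, action) entry per state and return the new table."""
    new_Q = deepcopy(Q)
    for state in states:
        action = action_func(state)

        if action:
            # For simplicity, the chosen move is applied deterministically here.
            s_prime = next_state(state, action)
            future_states = Q[s_prime].values()
            best_reward = max(future_states) if future_states else 0
        else:
            # Terminal state: stored under the action None, with no future reward.
            best_reward = 0
        new_Q[state][action] = r * (R[state] + gamma * best_reward) + (1-r) * Q[state][action]

    return new_Q

# The learning rate decays as 1/t over the iterations.
for i in range(10000):
    Q = q_learning(Q, states, R, lambda s: random_action(actions, s), r=1/(i+1))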

In [133]:
# Best (action, value) pair learnt for each state.
[max(q_values.items(), key=lambda x: x[1]) for q_values in Q.values()]

[('U', 0.19607454360750393),
 ('U', 0.27901898160925537),
 ('L', 0.16644153675358755),
 ('L', 0.08908041460559306),
 ('U', 0.3219448214254002),
 ('U', 0.48120766924900515),
 (None, -1.0),
 ('R', 0.5415381022026509),
 ('R', 0.7052872671551181),
 ('R', 0.8889535139134304),
 (None, 1.0)]

And indeed, we get the same decision as when we computed the utility values.

### Approximate Q-function

The table $\hat Q(s,a)$ has one entry per state-action pair, and the number of pairs $|\{(s,a)\}|$ can be very large, so we want to reduce the search space.

Thus, we instead define Q as

$Q(s,a) = \sum ^n _{i=1} f_i(s,a) w_i$

where each $f_i$ is a feature function that, given a state and an action, returns a value for that feature, and $w_i$ is the weight of that feature.

#### Dimensionality reduction

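$\hat w_i \leftarrow \hat w_i + \alpha [R(s) + \gamma \max_{a'} \hat Q(s', a') - \hat Q(s,a)] \frac{\partial \hat Q(s,a)}{\partial \hat w_i} $

Here $\alpha$ is the learning rate, playing the same role as $r$ above. Writing

$\text{difference} = R(s) + \gamma \max_{a'} \hat Q(s', a') - \hat Q(s,a)$

and noting that $\frac{\partial \hat Q(s,a)}{\partial \hat w_i} = f_i(s,a)$ for the linear form above, the update becomes

$\hat w_i \leftarrow \hat w_i + \alpha \cdot \text{difference} \cdot f_i(s,a)$

So the problem reduces to finding the correct weight for each of the features in order to obtain the optimal policy.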
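
The weight update can be sketched for a single observed transition as follows. The feature values, the default `alpha` and `gamma`, and the function names `q_hat` and `update_weights` are assumptions made for illustration.

```python
def q_hat(weights, features):
    """Linear approximation: Q(s, a) = sum_i f_i(s, a) * w_i."""
    return sum(w * f for w, f in zip(weights, features))

def update_weights(weights, features, reward, best_next_q, alpha=0.1, gamma=0.9):
    """One update of every weight for an observed transition (s, a, s')."""
    difference = reward + gamma * best_next_q - q_hat(weights, features)
    return [w + alpha * difference * f for w, f in zip(weights, features)]

# Hypothetical feature values f_1(s, a), f_2(s, a) for one state-action pair.
features = [1.0, 2.0]
weights = [0.0, 0.0]
new_weights = update_weights(weights, features, reward=1.0, best_next_q=0.0)

print(new_weights, q_hat(new_weights, features))
```

Only $n$ weights are stored, regardless of how many state-action pairs exist, and every update moves the estimate for the visited pair towards the observed target.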