# Markov Decision Process (MDP)

**Definition: An MDP is a tuple $\langle S, A, P, R, \gamma \rangle$**

- (In addition to the MRP) $A$ is a finite set of actions, and $P$ and $R$ now depend on the chosen action
    - $P_{ss'}^a = P[S_{t+1} = s' | S_t = s, A_t = a]$
    - $R_s^a = E[R_{t+1} | S_t = s, A_t = a]$
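A minimal sketch of how the tuple could be stored as NumPy arrays. The 2-state, 2-action MDP below is hypothetical and its numbers are made up for illustration; the array layouts `P[a, s, s2]` and `R[s, a]` are an assumption of this sketch (the Lab below uses a dictionary for `P` and a per-state reward list instead).

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers chosen only for illustration)
n_states, n_actions = 2, 2
gamma = 0.9  # discount factor

# P[a, s, s2] = P[S_{t+1} = s2 | S_t = s, A_t = a]
P = np.array([
    [[1.0, 0.0],   # action 0 taken in state 0
     [0.5, 0.5]],  # action 0 taken in state 1
    [[0.2, 0.8],   # action 1 taken in state 0
     [0.0, 1.0]],  # action 1 taken in state 1
])

# R[s, a] = E[R_{t+1} | S_t = s, A_t = a]
R = np.array([[0.0, 1.0],
              [5.0, 0.0]])

# Every (a, s) row must be a probability distribution over next states
rows_sum_to_one = np.allclose(P.sum(axis=2), 1.0)
print(rows_sum_to_one)
```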

*Environment -> States are Markov*

Environment - Markov chain  
Agent - has the authority to take actions.

Now the process is active, not passive: the agent chooses actions instead of only observing transitions.

The key process of an MDP:
![MDP_process](./MDPprocess.png)

### Deterministic VS Stochastic

Deterministic: certain. There is exactly one possible next state (p = 1).  

Stochastic: uncertain. Several next states are possible (e.g. p1 = 0.2, p2 = 0.8).
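As a rough illustration, the two cases can be sampled with NumPy. The two-state setup and the fixed seed are assumptions made only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only for reproducibility

# Each vector is a distribution over 2 hypothetical next states
deterministic = np.array([1.0, 0.0])   # p = 1: always next state 0
stochastic = np.array([0.2, 0.8])      # p1 = 0.2, p2 = 0.8

det_samples = rng.choice(2, size=1000, p=deterministic)
sto_samples = rng.choice(2, size=1000, p=stochastic)

print(set(det_samples.tolist()))  # only one next state ever appears
print(sto_samples.mean())         # fraction of visits to state 1, close to 0.8
```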

## Policy  $\pi$

A policy is a mapping that specifies which action to take in each state.  

-> Once the policy is fixed, the action in every state is fixed, so the MDP reduces to an MRP.  

**A policy can be deterministic or stochastic.**  
Here, the policy matrix -> $\pi(a|s)$

- A policy depends only on the current state.
- A policy is independent of time: policies are stationary -> (Optimal)

$P^\pi$ is a matrix containing the probabilities of each transition under policy $\pi$:

$$P^\pi_{ss'} = \sum_{a \in A}{\pi(a|s) P_{ss'}^a}$$

That is, A + P -> $P^\pi$
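A sketch of how $P^\pi$ could be computed, reusing the transition dictionary from the Lab section. The dense tensor `P_a` and the uniform policy `pi` are assumptions introduced here for illustration.

```python
import numpy as np

# Same dictionary layout as the Lab: P[s][a] = [(probability, next_state), ...]
P = {
    0: {0: [(1, 0)], 1: [(0.5, 0), (0.5, 1)]},
    1: {0: [(0.5, 0), (0.5, 3)], 1: [(1, 1)]},
    2: {0: [(0.5, 0), (0.5, 2)], 1: [(0.5, 0), (0.5, 1)]},
    3: {0: [(0.5, 2), (0.5, 3)], 1: [(1, 1)]},
}
n_states, n_actions = 4, 2

# Dense tensor P_a[a, s, s2] built from the dictionary
P_a = np.zeros((n_actions, n_states, n_states))
for s in range(n_states):
    for a in range(n_actions):
        for prob, s2 in P[s][a]:
            P_a[a, s, s2] += prob

# Hypothetical stochastic policy: pi[s, a] = pi(a | s), here uniform
pi = np.full((n_states, n_actions), 0.5)

# P_pi[s, s2] = sum_a pi(a | s) * P_a[a, s, s2]
P_pi = np.einsum("sa,ast->st", pi, P_a)
print(P_pi)
```

Each row of `P_pi` is again a probability distribution, which is why fixing a policy turns the MDP into an MRP.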

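### Questions
- How many possible policies are there in our example?  
  Depending on the actions, many different policies are possible. Counting only deterministic policies, there are $|A|^{|S|}$ of them.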

- Which of the above two policies is best?

- How do you compute the optimal policy?

## Value Function

### State-Value Function $v_{\pi}(s)$
The value function obtained by starting from state s and following policy $\pi$.  

$$\begin{aligned}
v_{\pi}(s) &= E_{\pi}[G_t \mid S_t = s] \\
&= E_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s]
\end{aligned}$$
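The step from the first to the second line relies on the recursion $G_t = R_{t+1} + \gamma G_{t+1}$. A small sketch on a single sample path, where the reward sequence is hypothetical:

```python
# Hypothetical reward sequence R_{t+1}, R_{t+2}, ... observed along one episode
rewards = [0, 0, 10, 10]
gamma = 0.9

def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
    return g

g_t = discounted_return(rewards, gamma)
g_next = discounted_return(rewards[1:], gamma)

print(g_t)
print(rewards[0] + gamma * g_next)  # same value, by the recursion
```

The value function is the expectation of this quantity over all paths generated by the policy.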

### Action-Value Function $q_{\pi}(s, a)$

Take action a in the current state, then follow policy $\pi$ from the next state s' onward.  


$$\begin{aligned}
q_{\pi}(s, a) &= E_{\pi}[G_t \mid S_t = s, A_t = a] \\
&= E_{\pi}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]
\end{aligned}$$

The Flow of transition between states following the Policy
![Policy_process](resource/PolicyProcess.png)

**Relationship between $v_{\pi}(s)$ and $q_{\pi}(s, a)$**
$$v_{\pi}(s) = \sum_{a\in A}{\pi (a|s)q_{\pi}(s,a)}$$
![State_to_Action](resource/StateToActionVF.png)
The state-value function averages over every action available in state s (weighted by the policy), whereas the action-value function takes action a in state s and follows the policy only from then on.  

$$ q_{\pi}(s,a) = R(s) + \gamma \sum_{s'\in S}{P(s'|s, a)v_{\pi}(s')} $$
![Action_to_State](resource/ActionToStateVF.png)
R(s) = the reward for that state, i.e. the immediate reward  
Value function -> the expected value of the return (G) collected from state s until the end  

Here, P is the state transition matrix.
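The two relations can be checked numerically. The sketch below assumes the Lab's MDP and a hypothetical uniform policy: it evaluates $v_\pi$ by repeated backups, builds $q_\pi$ from $v_\pi$, and then recovers $v_\pi$ from $q_\pi$.

```python
# Lab MDP: P[s][a] = [(probability, next_state), ...]
P = {
    0: {0: [(1, 0)], 1: [(0.5, 0), (0.5, 1)]},
    1: {0: [(0.5, 0), (0.5, 3)], 1: [(1, 1)]},
    2: {0: [(0.5, 0), (0.5, 2)], 1: [(0.5, 0), (0.5, 1)]},
    3: {0: [(0.5, 2), (0.5, 3)], 1: [(1, 1)]},
}
R = [0, 0, 10, 10]
gamma = 0.9
states, actions = [0, 1, 2, 3], [0, 1]

# Hypothetical uniform policy: pi[s][a] = pi(a | s)
pi = [[0.5, 0.5] for _ in states]

# Evaluate v_pi by repeated Bellman expectation backups
v = [0.0] * 4
for _ in range(500):
    v = [R[s] + gamma * sum(pi[s][a] * sum(p * v[s2] for p, s2 in P[s][a])
                            for a in actions)
         for s in states]

# q_pi(s, a) = R(s) + gamma * sum_{s'} P(s' | s, a) * v_pi(s')
q = [[R[s] + gamma * sum(p * v[s2] for p, s2 in P[s][a]) for a in actions]
     for s in states]

# v_pi(s) = sum_a pi(a | s) * q_pi(s, a)
v_from_q = [sum(pi[s][a] * q[s][a] for a in actions) for s in states]

print(v)
print(v_from_q)  # matches v up to numerical error
```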

In conclusion, *State(Reward) -> Policy -> Transition -> State'*  

## Bellman Optimality Equation

### SVF $v_*(s)$

**The optimal state-value function $v_*(s)$ = maximum value function over policies**  

$$v_*(s) = \max_{\pi} v_{\pi}(s)$$

### AVF $q_*(s, a)$

**The optimal action-value function $q_*(s, a)$ = maximum action-value function over policies**  

$$q_*(s, a) = \max_{\pi} q_{\pi}(s, a)$$

### Optimal Policy
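$\pi_* = \arg\max_{\pi} v_{\pi}(s)$ for every state s  
Here argmax returns the policy $\pi$ at which the state-value function is maximal.   
Therefore, once $v_*(s)$ and $q_*(s,a)$ are known, the optimal policy $\pi_*$ can be obtained as well.   

$$\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a \in A} q_*(s, a) \\ 0 & \text{otherwise} \end{cases}$$  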

- There is always a deterministic optimal policy for an MDP  

- If we know $q_*(s,a)$, we can have the optimal policy
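A short sketch of reading a deterministic policy off a $q_*$ table. The table `q_star` is hypothetical and its numbers are illustrative only.

```python
import numpy as np

# Hypothetical q* table, q_star[s, a]; the numbers are illustrative only
q_star = np.array([[1.0, 3.0],
                   [2.0, 0.5],
                   [4.0, 4.5]])

# argmax_a q*(s, a) for every state
best_actions = q_star.argmax(axis=1)

# One-hot policy matrix: pi_star[s, a] = 1 for the greedy action, 0 otherwise
pi_star = np.zeros_like(q_star)
pi_star[np.arange(q_star.shape[0]), best_actions] = 1.0

print(best_actions)
print(pi_star)
```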

## Value Iteration
- Algorithm

1. Initialize an estimate of the value function arbitrarily (or to zeros)
$$ v(s) \leftarrow 0, \quad s \in S $$
2. Repeat the update until convergence
$$ v(s) \leftarrow R(s) + \gamma \max_a \sum_{s'\in S}{P(s' \mid s,a)v(s')}, \quad s \in S $$
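The two steps above can be sketched with an explicit stopping rule. This is one possible variant, not the Lab's version: it assumes the Lab's MDP, updates all states synchronously, and stops when the largest change falls below a tolerance `tol` (a parameter introduced here).

```python
# Lab MDP: P[s][a] = [(probability, next_state), ...]
P = {
    0: {0: [(1, 0)], 1: [(0.5, 0), (0.5, 1)]},
    1: {0: [(0.5, 0), (0.5, 3)], 1: [(1, 1)]},
    2: {0: [(0.5, 0), (0.5, 2)], 1: [(0.5, 0), (0.5, 1)]},
    3: {0: [(0.5, 2), (0.5, 3)], 1: [(1, 1)]},
}
R = [0, 0, 10, 10]
gamma = 0.9
states, actions = [0, 1, 2, 3], [0, 1]

def value_iteration(tol=1e-10, max_iters=10_000):
    v = [0.0] * len(states)               # step 1: initialize to zeros
    for _ in range(max_iters):            # step 2: repeat the backup
        new_v = [R[s] + gamma * max(sum(p * v[s2] for p, s2 in P[s][a])
                                    for a in actions)
                 for s in states]
        delta = max(abs(n - o) for n, o in zip(new_v, v))
        v = new_v
        if delta < tol:                   # stop once the update is negligible
            break
    return v

v_star = value_iteration()
print(v_star)
```

A tolerance-based stop avoids guessing a fixed number of sweeps; the Lab's in-place loop with 100 sweeps reaches roughly the same values.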

## Policy Iteration
- Given a policy $\pi$, evaluate it (compute $v_{\pi}$)
- Improve the policy by acting greedily with respect to $v_{\pi}$

**Update $\pi$ to be the *greedy policy* with respect to $v_{\pi}$**

$$\pi(s) \leftarrow \arg\max_a \sum_{s'\in S}{P(s' \mid s,a)v_{\pi}(s')}$$
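A sketch of the evaluate/improve loop, assuming the Lab's MDP. Unlike the Lab code, this variant evaluates the policy exactly with a linear solve and stops as soon as the policy no longer changes; the helper names `evaluate` and `improve` are hypothetical.

```python
import numpy as np

# Lab MDP: P[s][a] = [(probability, next_state), ...]
P = {
    0: {0: [(1, 0)], 1: [(0.5, 0), (0.5, 1)]},
    1: {0: [(0.5, 0), (0.5, 3)], 1: [(1, 1)]},
    2: {0: [(0.5, 0), (0.5, 2)], 1: [(0.5, 0), (0.5, 1)]},
    3: {0: [(0.5, 2), (0.5, 3)], 1: [(1, 1)]},
}
R = np.array([0.0, 0.0, 10.0, 10.0])
gamma = 0.9
n_states, n_actions = 4, 2

def evaluate(policy):
    """Solve v = R + gamma * P_pi v exactly for a deterministic policy."""
    P_pi = np.zeros((n_states, n_states))
    for s in range(n_states):
        for p, s2 in P[s][policy[s]]:
            P_pi[s, s2] += p
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)

def improve(v):
    """Greedy policy: argmax_a sum_{s'} P(s' | s, a) v(s')."""
    return [int(np.argmax([sum(p * v[s2] for p, s2 in P[s][a])
                           for a in range(n_actions)]))
            for s in range(n_states)]

policy = [0, 0, 0, 0]             # arbitrary starting policy
for _ in range(100):              # safety cap on the number of rounds
    v = evaluate(policy)
    new_policy = improve(v)
    if new_policy == policy:      # policy is stable: stop
        break
    policy = new_policy

print(policy)
print(v)
```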

## Principle of Optimality

- Shortest Path  
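The principle of optimality says that every tail of an optimal path is itself optimal, so the cost-to-go of a node can be built from the cost-to-go of its successors. A minimal sketch on a hypothetical four-node graph (node names and edge costs are made up):

```python
# Hypothetical directed graph: edges[node] = [(cost, next_node), ...]
edges = {
    "A": [(1, "B"), (4, "C")],
    "B": [(2, "C"), (6, "D")],
    "C": [(3, "D")],
    "D": [],                       # goal node
}

# cost_to_go[n] = length of the shortest path from n to the goal "D"
cost_to_go = {n: float("inf") for n in edges}
cost_to_go["D"] = 0.0

# Repeated Bellman backups; with non-negative costs, len(edges) - 1 sweeps suffice
for _ in range(len(edges) - 1):
    for n, out in edges.items():
        if out:
            cost_to_go[n] = min(c + cost_to_go[m] for c, m in out)

print(cost_to_go)
```

This is the same backup as value iteration with $\gamma = 1$, costs instead of rewards, and `min` instead of `max`.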

# Lab

1st Example Image
![Example1](resource/1stExample_Lab.png)

In [1]:
# Naive: solve v = (I - gamma*P)^(-1) R directly
import numpy as np

P = [[1, 0, 0, 0],
    [0, 1, 0, 0],
    [0.5, 0, 0.5, 0],
    [0, 1, 0, 0]]

R = [0, 0, 10, 10]

P = np.asmatrix(P)
R = np.asmatrix(R)
R = R.T
gamma = 0.9
v = (np.eye(4) - gamma*P).I*R

print(v)

[[ 0.        ]
 [ 0.        ]
 [18.18181818]
 [10.        ]]


$$ v(s) \leftarrow R(s) + \gamma \sum_{s'\in S}{P(s' \ | \ s,a)v(s')} $$

In [2]:
# Iterative: repeat the Bellman backup v <- R + gamma*P*v
v = np.zeros([4,1])

for _ in range(10):
    v = R + gamma*P*v
    
print(v)

[[ 0.        ]
 [ 0.        ]
 [18.17562716]
 [10.        ]]


### Bellman Optimality!

In [3]:
states = [0, 1, 2, 3]
actions = [0, 1]
# P[s][a] is a list of (probability, next_state) pairs
P = {
    0: {
        0: [(1, 0)],
        1: [(0.5, 0), (0.5, 1)]
    },
    1: {
        0: [(0.5, 0), (0.5, 3)],
        1: [(1, 1)]
    },
    2: {
        0: [(0.5, 0), (0.5, 2)],
        1: [(0.5, 0), (0.5, 1)]
    },
    3: {
        0: [(0.5, 2), (0.5, 3)],
        1: [(1, 1)]
    }
}

R = [0, 0, 10, 10]

gamma = 0.9

In [4]:
P[2][0]

[(0.5, 0), (0.5, 2)]

$$\sum_{s'\in S}{P(s' \ | \ s,a)v(s')} $$

In [5]:
# compute the above summation

#s = 2, a = 0

v = [0, 0, 10, 10]

tmp = 0

for trans in P[2][0]:
    tmp += trans[0]*v[trans[1]]
    
print(tmp)

5.0


In [6]:
# shorten

sum(trans[0]*v[trans[1]] for trans in P[2][0])

5.0

### Value Iteration
$$ v(s) \leftarrow R(s) + \gamma max_a \sum_{s'\in S}{P(s' \ | \ s,a)v(s')}, s \in S $$

In [7]:
# optimal value function

# v = [0]*4
v = [0, 0, 0, 0]

for _ in range(100):
    for s in states:
        q_0 = sum(trans[0]*v[trans[1]] for trans in P[s][0])
        q_1 = sum(trans[0]*v[trans[1]] for trans in P[s][1])

        v[s] = R[s] + gamma*max(q_0, q_1)
    
print(v)

[31.58508953413495, 38.60400287377479, 44.02416232966445, 54.20158563176306]


In [8]:
#shorten
v = [0] * 4

for _ in range(100):
    for s in states:
        v[s] = R[s] + gamma*max([sum(trans[0]*v[trans[1]] for trans in P[s][a]) for a in actions])
    
print(v)

[31.58508953413495, 38.60400287377479, 44.02416232966445, 54.20158563176306]


$$\pi(s) \leftarrow argmax_a \sum_{s'\in S}{P(s' \ | \ s,a)v_{\pi}(s')}$$

In [9]:
# optimal policy
# once v computed

optPolicy = [0]*4

for s in states:
    q_0 = sum(trans[0]*v[trans[1]] for trans in P[s][0])
    q_1 = sum(trans[0]*v[trans[1]] for trans in P[s][1])
    
    optPolicy[s] = np.argmax([q_0, q_1])
    print(q_0, q_1)
    
print(optPolicy)

31.58508953413495 35.09454620395487
42.893337582949 38.60400287377479
37.8046259318997 35.09454620395487
49.11287398071376 38.60400287377479
[1, 0, 0, 0]


In [24]:
# shorten
optPolicy = [0]*4

for s in states:
    optPolicy[s] = np.argmax([sum(trans[0]*v[trans[1]] for trans in P[s][a]) for a in actions])
    
print(optPolicy)

[1, 0, 0, 0]


### Policy Iteration

In [36]:
policy = np.random.randint(0, 2, 4) # low (inclusive), high (exclusive), size
policy

array([1, 0, 0, 1])

In [37]:
def cal_value(policy):
    # Iterative policy evaluation: repeat the Bellman expectation backup under `policy`
    v = [0]*4
    
    for _ in range(100):
        for s in states:
            v[s] = R[s] + gamma*sum(trans[0]*v[trans[1]] for trans in P[s][policy[s]])
    return v

In [38]:
v = cal_value(policy)
print(v)

[16.232464222377942, 19.839678744431364, 31.462925166669155, 27.85571086998823]


In [39]:
for _ in range(100):
    for s in states:
        policy[s] = np.argmax([sum(trans[0]*v[trans[1]] for trans in P[s][a]) for a in actions])
    
    v = cal_value(policy)
    
print(v)
print(policy)

[31.58508953413495, 38.60400287377479, 44.02416232966445, 54.20158563176306]
[1 0 0 0]


# Questions!!!

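1. Given a random policy, why does the value function equal the reward of that state?  
That is not generally the case. In this problem, when the first state (`s = 0` in the code) takes action 0, it stays where it is forever, so its value never gets updated.   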
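A sketch of that situation, assuming the question refers to a state with a probability-1 self-loop such as `s = 0` under action 0 in the Lab's `P`. The all-zeros policy is chosen here only to trigger that case.

```python
# Lab MDP: P[s][a] = [(probability, next_state), ...]
P = {
    0: {0: [(1, 0)], 1: [(0.5, 0), (0.5, 1)]},
    1: {0: [(0.5, 0), (0.5, 3)], 1: [(1, 1)]},
    2: {0: [(0.5, 0), (0.5, 2)], 1: [(0.5, 0), (0.5, 1)]},
    3: {0: [(0.5, 2), (0.5, 3)], 1: [(1, 1)]},
}
R = [0, 0, 10, 10]
gamma = 0.9
states = [0, 1, 2, 3]

def cal_value(policy):
    # Same iterative evaluation as in the Lab
    v = [0] * 4
    for _ in range(100):
        for s in states:
            v[s] = R[s] + gamma * sum(p * v[s2] for p, s2 in P[s][policy[s]])
    return v

# Hypothetical policy in which s = 0 takes action 0, a self-loop with reward 0
v = cal_value([0, 0, 0, 0])
print(v[0])   # stays at R[0] = 0 because v[0] = R[0] + gamma * v[0]
print(v[1:])  # the other states still accumulate discounted reward
```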

**Value function** -> the expected total of the rewards accumulated as transitions keep occurring from the current state, i.e. E(G), the expected return