# Markov Decision Processes

This tutorial is motivated by the excellent lecture by Dorsa Sadigh and Percy Liang: [Stanford CS221: Artificial Intelligence: Principles and Techniques | Autumn 2019](https://www.youtube.com/playlist?list=PLoROMvodv4rO1NB9TD4iUZ3qghGEGtqNX).


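A Markov decision process (MDP) is defined by the following components:

1. $S$: the set of states.
2. $A(s)$: the set of possible actions from state $s$.
3. $T(s,a,s')$: the probability of ending up in state $s'$ after taking action $a$ in state $s$.
4. $R(s,a,s')$: the reward for the transition from $s$ to $s'$ under action $a$.
5. $IsEnd(s)$: whether $s$ is an end (terminal) state.
6. $0 \le \gamma \le 1$: the discount factor.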

## Motivating Example: Transportation Problem
**Problem Setup:**
1. A street has blocks numbered 1 to $n$.
2. Walking from $s$ to $s+1$ takes 1 minute.
3. Taking a magic tram from $s$ to $2s$ takes 2 minutes.
4. __The tram fails with probability 0.5.__ When it fails, the 2 minutes are still spent and we stay at $s$.

=> How do we get from 1 to $n$ in the least expected time?

In [1]:
class TransportationMDP:
    """Street with blocks 1..N; walk (s -> s+1) or take a tram (s -> 2s) that may fail."""

    def __init__(self, N, fail_prob=.5):
        self.N = N                  # number of blocks
        self.fail_prob = fail_prob  # probability that the tram fails

    def startState(self) -> int:
        return 1

    def isEnd(self, state) -> bool:
        return state == self.N

    def actions(self, state):
        # Only actions that stay within the street are allowed.
        result = []
        if state + 1 <= self.N:
            result.append('walk')
        if state * 2 <= self.N:
            result.append('tram')
        return result

    def succProbReward(self, state, action):
        # List of (next state, probability, reward) triples;
        # rewards are negative travel times in minutes.
        result = []
        if action == 'walk':
            result.append((state + 1, 1., -1.))
        elif action == 'tram':
            result.append((state * 2, 1 - self.fail_prob, -2.))  # tram works
            result.append((state, self.fail_prob, -2.))          # tram fails, stay put
        return result

    def discount(self):
        return 1.

    def states(self):
        return range(1, self.N + 1)


mdp = TransportationMDP(10)
print(mdp.actions(3))
print(mdp.actions(9))
print(mdp.succProbReward(3, 'walk'))
print(mdp.succProbReward(4, 'tram'))

['walk', 'tram']
['walk']
[(4, 1.0, -1.0)]
[(8, 0.5, -2.0), (4, 0.5, -2.0)]


# Policy

+ A policy $\pi : S \to A$ maps every state $s \in S$ to one of the actions available there, $\pi(s) \in A(s)$. Following a policy from the start state therefore generates a path; because transitions are random, the path is random too.
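
As a rough illustration, a policy for the transportation problem can be stored as a plain dictionary from states to actions, and following it produces one random path. The sketch below is not part of the original notebook: `succ_prob_reward`, `tram_policy`, and `sample_path` are hypothetical helper names, and the transition model is restated compactly (with $n = 10$ and failure probability 0.5 assumed) so that the snippet runs on its own.

```python
import random

N = 10            # number of blocks, as in the example above (assumed)
FAIL_PROB = 0.5   # probability that the tram fails (assumed)


def succ_prob_reward(state, action):
    """Return (next_state, probability, reward) triples for one step."""
    if action == 'walk':
        return [(state + 1, 1.0, -1.0)]
    return [(state * 2, 1 - FAIL_PROB, -2.0), (state, FAIL_PROB, -2.0)]


# A policy assigns exactly one action to every non-terminal state.
# Illustrative choice: take the tram whenever it stays on the street.
tram_policy = {s: ('tram' if 2 * s <= N else 'walk') for s in range(1, N)}


def sample_path(policy, rng, start=1):
    """Follow `policy` from `start` until the end state; return the steps taken."""
    path, state = [], start
    while state != N:
        action = policy[state]
        outcomes = succ_prob_reward(state, action)
        weights = [p for _, p, _ in outcomes]
        next_state, _, reward = rng.choices(outcomes, weights=weights)[0]
        path.append((state, action, reward, next_state))
        state = next_state
    return path


rng = random.Random(0)  # fixed seed so that the run is reproducible
for step in sample_path(tram_policy, rng):
    print(step)  # (state, action, reward, next_state)
```

Running the loop with different seeds gives different paths for the same policy, which is why a policy is later judged by an expectation rather than by a single path.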

### Utility of a Policy

+ __The utility of a policy $\pi$__ is the discounted sum of the rewards along a path generated by $\pi$. Because the path is random, the utility is a random variable.

### Value of a Policy

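+ __The value of a policy__ is its expected utility:

$$ V_\pi = \mathbb E [U] $$

The expectation can be approximated by averaging over $N$ sampled paths (here $N$ is the number of sampled paths, not the number of blocks):

$$ V_\pi \approx \frac{1}{N} \sum_{i=1} ^N U_i ,$$
where $U_i$ denotes the discounted sum of the rewards on the $i$-th path generated by $\pi$:

$$ U_i = R_1 + \gamma R_2 + \gamma^2 R_3 + \dots $$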
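
The two formulas above can be sketched in a few lines: one function for the discounted sum and one for the sample average. This is only an illustration under stated assumptions: the helper names `utility`, `sample_rewards`, and `estimate_value` are made up for this sketch, and the policy being evaluated ("take the tram whenever possible", with $n = 10$ and failure probability 0.5) is an arbitrary choice.

```python
import random

N = 10            # number of blocks (assumed, as in the example above)
FAIL_PROB = 0.5   # tram failure probability (assumed)


def utility(rewards, gamma):
    """Discounted sum R_1 + gamma * R_2 + gamma^2 * R_3 + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def sample_rewards(rng):
    """Rewards along one random path of the 'tram whenever possible' policy."""
    rewards, state = [], 1
    while state != N:
        if 2 * state <= N:              # take the tram: 2 minutes either way
            rewards.append(-2.0)
            if rng.random() >= FAIL_PROB:
                state *= 2              # the tram worked
        else:                           # walk one block: 1 minute
            rewards.append(-1.0)
            state += 1
    return rewards


def estimate_value(num_paths, gamma=1.0, seed=0):
    """Average the utilities of `num_paths` sampled paths."""
    rng = random.Random(seed)
    total = sum(utility(sample_rewards(rng), gamma) for _ in range(num_paths))
    return total / num_paths


print(utility([-1.0, -2.0, -2.0, -1.0], gamma=0.5))  # -2.625
print(estimate_value(20000))                         # sample average, varies with the seed
```

For this particular policy the exact value at block 1 is $-14$: each of the three tram stages ($1 \to 2 \to 4 \to 8$) needs two attempts of 2 minutes on average, followed by 2 minutes of walking. The sample average should land close to that number.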



## State-Value Function

## $V_\pi (s) = \sum_{s'} T(s,\pi(s),s') \; [R(s,\pi(s),s') + \gamma V_\pi (s')]$

The value of a state $s$ under a policy $\pi$ is the expected utility of following $\pi$ from $s$ onwards.


$$
V_\pi (s) = \begin{cases}
0 & \text{ if } IsEnd(s)\\
Q_\pi (s,\pi(s)) & \text{otherwise} \\
\end{cases}
$$
where $Q_\pi$ is the state-action value function defined next.


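## State-Action-Value Function

The Q-value $Q_\pi (s,a)$ is the expected utility of taking action $a$ in state $s$ and following $\pi$ afterwards.

## $Q_\pi (s,a) = \sum_{s'} T(s,a,s') \; [R(s,a,s') + \gamma V_\pi (s')]$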
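
To make the formula concrete, here is a small sketch that evaluates $Q_\pi(s,a)$ for the transportation problem when $\pi$ is the "always walk" policy, whose value is simply minus the number of remaining blocks. The names `succ_prob_reward`, `q_value`, and `V_walk` are hypothetical, and $n = 10$, failure probability 0.5, and $\gamma = 1$ are assumed to match the example above.

```python
N = 10            # number of blocks (assumed)
FAIL_PROB = 0.5   # tram failure probability (assumed)
GAMMA = 1.0       # discount factor (assumed, as in the example MDP)


def succ_prob_reward(state, action):
    """Return (next_state, probability, reward) triples for one step."""
    if action == 'walk':
        return [(state + 1, 1.0, -1.0)]
    return [(state * 2, 1 - FAIL_PROB, -2.0), (state, FAIL_PROB, -2.0)]


def q_value(V, state, action):
    """Q(s, a) = sum over s' of T(s, a, s') * (R(s, a, s') + gamma * V(s'))."""
    return sum(p * (r + GAMMA * V[s_next])
               for s_next, p, r in succ_prob_reward(state, action))


# Value of the 'always walk' policy: block s is N - s minutes from the end.
V_walk = {s: float(s - N) for s in range(1, N + 1)}

print(q_value(V_walk, 5, 'walk'))  # -5.0
print(q_value(V_walk, 5, 'tram'))  # -4.5
```

Trying the tram once at block 5 and walking afterwards is worth $-4.5$, better than the $-5$ of walking straight away, which hints that "always walk" is not the best policy.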


# Policy Evaluation

1. For all $s \in S$, initialize the value of each state under the policy: $V_\pi ^0 (s)\leftarrow 0$.
2. For each iteration $t = 1, \dots, t_{end}$:
3. For each state $s$:
$$ V_\pi ^t (s) \leftarrow \sum_{s'} T(s,\pi(s),s') \; [R(s,\pi(s),s') + \gamma V_\pi ^{t-1} (s')]$$

Repeat until the largest change between two consecutive iterations is small enough:
$$ \max_{s \in S} \left| V_\pi ^t (s) - V_\pi ^{t-1} (s) \right| \leq \epsilon $$
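
The steps above can be sketched directly in Python. This is a minimal version under stated assumptions rather than the notebook's own code: `policy_evaluation`, `walk_policy`, and `tram_policy` are illustrative names, and the transition model is restated for $n = 10$, failure probability 0.5, and $\gamma = 1$ so that the snippet is self-contained.

```python
N = 10            # number of blocks (assumed)
FAIL_PROB = 0.5   # tram failure probability (assumed)
GAMMA = 1.0       # discount factor (assumed)


def succ_prob_reward(state, action):
    """Return (next_state, probability, reward) triples for one step."""
    if action == 'walk':
        return [(state + 1, 1.0, -1.0)]
    return [(state * 2, 1 - FAIL_PROB, -2.0), (state, FAIL_PROB, -2.0)]


def policy_evaluation(policy, epsilon=1e-10):
    """Iterate the update rule until the largest change is at most epsilon."""
    V = {s: 0.0 for s in range(1, N + 1)}        # step 1: V^0(s) = 0
    while True:
        V_new = {}
        for s in V:                              # step 3: sweep over all states
            if s == N:                           # the end state keeps value 0
                V_new[s] = 0.0
            else:
                V_new[s] = sum(p * (r + GAMMA * V[s_next])
                               for s_next, p, r in succ_prob_reward(s, policy[s]))
        if max(abs(V_new[s] - V[s]) for s in V) <= epsilon:
            return V_new
        V = V_new


walk_policy = {s: 'walk' for s in range(1, N)}
tram_policy = {s: ('tram' if 2 * s <= N else 'walk') for s in range(1, N)}

print(policy_evaluation(walk_policy)[1])  # -9.0
print(policy_evaluation(tram_policy)[1])  # converges towards -14
```

Here "always walk" turns out to be better from block 1 than "tram whenever possible", because each tram stage costs 4 minutes in expectation.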

### Time Complexity $O(t_{end} S S')$
where
1. $t_{end}$ is the number of iterations,
2. $S$ is the number of states,
3. $S'$ is the number of successors (the number of $s'$ with $T(s,a,s')>0$).

# Optimal Value Function


The optimal value $V_* (s)$ is the maximum value attained by any policy from state $s$:

$$
V_* (s) = \begin{cases}
0 & \text{ if } IsEnd(s)\\
\max_{a \in A(s)} Q_* (s,a) & \text{otherwise} \\
\end{cases}
$$
where
## $Q_* (s,a) = \sum_{s'} T(s,a,s') \; [R(s,a,s') + \gamma V_* (s')]$




## Optimal Policy

An optimal policy $\pi_*$ prescribes, for each state, an action that maximizes the optimal Q-value:

### $$ \pi_* (s) = \arg\max_{a \in A(s)} Q_* (s,a)$$
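
As a hedged sketch, the argmax can be read off once $V_*$ is known. The helper names `actions`, `succ_prob_reward`, `q_value`, and `greedy_policy` are illustrative. The table `V_star` is an assumption of this sketch: it was worked out by hand from the recurrence for $n = 10$, failure probability 0.5, and $\gamma = 1$, and is not an output of the notebook.

```python
N = 10            # number of blocks (assumed)
FAIL_PROB = 0.5   # tram failure probability (assumed)
GAMMA = 1.0       # discount factor (assumed)


def actions(state):
    """Actions that keep us on the street."""
    result = []
    if state + 1 <= N:
        result.append('walk')
    if state * 2 <= N:
        result.append('tram')
    return result


def succ_prob_reward(state, action):
    """Return (next_state, probability, reward) triples for one step."""
    if action == 'walk':
        return [(state + 1, 1.0, -1.0)]
    return [(state * 2, 1 - FAIL_PROB, -2.0), (state, FAIL_PROB, -2.0)]


def q_value(V, state, action):
    """Q(s, a) = sum over s' of T(s, a, s') * (R(s, a, s') + gamma * V(s'))."""
    return sum(p * (r + GAMMA * V[s_next])
               for s_next, p, r in succ_prob_reward(state, action))


def greedy_policy(V):
    """pi(s) = argmax over a in A(s) of Q(s, a), for every non-terminal state."""
    return {s: max(actions(s), key=lambda a: q_value(V, s, a))
            for s in range(1, N)}


# Hand-derived optimal values (assumption of this sketch, see the lead-in).
V_star = {1: -8.0, 2: -7.0, 3: -6.0, 4: -5.0, 5: -4.0,
          6: -4.0, 7: -3.0, 8: -2.0, 9: -1.0, 10: 0.0}

print(greedy_policy(V_star))
```

With these values the greedy policy walks everywhere except at block 5, where the tram is preferred: from there a successful ride lands exactly on block 10.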


# Value Iteration [Bellman 1957]

1. For all $s \in S$, initialize the optimal value estimate: $V_* ^0 (s)\leftarrow 0$.
2. For each iteration $t = 1, \dots, t_{end}$:
3. For each state $s$:
$$ V_* ^t (s) \leftarrow \max_{a \in A(s)}\sum_{s'} T(s,a,s') \; [R(s,a,s') + \gamma V_* ^{t-1} (s')]$$

### Time Complexity $O(t_{end} S A S')$
where $A$ is the number of actions per state and $t_{end}$, $S$, $S'$ are as in policy evaluation.


In [2]:
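def display_params(count, mdp, V, pi):
    # Print the current value estimate and greedy action for every state.
    print(f'State \tV(s) \tpi(s)\t at {count}.th iteration')
    for s in mdp.states():
        print(f'{s}\t{V[s]:.3} \t{pi[s]}')
    print('\n')


def value_iteration(mdp, epsilon=1e-10):
    V = {s: 0. for s in mdp.states()}  # V^0(s) = 0 for every state

    def Q(s, a):
        # Expected reward plus discounted value of the successor states.
        return sum(p * (r + mdp.discount() * V[s_next])
                   for s_next, p, r in mdp.succProbReward(s, a))

    def greedy_policy():
        # Greedy action with respect to the current V. On ties, the max over
        # (Q, action) tuples picks the alphabetically larger action name.
        policy = {}
        for s in mdp.states():
            if mdp.isEnd(s):
                policy[s] = 'End'
            else:
                policy[s] = max((Q(s, a), a) for a in mdp.actions(s))[1]
        return policy

    count = 0
    display_params(count, mdp, V, greedy_policy())
    while True:
        Vnew = dict()
        for s in mdp.states():
            if mdp.isEnd(s):
                Vnew[s] = 0.
            else:
                Vnew[s] = max(Q(s, a) for a in mdp.actions(s))
        # Stop once no state value changes by epsilon or more.
        if max(abs(V[state] - Vnew[state]) for state in mdp.states()) < epsilon:
            print('converged!')
            break
        V = Vnew
        count += 1
        display_params(count, mdp, V, greedy_policy())


value_iteration(mdp)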

State 	V(s) 	pi(s)	 at 0.th iteration
1	0.0 	walk
2	0.0 	walk
3	0.0 	walk
4	0.0 	walk
5	0.0 	walk
6	0.0 	walk
7	0.0 	walk
8	0.0 	walk
9	0.0 	walk
10	0.0 	End


State 	V(s) 	pi(s)	 at 1.th iteration
1	-1.0 	walk
2	-1.0 	walk
3	-1.0 	walk
4	-1.0 	walk
5	-1.0 	walk
6	-1.0 	walk
7	-1.0 	walk
8	-1.0 	walk
9	-1.0 	walk
10	0.0 	End


State 	V(s) 	pi(s)	 at 2.th iteration
1	-2.0 	walk
2	-2.0 	walk
3	-2.0 	walk
4	-2.0 	walk
5	-2.0 	walk
6	-2.0 	walk
7	-2.0 	walk
8	-2.0 	walk
9	-1.0 	walk
10	0.0 	End


State 	V(s) 	pi(s)	 at 3.th iteration
1	-3.0 	walk
2	-3.0 	walk
3	-3.0 	walk
4	-3.0 	walk
5	-3.0 	tram
6	-3.0 	walk
7	-3.0 	walk
8	-2.0 	walk
9	-1.0 	walk
10	0.0 	End


State 	V(s) 	pi(s)	 at 4.th iteration
1	-4.0 	walk
2	-4.0 	walk
3	-4.0 	walk
4	-4.0 	walk
5	-3.5 	tram
6	-4.0 	walk
7	-3.0 	walk
8	-2.0 	walk
9	-1.0 	walk
10	0.0 	End


State 	V(s) 	pi(s)	 at 5.th iteration
1	-5.0 	walk
2	-5.0 	walk
3	-5.0 	walk
4	-4.5 	walk
5	-3.75 	tram
6	-4.0 	walk
7	-3.0 	walk
8	-2.0 	walk
9	-1.0 	walk
10	0.0 	