# Markov Decision Process & Bellman Equation

## 1. Markov Decision Process (MDP)

An MDP is a **discrete-time stochastic control process**. It gives a mathematical formulation of sequential decision making.
    
+ stochastic process: a collection of **random variables** indexed by time

By definition, an MDP is a 5-tuple $(\mathcal{S},\mathcal{A},\mathcal{P}_a,\mathcal{R}_a,\gamma)$.

1. **State** ($S \in \mathcal{S}$)

2. **Action** ($A \in \mathcal{A}$)

    - $\mathcal{A}_s$ is the finite set of actions available from state $s$

3. **(State) Transition Probability** ($\mathcal{P}_a$)

    - $\mathcal{P}^a_{ss'} = Pr[S_{t+1} = s' \mid S_t = s, A_t = a]$ 
    
    - the probability that action $a$ in state $s$ at time $t$ will lead to state $s'$ at time $(t+1)$.

4. **Reward Function** ($\mathcal{R}^a_s$)

    - $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$

    - the expected immediate reward received after taking action $a$ in state $s$ (averaged over the possible next states $s'$)
    
5. **Discount Factor** ($\gamma$)

    - it weights future rewards relative to immediate rewards, with $\gamma \in [0, 1]$: the smaller $\gamma$ is, the less future rewards count.

    - used to keep the return finite over an infinite horizon when $\gamma < 1$ and rewards are bounded ($G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$)
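
As a rough illustration of the discounted return $G_t$, the sketch below sums a finite reward sequence backwards using $G_t = R_{t+1} + \gamma G_{t+1}$. The function name `discounted_return`, the constant reward of 1, and $\gamma = 0.9$ are assumptions chosen for the example; with them, the return approaches the geometric-series limit $1/(1-\gamma)$ instead of growing without bound.

```python
def discounted_return(rewards, gamma):
    """Return G_t = sum_k gamma**k * rewards[k] for a finite reward sequence."""
    g = 0.0
    # Accumulate from the last reward backwards: G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical episode: a reward of 1 at each of 1000 steps.
rewards = [1.0] * 1000
gamma = 0.9

g = discounted_return(rewards, gamma)
bound = 1.0 / (1.0 - gamma)  # limit of the geometric series 1 + gamma + gamma^2 + ...
print(g, bound)
```

With $\gamma = 1$ the same sequence would sum to 1000 and keep growing with the horizon, which is the problem the discount factor avoids.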
    
**Policy** ($\pi$)

- The core problem of MDPs is to find a **policy** for the decision maker.

- $\pi(a \mid s) = Pr [A_t = a \mid S_t =s ]$

- the mapping from states to probabilities of selecting each possible action.
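
A stochastic policy can be stored as a table of action probabilities per state. The following is a minimal sketch under that assumption; the state names, action names, and the helper `sample_action` are hypothetical and only serve to show how $A_t \sim \pi(\cdot \mid S_t = s)$ might be sampled.

```python
import random

# Hypothetical policy table: policy[s][a] = pi(a | s).
policy = {
    "s0": {"stay": 0.2, "move": 0.8},
    "s1": {"stay": 1.0},
}


def sample_action(policy, state, rng):
    """Draw one action from the distribution pi(. | state)."""
    actions = list(policy[state].keys())
    probs = list(policy[state].values())
    return rng.choices(actions, weights=probs, k=1)[0]


rng = random.Random(0)  # seeded generator so the run is reproducible
action = sample_action(policy, "s0", rng)
print(action)
```

A deterministic policy is the special case where one action in each state has probability 1, as for `"s1"` here.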

## 2. Value Function

Almost all RL algorithms involve estimating **value functions**.

A **value function** is a function of states that estimates how good it is for the agent to be in a given state.

The notion of "how good" is defined in terms of **expected return**.

$$
\begin{aligned}
v_{\pi}(s) &= \mathbb{E}_{\pi}[G_t \mid S_t =s ] \\[10pt]
&= \mathbb{E}_{\pi}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t =s] \\[10pt]
&= \mathbb{E}_{\pi}[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t =s ] \\[10pt]
&= \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t =s ] \\[15pt]
\end{aligned}
$$

Similarly, the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $q_{\pi}(s,a)$, is defined as:

$$
\begin{aligned}
q_{\pi}(s,a) &= \mathbb{E}_{\pi}[G_t \mid S_t =s, A_t = a] \\[10pt]
&= \mathbb{E}_{\pi}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t =s, A_t = a] \\[10pt]
&= \mathbb{E}_{\pi}[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t =s, A_t = a] \\[10pt]
&= \mathbb{E}_{\pi}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1}) \mid S_t =s, A_t = a]
\end{aligned}
$$
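
The two definitions are linked: averaging $q_{\pi}$ over the policy gives $v_{\pi}$, and a one-step look-ahead over the dynamics gives $q_{\pi}$ back in terms of $v_{\pi}$. Written with the transition probability $\mathcal{P}^a_{ss'}$ and reward function $\mathcal{R}^a_s$ from Section 1, this can be sketched as:

$$
\begin{aligned}
v_{\pi}(s) &= \sum_{a \in \mathcal{A}} \pi(a \mid s) \, q_{\pi}(s, a) \\[10pt]
q_{\pi}(s, a) &= \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_{\pi}(s')
\end{aligned}
$$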

## 3. Bellman Equation

### Bellman Expectation Equation

The Bellman expectation equation expresses the **relationship** between the value of the present state and the values of the possible next states.

$$
\begin{aligned}
v_{\pi}(s)  &= \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} \mid S_t =s ] \\[10pt]
&= \sum_{a \in \mathcal{A}} \pi (a \mid s) \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}_a} Pr(s', r \mid s, a) \bigl[ r + \gamma \mathbb{E}_{\pi} [G_{t+1} \mid S_{t+1} = s'] \bigr] \\[10pt]
&= \sum_{a \in \mathcal{A}} \pi (a \mid s) \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}_a} Pr(s', r \mid s, a) \bigl[ r + \gamma v_{\pi}(s') \bigr] \\[10pt]
& \text{(in terms of the reward function } \mathcal{R}^a_s \text{ and the transition probability } \mathcal{P}^a_{ss'} \text{ defined above)} \\[10pt]
&= \sum_{a \in \mathcal{A}} \pi (a \mid s) \bigl(\mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_{\pi}(s')\bigr) \\[10pt]
& \text{(if we further assume deterministic transitions, where action } a \text{ in state } s \text{ always leads to a single successor } s' \text{)} \\[10pt]
&= \sum_{a \in \mathcal{A}} \pi (a \mid s) \bigl(\mathcal{R}^a_s + \gamma v_{\pi}(s')\bigr)
\end{aligned}
$$
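
Because the Bellman expectation equation defines $v_{\pi}$ as a fixed point, one common way to compute it is to apply the right-hand side repeatedly until the values stop changing (iterative policy evaluation). The sketch below assumes a hypothetical two-state MDP; the tables `P`, `R`, `PI`, the function name `policy_evaluation`, and the tolerance are all illustrative choices rather than part of the notes above.

```python
GAMMA = 0.9

# Hypothetical MDP: P[s][a] maps next state -> probability,
# R[s][a] is the expected immediate reward for action a in state s.
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 1.0}},
}
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 2.0, "move": 0.0},
}
# Uniform random policy: PI[s][a] = pi(a | s).
PI = {
    "s0": {"stay": 0.5, "move": 0.5},
    "s1": {"stay": 0.5, "move": 0.5},
}


def bellman_backup(P, R, pi, gamma, v, s):
    """Right-hand side of the Bellman expectation equation for state s."""
    return sum(
        pi[s][a] * (R[s][a] + gamma * sum(p * v[s2] for s2, p in P[s][a].items()))
        for a in P[s]
    )


def policy_evaluation(P, R, pi, gamma, tol=1e-10):
    """Iterate the Bellman expectation backup until the largest change is below tol."""
    v = {s: 0.0 for s in P}
    while True:
        new_v = {s: bellman_backup(P, R, pi, gamma, v, s) for s in P}
        delta = max(abs(new_v[s] - v[s]) for s in P)
        v = new_v
        if delta < tol:
            return v


v_pi = policy_evaluation(P, R, PI, GAMMA)
print(v_pi)
```

For a small MDP like this one, the same values could also be obtained by solving the linear system directly; the iteration is shown because it mirrors the equation term by term.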

### Bellman Optimality Equation

The Bellman expectation equation only gives **the true value function under the present policy**. 

When the goal is to find an **optimal policy**, we would like to find the **optimal value function**.

$$
\begin{aligned}
v^{*}(s) &= \max_{a \in \mathcal{A}_s} q^{*} (s, a) \\[10pt]
&= \max_{a \in \mathcal{A}_s} \mathbb{E} [R_{t+1} + \gamma v^{*}(S_{t+1}) \mid S_t = s, A_t = a] \quad (\ast) \\[10pt]
&= \max_{a \in \mathcal{A}_s} \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}_a} Pr(s', r \mid s, a) \bigl[ r + \gamma v^{*}(s') \bigr] \\[10pt]
&(\ast) \text{ uses } q^{*} (s, a) = \mathbb{E} [R_{t+1} + \gamma v^{*} (S_{t+1}) \mid S_t = s, A_t = a]
\end{aligned}
$$

**Bellman optimality equation for Q functions** is as follows.

$$
\begin{aligned}
q^{*}(s,a) &= \mathbb{E} [R_{t+1} + \gamma \max_{a'} q^{*} (S_{t+1}, a') \mid S_t = s, A_t = a] \\[10pt]
&= \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}_a} Pr(s', r \mid s, a) \bigl[ r + \gamma \max_{a'} q^{*} (s', a') \bigr] \\[10pt]
\end{aligned}
$$

**Dynamic programming** (DP) solves problems formulated as MDPs **by repeatedly applying** the Bellman expectation and optimality equations, which requires the transition probabilities and rewards to be known.
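
One DP method built on the Bellman optimality equation is value iteration: replace the equality with an update and repeat until convergence, then read off a greedy policy. The sketch below assumes a hypothetical two-state MDP with known dynamics; the tables `P` and `R`, the function name `value_iteration`, and the tolerance are illustrative assumptions.

```python
GAMMA = 0.9

# Hypothetical MDP: P[s][a] maps next state -> probability,
# R[s][a] is the expected immediate reward for action a in state s.
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 1.0}},
}
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 2.0, "move": 0.0},
}


def q_from_v(P, R, gamma, v):
    """One-step look-ahead: q(s, a) = R[s][a] + gamma * sum_s' P(s' | s, a) * v(s')."""
    return {
        s: {
            a: R[s][a] + gamma * sum(p * v[s2] for s2, p in P[s][a].items())
            for a in P[s]
        }
        for s in P
    }


def value_iteration(P, R, gamma, tol=1e-10):
    """Apply v(s) <- max_a q(s, a) until the largest change is below tol."""
    v = {s: 0.0 for s in P}
    while True:
        q = q_from_v(P, R, gamma, v)
        new_v = {s: max(q[s].values()) for s in P}
        delta = max(abs(new_v[s] - v[s]) for s in P)
        v = new_v
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    q = q_from_v(P, R, gamma, v)
    greedy = {s: max(q[s], key=q[s].get) for s in P}
    return v, greedy


v_star, greedy_policy = value_iteration(P, R, GAMMA)
print(v_star, greedy_policy)
```

In this example, staying in `s1` earns a reward of 2 at every step, so $v^{*}(s_1) = 2 / (1 - \gamma) = 20$, and the greedy policy moves from `s0` toward `s1`.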