# Markov decision process

A Markov decision process (MDP) is composed of four elements:

1. State space $\mathcal{S}$
2. Action space $\mathcal{A}$
3. Transition dynamics $p(\mathbf{s}' \mid \mathbf{s}, \mathbf{a})$
4. Reward function $r(\mathbf{s}, \mathbf{a})$
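For a finite MDP, the four elements above can be stored as plain arrays. The following is a minimal sketch, assuming a tabular setting in which the transition dynamics reduce to $p(\mathbf{s}' \mid \mathbf{s}, \mathbf{a})$ and the reward is a deterministic function $r(\mathbf{s}, \mathbf{a})$. The array names `P` and `R` and all numbers are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical toy MDP with 2 states and 2 actions (all numbers are made up).
n_states, n_actions = 2, 2

# P[s, a, s_next] = p(s_next | s, a): probability of reaching s_next
# after taking action a in state s.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # transitions out of state 0
    [[0.5, 0.5], [0.0, 1.0]],  # transitions out of state 1
])

# R[s, a] = r(s, a): reward for taking action a in state s.
R = np.array([
    [0.0, 1.0],
    [2.0, -1.0],
])

# Every (s, a) pair must define a valid distribution over next states.
assert P.shape == (n_states, n_actions, n_states)
assert np.allclose(P.sum(axis=-1), 1.0)
```

Continuous state or action spaces would need a different representation, such as a simulator function instead of lookup tables.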

An agent interacts with the MDP by starting in a state $\mathbf{s} \in \mathcal{S}$, taking an action $\mathbf{a} \in \mathcal{A}$, receiving a reward $r(\mathbf{s}, \mathbf{a})$, and transitioning to a new state $\mathbf{s}'$ according to the transition dynamics $p(\mathbf{s}' \mid \mathbf{s}, \mathbf{a})$. Repeating this process yields a trajectory $\tau = (\mathbf{s}_0, \mathbf{a}_0, r_0, \mathbf{s}_1, \mathbf{a}_1, r_1, \ldots)$, which may continue indefinitely. Given any trajectory $\tau$, we define the reward associated with it as the discounted sum

$$r(\tau) = \sum_{t\geq 0} \gamma^t r_t$$

where $\gamma \in (0, 1)$ is the discount factor, which keeps $r(\tau)$ finite as long as the per-step rewards are bounded. A trajectory is a random variable whose distribution is induced by a policy $\pi(\mathbf{a} \mid \mathbf{s})$, which maps a state to a distribution over actions. When following $\pi$ in the MDP, the trajectory distribution is given by

$$p_{\pi}(\tau) = p(\mathbf{s}_0)\prod_{t\geq 0}\pi(\mathbf{a}_t|\mathbf{s}_t) p(\mathbf{s}_{t+1}|\mathbf{s}_t, \mathbf{a}_t)$$
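The factorization above reads directly as a sampling procedure: draw the initial state, then alternate between drawing an action from the policy and a next state from the transition dynamics. The sketch below assumes a small tabular MDP and truncates the trajectory at a finite horizon. The helpers `sample_trajectory` and `discounted_return`, the arrays, and all numbers are hypothetical.

```python
import numpy as np

def sample_trajectory(p0, P, R, pi, horizon, rng):
    """Sample (s_t, a_t, r_t) for t = 0..horizon-1 under policy pi."""
    s = rng.choice(len(p0), p=p0)               # s_0 ~ p(s_0)
    trajectory = []
    for _ in range(horizon):
        a = rng.choice(pi.shape[1], p=pi[s])    # a_t ~ pi(. | s_t)
        r = R[s, a]                             # r_t = r(s_t, a_t)
        trajectory.append((int(s), int(a), float(r)))
        s = rng.choice(P.shape[2], p=P[s, a])   # s_{t+1} ~ p(. | s_t, a_t)
    return trajectory

def discounted_return(rewards, gamma):
    """r(tau) = sum_t gamma^t r_t, truncated to the rewards provided."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Hypothetical 2-state, 2-action MDP (made-up numbers).
p0 = np.array([1.0, 0.0])                       # always start in state 0
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])        # P[s, a, s_next]
R = np.array([[0.0, 1.0],
              [2.0, -1.0]])                     # R[s, a]
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])                     # pi[s, a], uniform policy

rng = np.random.default_rng(0)
tau = sample_trajectory(p0, P, R, pi, horizon=5, rng=rng)
ret = discounted_return([r for _, _, r in tau], gamma=0.9)
print(tau, ret)
```

Because the trajectory is truncated, `ret` only approximates the infinite sum. The neglected tail shrinks geometrically with the horizon when rewards are bounded.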

where $p(\mathbf{s}_0)$ is the initial state distribution. Under this trajectory distribution, the expected reward associated with the policy $\pi$ is defined as

$$\eta(\pi) = \mathbb{E}_{\tau\sim p_{\pi}(\tau)}[r(\tau)]$$

The goal in an MDP is to find a policy that maximizes the expected reward.
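One simple way to compare policies by their expected reward is to average the discounted reward over many sampled trajectories. The following is a rough Monte Carlo sketch, not a method prescribed by this text. It assumes a tabular MDP and truncated rollouts, and the function `estimate_eta` and all numbers are hypothetical.

```python
import numpy as np

def estimate_eta(p0, P, R, pi, gamma, horizon, n_episodes, rng):
    """Monte Carlo estimate of eta(pi) from truncated rollouts."""
    returns = np.empty(n_episodes)
    for i in range(n_episodes):
        s = rng.choice(len(p0), p=p0)              # s_0 ~ p(s_0)
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = rng.choice(pi.shape[1], p=pi[s])   # a_t ~ pi(. | s_t)
            total += discount * R[s, a]            # add gamma^t * r_t
            discount *= gamma
            s = rng.choice(P.shape[2], p=P[s, a])  # s_{t+1} ~ p(. | s_t, a_t)
        returns[i] = total
    return returns.mean()

# Hypothetical 2-state, 2-action MDP (made-up numbers).
p0 = np.array([0.5, 0.5])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])           # P[s, a, s_next]
R = np.array([[0.0, 1.0],
              [2.0, -1.0]])                        # R[s, a]

pi_uniform = np.full((2, 2), 0.5)                  # picks actions at random
pi_first = np.array([[1.0, 0.0], [1.0, 0.0]])      # always takes action 0

rng = np.random.default_rng(0)
for name, pi in [("uniform", pi_uniform), ("always action 0", pi_first)]:
    eta_hat = estimate_eta(p0, P, R, pi, gamma=0.9, horizon=50,
                           n_episodes=200, rng=rng)
    print(name, eta_hat)
```

The estimate is noisy and slightly biased by the truncation, so it serves only as a sanity check for comparing a handful of candidate policies.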

## Value function and action value function

Two quantities of particular interest in an MDP are the value function $V_{\pi}(\mathbf{s})$ and the state-action value function $Q_{\pi}(\mathbf{s}, \mathbf{a})$. The value function is defined as the expected reward obtained by starting from state $\mathbf{s}$ and following $\pi$:

$$V_{\pi}(\mathbf{s}) = \mathbb{E}_{\tau\sim p_{\pi}(\tau)|\mathbf{s}_0=\mathbf{s}}[r(\tau)|\mathbf{s}_0=\mathbf{s}]$$

Applying the law of total expectation, we can relate this to the expected reward:

$$
\begin{align*}
\eta(\pi) &= \mathbb{E}_{\mathbf{s}_0\sim p(\mathbf{s}_0)}[\mathbb{E}_{\tau\sim p_{\pi}(\tau)|\mathbf{s}_0}[r(\tau)|\mathbf{s}_0]]\\
&= \mathbb{E}_{\mathbf{s}_0\sim p(\mathbf{s}_0)}[V_{\pi}(\mathbf{s}_0)]
\end{align*}
$$
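For a finite state space, the outer expectation is a weighted sum of state values under the initial state distribution. A small numerical illustration follows. The values in `V` and `p0` are made up, not computed from any particular MDP.

```python
import numpy as np

# Hypothetical values V_pi(s) for a 3-state MDP (made-up numbers).
V = np.array([1.0, 4.0, 10.0])

# Hypothetical initial state distribution p(s_0).
p0 = np.array([0.5, 0.25, 0.25])

# eta(pi) = sum_s p(s_0 = s) * V_pi(s)
eta = float(p0 @ V)
print(eta)  # 0.5*1 + 0.25*4 + 0.25*10 = 4.0
```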

Since the initial state distribution $p(\mathbf{s}_0)$ is fixed by the MDP and cannot be optimized, a policy that maximizes the value function at every state $\mathbf{s} \in \mathcal{S}$ also maximizes the expected reward. The state-action value function, on the other hand, is defined as the expected reward obtained by starting from state $\mathbf{s}$, taking action $\mathbf{a}$, and following $\pi$ afterwards:

$$Q_{\pi}(\mathbf{s}, \mathbf{a}) = \mathbb{E}_{\tau\sim p_{\pi}(\tau)|\mathbf{s}_0=\mathbf{s}, \mathbf{a}_0=\mathbf{a}}[r(\tau)|\mathbf{s}_0=\mathbf{s}, \mathbf{a}_0=\mathbf{a}]$$

Applying the law of total expectation again, we can relate the state-action value function to the value function:

$$
\begin{align*}
V_{\pi}(\mathbf{s}) &= \mathbb{E}_{\mathbf{a}\sim \pi(\cdot|\mathbf{s})}[ \mathbb{E}_{\tau\sim p_{\pi}(\tau)|\mathbf{s}_0=\mathbf{s}, \mathbf{a}_0=\mathbf{a}}[r(\tau)|\mathbf{s}_0=\mathbf{s}, \mathbf{a}_0=\mathbf{a}]]\\
&= \mathbb{E}_{\mathbf{a}\sim \pi(\cdot|\mathbf{s})}[Q_{\pi}(\mathbf{s}, \mathbf{a})]
\end{align*}
$$
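With finitely many actions, this expectation is a policy-weighted average over each row of a $Q$ table. The snippet below illustrates the identity with made-up numbers. The arrays `Q` and `pi` are hypothetical and are not derived from a specific MDP.

```python
import numpy as np

# Hypothetical Q_pi(s, a) table: 2 states (rows) x 2 actions (columns).
Q = np.array([[1.0, 3.0],
              [0.0, 2.0]])

# Hypothetical policy pi(a | s); each row is a distribution over actions.
pi = np.array([[0.5, 0.5],
               [0.25, 0.75]])

# V_pi(s) = sum_a pi(a | s) * Q_pi(s, a)
V = (pi * Q).sum(axis=1)
print(V)  # state 0: 0.5*1 + 0.5*3 = 2.0, state 1: 0.25*0 + 0.75*2 = 1.5
```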

In the next section, we will introduce a simple algorithm for estimating the optimal policy using these two functions.