# Chapter 3 

# Finite Markov Decision Processes

The problem of finite MDPs involves evaluative feedback, as in bandits, but also an associative aspect: choosing different actions in different situations. MDPs are a classical formalisation of sequential decision making, where actions influence not just immediate rewards but also subsequent situations, or states, and through those, future rewards.

## 3.1 The Agent-Environment Interface

MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal.

![](https://drive.google.com/thumbnail?id=1GRXp8d1oNqq3vCJ6TqiAQWVr9VtXnf6P)

The MDP and agent together give rise to a sequence or _trajectory_ like this:

$$
S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots .
$$

In a _finite_ MDP, the sets of states, actions, and rewards ($\mathcal{S}$, $\mathcal{A}$, and $\mathcal{R}$) all have a finite number of elements. 

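Given particular values $s' \in \mathcal{S}$ and $r \in \mathcal{R}$, the probability of those values occurring at time $t$, given the preceding state $s$ and action $a$, is: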

$$
p(s', r | s, a) \doteq Pr\{S_t = s', R_t = r | S_{t-1} = s, A_{t-1} = a \},
$$

for all $s', s \in \mathcal{S}$, $r \in \mathcal{R}$, and $a \in \mathcal{A}(s)$.*

_*The action set is written $\mathcal{A}(s)$ because the action is selected based on the state $s$._

The function _p_ defines the _dynamics_ of the MDP. 

_State-transition probabilities_, $p : \mathcal{S} \times \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$, where

$$
p(s' | s, a) \doteq Pr\{S_t = s' | S_{t-1} = s, A_{t-1} = a \} = \sum_{r \in \mathcal{R}} p(s', r | s, a).
$$

_Expected rewards for state-action pairs_, $r : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, where

$$
r(s, a) \doteq \mathbb{E}[R_t | S_{t-1} = s, A_{t-1} = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r | s, a).
$$

_Expected rewards for state-action-next-state triples_, $r : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$, where

$$
r(s, a, s') \doteq \mathbb{E}[R_t | S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r \in \mathcal{R}} r \frac{p(s', r | s, a)}{p(s' | s, a)}
$$
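The three derived quantities above can all be computed from the four-argument dynamics function $p$. The following is a minimal sketch, assuming a small made-up MDP stored as a dictionary; the state names, action names, probabilities, and function names are hypothetical and chosen only for illustration.

```python
# Hypothetical dynamics p(s', r | s, a) for a tiny MDP (illustrative values only).
# dynamics[(s, a)] maps each (next_state, reward) pair to its probability.
dynamics = {
    ("s0", "go"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.25, ("s1", 5.0): 0.25},
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s1", "go"): {("s0", 2.0): 1.0},
}

# p is a probability distribution over (s', r) for every (s, a) pair.
for outcomes in dynamics.values():
    assert abs(sum(outcomes.values()) - 1.0) < 1e-12


def transition_prob(s_next, s, a):
    """p(s' | s, a): sum p(s', r | s, a) over all rewards r."""
    return sum(p for (sn, _), p in dynamics[(s, a)].items() if sn == s_next)


def expected_reward(s, a):
    """r(s, a): sum of r * p(s', r | s, a) over all s' and r."""
    return sum(r * p for (_, r), p in dynamics[(s, a)].items())


def expected_reward_given_next(s, a, s_next):
    """r(s, a, s'): expected reward conditioned on landing in s'."""
    p_next = transition_prob(s_next, s, a)
    if p_next == 0.0:
        raise ValueError("s' is unreachable from (s, a); r(s, a, s') is undefined")
    weighted = sum(r * p for (sn, r), p in dynamics[(s, a)].items() if sn == s_next)
    return weighted / p_next


print(transition_prob("s1", "s0", "go"))             # p(s1 | s0, go)
print(expected_reward("s0", "go"))                   # r(s0, go)
print(expected_reward_given_next("s0", "go", "s1"))  # r(s0, go, s1)
```

Storing only the joint distribution over $(s', r)$ keeps a single source of truth; the marginal and conditional quantities are then derived on demand, mirroring the definitions above.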

Sensory receptors of an agent should be considered part of the environment rather than part of the agent. Rewards, too, are computed inside the artificial learning system but are considered external to the agent. 

Anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment. The agent-environment boundary represents the limit of the agent's _absolute control_, not of its knowledge.

## 3.3 Returns and Episodes

__Return__:

$$
G_t \doteq R_{t+1} + R_{t+2} + \dots + R_{T},
\tag{3.7}
$$

where $T$ is a final time step.

__Episodes__: natural subsequences of agent-environment interaction, e.g. plays of a game.

__Terminal state__: special state that ends an episode. This is followed by a reset to a standard starting state.

__Episodic task__: a task with episodes that can all be considered to end in the same terminal state, with different rewards for the different outcomes.

__$\mathcal{S}$__: set of all nonterminal states

__$\mathcal{S}^+$__: set of all nonterminal states plus the terminal state

__Discounted return__:

$$
\begin{align*}
G_t & \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \\
& = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},
\end{align*}
\tag{3.8}
$$

where $\gamma$, $0 \leq \gamma \leq 1$, is the _discount rate_.

__Discount rate__: parameter that determines the present value of future rewards.
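To make the discounted return concrete, here is a minimal sketch that evaluates the sum for a finite reward sequence. It relies on the recursive form $G_t = R_{t+1} + \gamma G_{t+1}$, which follows from factoring $\gamma$ out of every term after the first; the function names and the reward values are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """Return G_t for rewards = [R_{t+1}, R_{t+2}, ..., R_T].

    Works backwards through the sequence using G_t = R_{t+1} + gamma * G_{t+1},
    with the return after the final step taken to be zero.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def all_returns(rewards, gamma):
    """Return [G_0, G_1, ..., G_{T-1}] for one episode's reward sequence."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


rewards = [1.0, 2.0, 4.0]  # hypothetical R_1, R_2, R_3

print(discounted_return(rewards, 0.5))  # 1 + 0.5 * 2 + 0.25 * 4
print(discounted_return(rewards, 1.0))  # undiscounted sum, as in (3.7)
print(discounted_return(rewards, 0.0))  # only the immediate reward counts
print(all_returns(rewards, 0.5))
```

With $\gamma = 0$ the agent values only the immediate reward, while with $\gamma = 1$ on a finite episode the discounted return reduces to the plain sum in (3.7).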



## 3.5 Policies and Value Functions 

The _value function_ of a state $s$ under a policy $\pi$, denoted $v_\pi(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter. For MDPs, we can define $v_\pi$ formally by

$$
v_\pi(s) \doteq \mathbb{E}_\pi[G_t | S_t = s] = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \middle| S_t = s \right], \text{ for all $s \in \mathcal{S}$,}
$$

where $\mathbb{E}_\pi[\cdot]$ denotes the expected value of a random variable given that the agent follows policy $\pi$, and $t$ is any time step. Note that the value of the terminal state, if any, is always zero. We call the function $v_\pi$ the _state-value function for policy $\pi$_.

Define the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $q_\pi(s, a)$, as the expected return starting from $s$, taking the action $a$, and thereafter following policy $\pi$:

$$
q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t | S_t = s, A_t = a] = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty}\gamma^k R_{t+k+1} \middle| S_t = s, A_t = a \right].
\tag{3.13}
$$

Call $q_\pi$ the _action-value function for policy $\pi$_.
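As an aside that goes slightly beyond the notes above: writing $\pi(a | s)$ for the probability that policy $\pi$ selects action $a$ in state $s$, the two value functions can be expressed in terms of each other. A sketch of the standard identities, assuming the definitions of $v_\pi$, $q_\pi$, and the dynamics function $p$ given earlier:

$$
\begin{align*}
v_\pi(s) & = \sum_{a} \pi(a | s) \, q_\pi(s, a), \\
q_\pi(s, a) & = \sum_{s', r} p(s', r | s, a) \left[ r + \gamma v_\pi(s') \right].
\end{align*}
$$

The first line averages the action values under the policy; the second takes one step through the environment and then continues from the next state with $v_\pi$.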

For any policy $\pi$ and any state $s$, the following consistency condition holds between the value of $s$ and the value of its possible successor states (the _Bellman equation for $v_\pi$_):

$$
\begin{align*}
v_\pi(s) & \doteq \mathbb{E}_\pi \left[ G_t \middle| S_t = s \right] \\
& = \sum_{a} \pi \left( a \middle| s \right) \sum_{s', r} p \left( s', r \middle| s, a \right) \left[ r + \gamma v_\pi(s') \right], \text{ for all $s \in \mathcal{S}$,} 
\tag{3.14}
\end{align*}
$$

The equation states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way. 
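One way to see the Bellman equation at work is to treat its right-hand side as an update rule and apply it repeatedly until the values stop changing. This technique (usually called iterative policy evaluation) is not covered in these notes; the sketch below uses it only to produce values that satisfy (3.14). The MDP, the policy, and all names are hypothetical.

```python
GAMMA = 0.5

# Hypothetical dynamics: dynamics[(s, a)] maps (next_state, reward) -> probability.
dynamics = {
    ("A", "left"): {("A", -1.0): 1.0},
    ("A", "right"): {("B", 0.0): 1.0},
    ("B", "exit"): {("T", 10.0): 1.0},
}

# Hypothetical policy: policy[s][a] = pi(a | s). "T" is terminal and has no actions.
policy = {
    "A": {"left": 0.5, "right": 0.5},
    "B": {"exit": 1.0},
}


def bellman_backup(s, v, policy, dynamics, gamma):
    """Right-hand side of the Bellman equation for v_pi at state s."""
    total = 0.0
    for a, pi_a in policy[s].items():
        for (s_next, r), p in dynamics[(s, a)].items():
            total += pi_a * p * (r + gamma * v[s_next])
    return total


def evaluate_policy(policy, dynamics, gamma, terminal_states, tol=1e-10):
    """Sweep the Bellman backup over all nonterminal states until convergence."""
    v = {s: 0.0 for s in policy}
    v.update({s: 0.0 for s in terminal_states})  # terminal value is always zero
    while True:
        delta = 0.0
        for s in policy:
            new_value = bellman_backup(s, v, policy, dynamics, gamma)
            delta = max(delta, abs(new_value - v[s]))
            v[s] = new_value
        if delta < tol:
            return v


v = evaluate_policy(policy, dynamics, GAMMA, terminal_states=["T"])
print(v)

# Consistency check: each value equals its own one-step backup.
for s in policy:
    assert abs(v[s] - bellman_backup(s, v, policy, dynamics, GAMMA)) < 1e-8
```

For this toy problem the equation can also be solved by hand: $v_\pi(B) = 10$, and $v_\pi(A) = 0.5(-1 + 0.5\, v_\pi(A)) + 0.5(0 + 0.5 \cdot 10)$ gives $v_\pi(A) = 8/3$.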

### Example 3.6: Golf

- Reward: -1 for each stroke until we hit the ball into the hole.
- State: location of the ball.
- Value of a state: negative of the number of strokes to the hole from that location.
- Actions: how we aim and swing at the ball and which club we select.