# Chapter 3 

# Finite Markov Decision Processes

The problem of finite MDPs involves evaluative feedback, as in bandits, but also an associative aspect: choosing different actions in different situations. MDPs are a classical formalisation of sequential decision making, where actions influence not just immediate rewards but also subsequent situations, or states, and through those, future rewards.

## 3.1 The Agent-Environment Interface

MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal.

![](https://drive.google.com/thumbnail?id=1GRXp8d1oNqq3vCJ6TqiAQWVr9VtXnf6P)

The MDP and agent together give rise to a sequence or _trajectory_ like this:

$$
S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots .
$$
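To make the ordering of the trajectory concrete, here is a minimal sketch of the interaction loop. The two-state table `DYNAMICS`, the helper names `env_step` and `rollout`, and the uniform-random action choice are all illustrative assumptions, not part of the chapter.

```python
import random

# Hypothetical toy dynamics: DYNAMICS[(s, a)] maps (s', r) -> probability.
DYNAMICS = {
    ("high", "search"): {("high", 1.0): 0.5, ("high", 2.0): 0.25, ("low", 1.0): 0.25},
    ("high", "wait"): {("high", 0.0): 1.0},
    ("low", "search"): {("low", 1.0): 0.5, ("high", -3.0): 0.5},
    ("low", "wait"): {("low", 0.0): 1.0},
}
ACTIONS = ("search", "wait")


def env_step(state, action, rng):
    """Sample (next_state, reward) for the given state and action."""
    outcomes = DYNAMICS[(state, action)]
    pairs = list(outcomes)
    weights = [outcomes[pair] for pair in pairs]
    return rng.choices(pairs, weights=weights, k=1)[0]


def rollout(start_state, num_steps, seed=0):
    """Return the flat trajectory [S_0, A_0, R_1, S_1, A_1, R_2, ...]."""
    rng = random.Random(seed)
    state = start_state
    trajectory = [state]
    for _ in range(num_steps):
        action = rng.choice(ACTIONS)  # placeholder policy: uniform random
        next_state, reward = env_step(state, action, rng)
        trajectory += [action, reward, next_state]
        state = next_state
    return trajectory


trajectory = rollout("high", num_steps=3)
print(trajectory)
```

Note that the reward produced by action $A_t$ is indexed $R_{t+1}$: it arrives together with the next state $S_{t+1}$, which is why `env_step` returns both at once.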

In a _finite_ MDP, the sets of states, actions, and rewards ($\mathcal{S}$, $\mathcal{A}$, and $\mathcal{R}$) all have a finite number of elements. 

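For particular values $s' \in \mathcal{S}$ and $r \in \mathcal{R}$, the probability of those values occurring at time $t$, given the preceding state and action, is: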

$$
p(s', r | s, a) \doteq Pr\{S_t = s', R_t = r | S_{t-1} = s, A_{t-1} = a \},
$$

for all $s', s \in \mathcal{S}$, $r \in \mathcal{R}$, and $a \in \mathcal{A}(s)$.*

_*The action set is written $\mathcal{A}(s)$ because the actions available can depend on the state $s$._

The function _p_ defines the _dynamics_ of the MDP. 
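Because $p$ is a probability distribution over pairs $(s', r)$ for each choice of $s$ and $a$, its values must sum to 1 for every state-action pair. The sketch below stores a small dynamics table and checks that property. The table `DYNAMICS` and the names `p` and `is_normalized` are hypothetical, chosen only for illustration.

```python
import math

# Hypothetical toy dynamics: DYNAMICS[(s, a)] maps (s', r) -> probability.
DYNAMICS = {
    ("high", "search"): {("high", 1.0): 0.5, ("high", 2.0): 0.25, ("low", 1.0): 0.25},
    ("high", "wait"): {("high", 0.0): 1.0},
    ("low", "search"): {("low", 1.0): 0.5, ("high", -3.0): 0.5},
    ("low", "wait"): {("low", 0.0): 1.0},
}


def p(next_state, reward, state, action):
    """Four-argument dynamics p(s', r | s, a); unlisted outcomes have probability 0."""
    return DYNAMICS[(state, action)].get((next_state, reward), 0.0)


def is_normalized(dynamics):
    """Check that the probabilities over (s', r) sum to 1 for every (s, a)."""
    return all(
        math.isclose(math.fsum(outcomes.values()), 1.0)
        for outcomes in dynamics.values()
    )


print(p("low", 1.0, "high", "search"))  # 0.25
print(is_normalized(DYNAMICS))  # True
```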

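From the four-argument dynamics $p$ one can compute the _state-transition probabilities_, written (with a slight abuse of notation) as a three-argument function $p : \mathcal{S} \times \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$, where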

$$
p(s' | s, a) \doteq Pr\{S_t = s' | S_{t-1} = s, A_{t-1} = a \} = \sum_{r \in \mathcal{R}} p(s', r | s, a).
$$
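As a rough illustration of this marginalisation, the sketch below sums out the reward from a small dynamics table. The table `DYNAMICS` and the function name `transition_prob` are hypothetical; the table deliberately includes two different rewards leading to the same next state so that the sum has more than one term.

```python
import math

# Hypothetical toy dynamics: DYNAMICS[(s, a)] maps (s', r) -> probability.
DYNAMICS = {
    ("high", "search"): {("high", 1.0): 0.5, ("high", 2.0): 0.25, ("low", 1.0): 0.25},
    ("high", "wait"): {("high", 0.0): 1.0},
    ("low", "search"): {("low", 1.0): 0.5, ("high", -3.0): 0.5},
    ("low", "wait"): {("low", 0.0): 1.0},
}


def transition_prob(next_state, state, action):
    """p(s' | s, a): sum p(s', r | s, a) over all rewards r."""
    return math.fsum(
        prob
        for (s_next, _reward), prob in DYNAMICS[(state, action)].items()
        if s_next == next_state
    )


print(transition_prob("high", "high", "search"))  # 0.75
print(transition_prob("low", "high", "wait"))  # 0.0
```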

The _expected rewards for state-action pairs_ can likewise be computed from $p$, as a two-argument function $r : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, where

$$
r(s, a) \doteq \mathbb{E}[R_t | S_{t-1} = s, A_{t-1} = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r | s, a).
$$
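The double sum above can be computed in a single pass over the outcomes of a state-action pair, since each outcome $(s', r)$ contributes $r \cdot p(s', r | s, a)$. The following is a minimal sketch under the assumption of a small hypothetical table `DYNAMICS`; the function name `expected_reward` is likewise illustrative.

```python
import math

# Hypothetical toy dynamics: DYNAMICS[(s, a)] maps (s', r) -> probability.
DYNAMICS = {
    ("high", "search"): {("high", 1.0): 0.5, ("high", 2.0): 0.25, ("low", 1.0): 0.25},
    ("high", "wait"): {("high", 0.0): 1.0},
    ("low", "search"): {("low", 1.0): 0.5, ("high", -3.0): 0.5},
    ("low", "wait"): {("low", 0.0): 1.0},
}


def expected_reward(state, action):
    """r(s, a): sum of r * p(s', r | s, a) over all outcomes (s', r)."""
    return math.fsum(
        reward * prob
        for (_s_next, reward), prob in DYNAMICS[(state, action)].items()
    )


print(expected_reward("high", "search"))  # 1.25
print(expected_reward("low", "search"))  # -1.0
```

For `("high", "search")` the terms are $1 \cdot 0.5 + 2 \cdot 0.25 + 1 \cdot 0.25 = 1.25$.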

_Expected rewards for state-action-next-state triples_, $r : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$, where

$$
r(s, a, s') \doteq \mathbb{E}[R_t | S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r \in \mathcal{R}} r \frac{p(s', r | s, a)}{p(s' | s, a)}.
$$
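The ratio in this formula is a conditional probability of the reward given the next state, so it is only defined when $p(s' | s, a) > 0$. The sketch below makes that explicit by raising an error otherwise. The table `DYNAMICS` and the function name `expected_reward_given_next` are hypothetical, and raising `ValueError` is one possible design choice rather than anything the chapter prescribes.

```python
import math

# Hypothetical toy dynamics: DYNAMICS[(s, a)] maps (s', r) -> probability.
DYNAMICS = {
    ("high", "search"): {("high", 1.0): 0.5, ("high", 2.0): 0.25, ("low", 1.0): 0.25},
    ("high", "wait"): {("high", 0.0): 1.0},
    ("low", "search"): {("low", 1.0): 0.5, ("high", -3.0): 0.5},
    ("low", "wait"): {("low", 0.0): 1.0},
}


def expected_reward_given_next(state, action, next_state):
    """r(s, a, s'): sum over r of r * p(s', r | s, a) / p(s' | s, a)."""
    matching = [
        (reward, prob)
        for (s_next, reward), prob in DYNAMICS[(state, action)].items()
        if s_next == next_state
    ]
    p_next = math.fsum(prob for _reward, prob in matching)  # p(s' | s, a)
    if p_next == 0.0:
        raise ValueError("r(s, a, s') is undefined when p(s' | s, a) = 0")
    return math.fsum(reward * prob for reward, prob in matching) / p_next


print(expected_reward_given_next("high", "search", "low"))  # 1.0
print(expected_reward_given_next("high", "search", "high"))  # about 1.333
```

As a sanity check, weighting these values by $p(s' | s, a)$ recovers the two-argument expected reward: $0.75 \cdot \tfrac{4}{3} + 0.25 \cdot 1 = 1.25$.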

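The boundary between agent and environment is typically not the physical boundary of a robot's or animal's body. An agent's sensory receptors should be considered part of the environment rather than part of the agent. Rewards, too, may be computed inside the physical body of a natural or artificial learning system, yet they are considered external to the agent.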

Anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment. The agent-environment boundary represents the limit of the agent's _absolute control_, not of its knowledge.