# Finite Markov Decision Processes

**Markov Decision Processes** (MDPs) are a classical formalization of sequential decision making,  
where actions influence not just immediate rewards, but also subsequent situations,  
or states, and through those future rewards.  

Thus MDPs involve delayed reward and the need to trade off immediate and delayed reward.  
Whereas in bandit problems we estimated the value $q_*(a)$ of each action $a$,  
in MDPs we estimate the value $q_*(s, a)$ of each action $a$ in each state $s$,  
or we estimate the value $v_*(s)$ of each state given optimal action selections.

These state-dependent quantities are essential to accurately assigning credit for long-term  
consequences to individual action selections.

MDPs are a mathematically idealized form of the reinforcement learning problem  
for which precise theoretical statements can be made. We introduce key elements of  
the problem’s mathematical structure, such as returns, value functions, and Bellman  
equations.

## The Agent–Environment Interface

MDPs are meant to be a straightforward framing of the problem of learning from  
interaction to achieve a goal. The learner and decision maker is called the **agent**.  
The thing it interacts with, comprising everything outside the agent, is called the **environment**.  

These interact continually, the agent selecting actions and the environment responding to  
these actions and presenting new situations to the agent. The environment also gives  
rise to rewards, special numerical values that the agent seeks to maximize over time  
through its choice of actions.

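The agent and environment interact at each of a sequence of discrete time steps, $t = 0, 1, 2, 3, \ldots$  
At each step $t$ the agent receives a state $S_t$ and selects an action $A_t$; one step later it receives a reward $R_{t+1}$ and finds itself in a new state $S_{t+1}$.  
This gives rise to a sequence (or **trajectory**):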

$S_0$ -> agent -> $A_0$ -> env -> $R_1$, $S_1$ -> agent -> $A_1$ -> env -> ...
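
The interaction loop can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the two-state environment, the names `env_step`, `random_policy`, and `rollout`, and all numbers are assumptions made up for the example.

```python
import random

# Hypothetical two-state environment; every name and number is illustrative.
STATES = ["low", "high"]
ACTIONS = ["wait", "search"]


def env_step(state, action, rng):
    """Environment side: map (S_t, A_t) to (R_{t+1}, S_{t+1})."""
    if action == "search":
        reward = 1
        # Searching may drain the state to "low"; otherwise it stays put.
        next_state = "low" if rng.random() < 0.5 else state
    else:
        reward = 0
        next_state = "high"  # waiting always restores the "high" state
    return reward, next_state


def random_policy(state, rng):
    """Agent side: pick A_t from S_t (here uniformly at random)."""
    return rng.choice(ACTIONS)


def rollout(num_steps, seed=0):
    """Return the trajectory [S_0, A_0, R_1, S_1, A_1, R_2, S_2, ...]."""
    rng = random.Random(seed)
    state = "high"
    trajectory = [state]  # S_0
    for _ in range(num_steps):
        action = random_policy(state, rng)            # agent acts
        reward, state = env_step(state, action, rng)  # environment responds
        trajectory.extend([action, reward, state])
    return trajectory


if __name__ == "__main__":
    print(rollout(3))
```

Note that the reward produced by action $A_t$ is stored next to the following state, matching the convention that $R_{t+1}$ and $S_{t+1}$ are determined jointly.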

In a finite MDP, the sets of states, actions, and rewards ($S$, $A$, and $R$) all have a finite  
number of elements. In this case, the random variables $R_t$ and $S_t$ have well-defined  
discrete probability distributions dependent only on the preceding state and action.  
That is, for particular values of these random variables, $s' \in S$ and $r \in R$, there is a probability  
of those values occurring at time $t$, given particular values of the preceding state and
action:

$p(s', r | s, a) = Pr\{S_t=s', R_t = r | S_{t-1} = s, A_{t-1} = a\}$  
$\forall s', s \in S, \forall r \in R, \forall a \in A(s)$

The function $p$ defines the **dynamics** of the MDP.

The dynamics function $p : S \times R \times S \times A \to [0, 1]$ is an ordinary deterministic function of four arguments.  
The ‘|’ in the middle of it comes from the notation for conditional probability,  
but here it just reminds us that $p$ specifies a probability distribution for each choice of $s$ and $a$, that is, that  

$\sum_{s' \in S} \sum_{r \in R} p(s', r | s, a) = 1, \forall s \in S, \forall a \in A(s)$
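
For a small finite MDP, one way to make this concrete is to store the dynamics as a table and check the normalization directly. The sketch below assumes a hypothetical two-state MDP; the table `DYNAMICS`, the helper names, and every probability are invented for illustration.

```python
import math

# Hypothetical dynamics table (all values are illustrative assumptions):
# DYNAMICS[(s, a)][(s_next, r)] = p(s_next, r | s, a)
DYNAMICS = {
    ("high", "search"): {("high", 1): 0.5, ("high", 2): 0.2, ("low", 1): 0.3},
    ("high", "wait"): {("high", 0): 1.0},
    ("low", "search"): {("low", 1): 0.6, ("high", -3): 0.4},
    ("low", "wait"): {("low", 0): 1.0},
}


def p(s_next, r, s, a):
    """Four-argument dynamics function; outcomes not listed have probability 0."""
    return DYNAMICS[(s, a)].get((s_next, r), 0.0)


def is_normalized(dynamics, tol=1e-9):
    """Check that the probabilities sum to 1 for every state-action pair."""
    return all(
        math.isclose(sum(outcomes.values()), 1.0, abs_tol=tol)
        for outcomes in dynamics.values()
    )


if __name__ == "__main__":
    print(is_normalized(DYNAMICS))
    print(p("low", 1, "high", "search"))
```

Keying the outer dictionary by $(s, a)$ mirrors the role of the ‘|’: each inner dictionary is one complete probability distribution over $(s', r)$ pairs.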

In a Markov decision process, the probabilities given by p completely characterize the
environment’s dynamics.

The state must include information about all aspects of the past agent–environment interaction that make a difference for the future.  
If it does, then the state is said to have the **Markov property**.

From the four-argument dynamics function, p, one can compute anything else one might
want to know about the environment, such as the state-transition probabilities (which we
denote, with a slight abuse of notation, as a three-argument function $p : S \times S \times A \to [0, 1]$):

State-transition probabilities:

$p(s' | s, a) = Pr\{S_t = s' | S_{t-1} = s, A_{t-1} = a\} = \sum_{r \in R} p(s', r | s, a), \forall s', s \in S, \forall a \in A(s)$

Expected rewards for state–action pairs (a two-argument function $r : S \times A \to \mathbb{R}$):

$r(s, a) = \mathbb{E}[R_t | S_{t-1} = s, A_{t-1} = a] = \sum_{r \in R} r \sum_{s' \in S} p(s', r | s, a), \forall s \in S, \forall a \in A(s)$

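Expected rewards for state–action–next-state triples (a three-argument function $r : S \times A \times S \to \mathbb{R}$, defined wherever $p(s' | s, a) > 0$):

$r(s, a, s') = \mathbb{E}[R_t | S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r \in R} r \frac{p(s', r | s, a)}{p(s' | s, a)}, \forall s, s' \in S, \forall a \in A(s)$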
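
The three derived quantities above can be computed mechanically from a tabular $p$. The following is a sketch under stated assumptions: the table `DYNAMICS` and the function names are hypothetical, and the probabilities are made up so that one next state is reachable with two different rewards.

```python
# Hypothetical dynamics table (all values are illustrative assumptions):
# DYNAMICS[(s, a)][(s_next, r)] = p(s_next, r | s, a)
DYNAMICS = {
    ("high", "search"): {("high", 1): 0.5, ("high", 2): 0.2, ("low", 1): 0.3},
    ("high", "wait"): {("high", 0): 1.0},
    ("low", "search"): {("low", 1): 0.6, ("high", -3): 0.4},
    ("low", "wait"): {("low", 0): 1.0},
}


def transition_prob(s_next, s, a):
    """p(s' | s, a): marginalize the dynamics over rewards."""
    return sum(
        prob for (sn, _r), prob in DYNAMICS[(s, a)].items() if sn == s_next
    )


def expected_reward(s, a):
    """r(s, a): sum of r * p(s', r | s, a) over all (s', r) outcomes."""
    return sum(r * prob for (_sn, r), prob in DYNAMICS[(s, a)].items())


def expected_reward_given_next(s, a, s_next):
    """r(s, a, s'): expected reward conditioned on landing in s'."""
    denom = transition_prob(s_next, s, a)
    if denom == 0:
        # The conditional expectation is undefined for unreachable s'.
        raise ValueError("s_next has probability 0 under (s, a)")
    numer = sum(
        r * prob for (sn, r), prob in DYNAMICS[(s, a)].items() if sn == s_next
    )
    return numer / denom


if __name__ == "__main__":
    print(transition_prob("high", "high", "search"))
    print(expected_reward("high", "search"))
    print(expected_reward_given_next("high", "search", "high"))
```

In this made-up table, `("high", "search")` reaches `"high"` with reward 1 or 2, so $r(s, a, s')$ is a genuine weighted average rather than a single table entry. Raising an error for unreachable next states is one possible design choice; returning a sentinel value would be another.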

The MDP framework is abstract and flexible and can be applied to many different
problems in many different ways.