## Markov Decision Process

A **Markov Decision Process** (MDP) is a classical formalization of sequential decision making, where actions influence not just immediate rewards but also subsequent situations, or states, and through those, future rewards. Thus, MDPs involve delayed reward and the need to trade off immediate and delayed reward.

Unlike in bandit problems, in MDPs we estimate the value $q_*(s, a)$ of each action $a$ in each state $s$, or the value $v_*(s)$ of each state given optimal action selections. These state-dependent quantities are essential for accurately assigning credit for long-term consequences to individual action selections.

In an MDP, the *agent*, which is trying to learn and to maximize the cumulative reward obtained over time, interacts with an *environment*. The agent selects actions, and the environment responds to those actions by presenting a new situation, along with a reward, to the agent.

In a *finite* MDP, the sets of states, actions and possible rewards ($S$, $A$, and $R$) all have a finite number of elements. In this case the random variables $R_t$ and $S_t$ have well-defined discrete probability distributions that depend only on the preceding state and action:

$$p(s',r|s,a) = Pr\left\{S_t=s', R_t=r | S_{t-1}=s, A_{t-1}=a\right\}$$

This gives the probability of the new state $s'$ and reward $r$, given the environment's current state $s$ and the agent's action $a$. The function $p$ defines the *dynamics* of the MDP. The probability of each possible value of $S_t$ and $R_t$ depends only on the immediately preceding state and action, $S_{t-1}$ and $A_{t-1}$, and not on anything earlier. This is best viewed as a restriction on the *state*: it must include information about all aspects of the past agent-environment interaction that make a difference for the future. A state that does so is said to have the *Markov property*.
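As a minimal sketch, the dynamics function of a finite MDP can be stored as a lookup table. The two-state MDP below is hypothetical (loosely inspired by a recycling robot, with made-up probabilities and rewards), and the names `DYNAMICS`, `p`, and `is_valid_dynamics` are illustrative only.

```python
# Hypothetical toy MDP loosely inspired by a recycling robot; every number is made up.
# DYNAMICS[(s, a)] maps each outcome (s_next, r) to p(s_next, r | s, a).
DYNAMICS = {
    ("high", "search"): {("high", 2.0): 0.75, ("low", 2.0): 0.25},
    ("high", "wait"): {("high", 1.0): 1.0},
    ("low", "search"): {("low", 2.0): 0.5, ("high", -3.0): 0.5},
    ("low", "wait"): {("low", 1.0): 1.0},
    ("low", "recharge"): {("high", 0.0): 1.0},
}


def p(s_next, r, s, a, dynamics=DYNAMICS):
    """Return p(s_next, r | s, a); outcomes that are not listed have probability 0."""
    return dynamics.get((s, a), {}).get((s_next, r), 0.0)


def is_valid_dynamics(dynamics, tol=1e-9):
    """Check that, for every (s, a), the probabilities of all (s', r) outcomes sum to 1."""
    return all(
        abs(sum(outcomes.values()) - 1.0) <= tol for outcomes in dynamics.values()
    )
```

Because the table is keyed only by the current state and action, it encodes the Markov property by construction: nothing earlier than $(s, a)$ can influence the outcome distribution.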

From the four-argument dynamics function $p$, we can compute anything else we might want to know about the environment, including the *state-transition probabilities*:

$$p(s'|s,a) = Pr\left\{S_t=s'|S_{t-1}=s,A_{t-1}=a\right\} = \sum_{r \in R}p(s',r|s,a)$$

or *expected rewards* for state-action pairs:

$$r(s,a) = E[R_t|S_{t-1}=s, A_{t-1}=a]=\sum_{r \in R}r\sum_{s' \in S}p(s', r|s,a)$$

and the expected rewards for state-action-next-state triples:

$$r(s, a, s') = E[R_t|S_{t-1}=s, A_{t-1}=a, S_t=s'] = \sum_{r \in R}r\frac{p(s',r|s,a)}{p(s'|s,a)}$$
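The three derived quantities above can be computed directly from a tabular dynamics function. The sketch below assumes a hypothetical toy MDP with made-up numbers, stored as a dictionary from `(s, a)` to outcome probabilities; the function names are illustrative.

```python
# Hypothetical toy MDP; DYNAMICS[(s, a)] maps (s_next, r) to p(s_next, r | s, a).
DYNAMICS = {
    ("high", "search"): {("high", 2.0): 0.75, ("low", 2.0): 0.25},
    ("high", "wait"): {("high", 1.0): 1.0},
    ("low", "search"): {("low", 2.0): 0.5, ("high", -3.0): 0.5},
    ("low", "wait"): {("low", 1.0): 1.0},
    ("low", "recharge"): {("high", 0.0): 1.0},
}


def transition_prob(dynamics, s, a, s_next):
    """p(s' | s, a): sum p(s', r | s, a) over all rewards r."""
    return sum(prob for (sn, _r), prob in dynamics[(s, a)].items() if sn == s_next)


def expected_reward(dynamics, s, a):
    """r(s, a): sum of r * p(s', r | s, a) over all next states and rewards."""
    return sum(r * prob for (_sn, r), prob in dynamics[(s, a)].items())


def expected_reward_given_next(dynamics, s, a, s_next):
    """r(s, a, s'): expected reward conditioned on also landing in s'."""
    p_next = transition_prob(dynamics, s, a, s_next)
    if p_next == 0.0:
        raise ValueError("s_next is unreachable from (s, a); r(s, a, s') is undefined")
    weighted = sum(r * prob for (sn, r), prob in dynamics[(s, a)].items() if sn == s_next)
    return weighted / p_next
```

For example, with these made-up numbers, searching in the `low` state has expected reward $0.5 \cdot 2 + 0.5 \cdot (-3) = -0.5$, while the expected reward conditioned on ending up in `high` is $-3$.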

In general, there is a clear boundary between agent and environment, although what is stored and recorded where in the code can depend on the problem. The boundary represents the limit of the agent's *absolute control*, not of its knowledge. The general rule is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment. This includes the set of possible actions and the physical structure of the agent (muscles, skeleton, sensors), as well as the rewards for each action.

An example of a simple MDP, a recycling robot, is shown in the graph below:

![](./resources/recycling-robot-mdp.jpg)

### Rewards and episodes

In general, the aim of the agent is to maximize the cumulative reward it receives in the long run. This is formalized through the *return* $G_t$, a function of the sequence of rewards received after time step $t$. Because rewards are random, a single observed sequence can be noisy, so what we actually want to maximize is the *expected return*. In the simplest case, the return is the sum of the rewards:

$$G_t = R_{t+1}+R_{t+2}+R_{t+3}+\dots+R_T$$

where $T$ is the final time step, often known as the *terminal step*. When the terminal step is reached, the environment *resets* to a starting state and the agent has a chance to interact with it again, hopefully taking better actions and accumulating more reward this time. The next episode begins independently of how the previous one ended. Tasks with episodes of this kind are known as *episodic tasks*, in contrast to *continuing tasks*, where the agent-environment interaction does not break naturally into identifiable episodes but goes on continually without limit.

In addition to the concept of rewards, we can add *discounting*, where the agent tries to select actions so that the sum of the discounted rewards it receives over the future is maximized. In other words, more immediate rewards are weighted more heavily than those expected further in the future.

$$G_t=R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\dots = \sum^{\infty}_{k=0}\gamma^k R_{t+k+1}$$

where $\gamma$, with $0 \leq \gamma \leq 1$, is a parameter called the *discount rate*. With $\gamma = 0$ the agent is only concerned with the most immediate reward and acts to maximize $R_{t+1}$, while as $\gamma \rightarrow 1$ the return takes future rewards into account more strongly and the agent becomes more farsighted. If $\gamma < 1$ and the rewards are bounded, the infinite sum is finite.
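For a finite list of observed rewards, the discounted return can be computed with a single backward pass, using the identity $G_t = R_{t+1} + \gamma G_{t+1}$. This is a minimal sketch; the function name `discounted_return` is illustrative.

```python
def discounted_return(rewards, gamma):
    """Compute G_t for rewards = [R_{t+1}, R_{t+2}, ...] observed until termination.

    Works backwards from the last reward, applying G = R + gamma * G_next,
    with the return after the final reward taken to be 0.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With `rewards = [1, 1, 1]`, a discount rate of `0.5` gives $1 + 0.5 + 0.25 = 1.75$, `gamma = 0` keeps only the first reward, and `gamma = 1` recovers the plain undiscounted sum.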

### Policies and value functions

Almost all reinforcement learning algorithms involve estimating *value functions* - functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or to take a given action in a given state). By *how good* we mean the future rewards that can be expected, expressed as the expected return, which of course depends on what actions the agent will take. Thus, value functions are defined with respect to particular ways of acting, called *policies*.

A policy is a mapping from states to probabilities of selecting each possible action. Formally, if the agent is following policy $\pi$ at time $t$, then $\pi(a|s)$ is the probability that $A_t=a$ if $S_t=s$. Reinforcement learning methods specify how the agent's policy is changed as a result of its experience.
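A stochastic policy over a finite state and action set can be sketched as a nested dictionary, with action selection done by weighted sampling. The states, actions, and probabilities below are hypothetical, and `sample_action` is an illustrative helper.

```python
import random

# Hypothetical policy: POLICY[s][a] is pi(a | s); each row sums to 1.
POLICY = {
    "high": {"search": 0.75, "wait": 0.25},
    "low": {"search": 0.25, "wait": 0.25, "recharge": 0.5},
}


def sample_action(policy, state, rng):
    """Draw A_t ~ pi(. | S_t = state) using the supplied random number generator."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded only to make the sketch reproducible
action = sample_action(POLICY, "low", rng)
```

A deterministic policy is the special case in which one action in each state has probability 1.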

The value function of a state $s$ under a policy $\pi$ is denoted by $v_\pi(s)$, and it specifies the expected return when starting in $s$ and following $\pi$ thereafter: $v_\pi(s) = E_\pi[G_t|S_t=s]$. The function $v_\pi$ is called the **state-value function for policy $\pi$**.

Similarly, we can define the **action-value function for policy $\pi$**, denoted by $q_\pi$. It gives the value of taking action $a$ in state $s$ under policy $\pi$, that is, the expected return when starting from $s$, taking the action $a$, and thereafter following policy $\pi$: $q_\pi(s,a) = E_\pi[G_t|S_t=s, A_t=a]$.

Both $v_\pi$ and $q_\pi$ can be estimated from experience. We can, for example, follow a policy $\pi$ and maintain, for each state encountered, the average of the actual returns that have followed that state - as the number of visits grows, the average converges to the state's value $v_\pi(s)$. If separate averages are kept for each action taken in each state, then these averages similarly converge to $q_\pi(s, a)$. We call these estimation methods **Monte Carlo** methods, because they involve averaging over many random samples of actual returns.
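The averaging idea can be sketched as follows. This version is an every-visit variant (every occurrence of a state in an episode contributes a return), which is one possible choice among several; the episode format and the name `mc_state_values` are assumptions made for illustration.

```python
from collections import defaultdict


def mc_state_values(episodes, gamma=1.0):
    """Every-visit Monte Carlo estimate of v_pi from complete episodes.

    Each episode is a list of (state, reward) pairs generated by following pi,
    where `reward` is the reward R_{t+1} received after leaving S_t = state.
    """
    totals = defaultdict(float)  # sum of returns observed after each state
    counts = defaultdict(int)    # number of returns observed for each state
    for episode in episodes:
        g = 0.0
        # Walk backwards so the return can be accumulated as G = R + gamma * G.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            totals[state] += g
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}


# Hand-written episodes, purely for illustration.
episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("A", 0.0), ("B", 3.0)],
    [("B", 5.0)],
]
estimates = mc_state_values(episodes)
```

Here state `A` is followed by returns 1 and 3 (average 2), and state `B` by returns 1, 3, and 5 (average 3). Keeping the same averages per `(state, action)` pair instead of per state would estimate $q_\pi$.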

We can calculate $v_\pi$ using the **Bellman equation**:

$$v_\pi(s)=\sum_a \pi(a|s) \sum_{s',r}p(s',r|s,a)[r+\gamma v_\pi(s')]$$

The equation expresses a relationship between the value of a state and the values of its successor states. It averages over all possibilities, weighting each by its probability of occurring, where a possibility is a combination of action $a$, next state $s'$, and reward $r$. The equation states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way: $r+\gamma v_\pi(s')$. This equation forms the basis of a number of ways to compute, approximate, and learn $v_\pi$. These methods update the estimated value of a state using *backup* operations, which transfer value *back* to a state (or state-action pair) from its successor states, and they are at the heart of reinforcement learning.
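One way to use the Bellman equation computationally is to apply its right-hand side repeatedly as a backup until the values stop changing, a procedure commonly called iterative policy evaluation. The sketch below assumes the dynamics are known and tabular; the toy MDP, the policies, and all names are hypothetical.

```python
# Hypothetical toy MDP; DYNAMICS[(s, a)] maps (s_next, r) to p(s_next, r | s, a).
DYNAMICS = {
    ("high", "search"): {("high", 2.0): 0.75, ("low", 2.0): 0.25},
    ("high", "wait"): {("high", 1.0): 1.0},
    ("low", "search"): {("low", 2.0): 0.5, ("high", -3.0): 0.5},
    ("low", "wait"): {("low", 1.0): 1.0},
    ("low", "recharge"): {("high", 0.0): 1.0},
}
# Two hypothetical policies: POLICY[s][a] is pi(a | s).
ALWAYS_WAIT = {"high": {"wait": 1.0}, "low": {"wait": 1.0}}
MIXED = {
    "high": {"search": 0.5, "wait": 0.5},
    "low": {"search": 0.25, "wait": 0.25, "recharge": 0.5},
}


def bellman_backup(dynamics, policy, v, s, gamma):
    """Right-hand side of the Bellman equation for v_pi at state s."""
    return sum(
        pi_a * prob * (r + gamma * v.get(s_next, 0.0))  # unknown states count as terminal
        for a, pi_a in policy[s].items()
        for (s_next, r), prob in dynamics[(s, a)].items()
    )


def policy_evaluation(dynamics, policy, gamma=0.9, tol=1e-10, max_sweeps=100_000):
    """Sweep over all states, backing up values in place, until changes fall below tol."""
    v = {s: 0.0 for s in policy}
    for _ in range(max_sweeps):
        delta = 0.0
        for s in policy:
            new_v = bellman_backup(dynamics, policy, v, s, gamma)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            break
    return v
```

As a sanity check, under `ALWAYS_WAIT` with `gamma = 0.9` every step yields reward 1 forever, so both states should approach $1 / (1 - 0.9) = 10$.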

### Optimal policies and optimal value functions

Value functions define a partial ordering over policies. A policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states. In other words, $\pi \geq \pi'$ iff $v_\pi(s) \geq v_{\pi'}(s)$ for all $s \in S$. There is always at least one policy that is better than or equal to all others; such policies are called *optimal policies*. All optimal policies are denoted by $\pi_*$, and they share the same state-value function, called the *optimal state-value function*, denoted by $v_*$ and defined as: $v_*(s)=\max_\pi v_\pi(s)$

Optimal policies also share the same *optimal action-value function*, denoted by $q_*$ and defined as: $q_*(s, a)=\max_\pi q_\pi(s,a)$
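For reference, the *Bellman optimality equations* satisfied by $v_*$ and $q_*$ can be written in the same notation as the Bellman equation for $v_\pi$, with the average over the policy's action probabilities replaced by a maximum over actions:

$$v_*(s)=\max_a \sum_{s',r}p(s',r|s,a)[r+\gamma v_*(s')]$$

$$q_*(s,a)=\sum_{s',r}p(s',r|s,a)[r+\gamma \max_{a'} q_*(s',a')]$$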

Given the optimal value function $v_*$, it is easy to determine an optimal policy: for each state $s$ there will be one or more actions at which the maximum is obtained in the Bellman optimality equation, and any action that appears best after a one-step search is an optimal action, since it yields the highest expected return. The key property of $v_*$ is that, due to its recursive nature, it already incorporates the reward consequences of all possible future behavior. Thus, the optimal expected long-term return is turned into a quantity that is locally and immediately available for each state.

With $q_*$ finding optimal actions is even easier, as we do not have to do one-step-ahead search: for any state $s$ it is enough to simply find any action that maximizes $q_*(s,a)$, as the function effectively caches the results of all one-step-ahead searches.
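The difference between the two lookups can be sketched on a tiny hypothetical episodic MDP whose optimal values are easy to verify by hand. Selecting an action from $v_*$ needs the dynamics for a one-step lookahead, while selecting from $q_*$ is a plain table lookup. All names and numbers here are made up for illustration.

```python
# Hypothetical deterministic MDP: from s0, action "a" leads to s1 (reward 0),
# action "b" ends the episode (reward 1); from s1, action "c" ends it (reward 5).
DYNAMICS = {
    ("s0", "a"): {("s1", 0.0): 1.0},
    ("s0", "b"): {("end", 1.0): 1.0},
    ("s1", "c"): {("end", 5.0): 1.0},
}
GAMMA = 0.5
# Optimal state values, worked out by hand: v*(s1) = 5, v*(s0) = max(0.5 * 5, 1) = 2.5.
V_STAR = {"s0": 2.5, "s1": 5.0, "end": 0.0}


def q_from_v(dynamics, v, s, a, gamma):
    """One-step lookahead: expected reward plus discounted value of the next state."""
    return sum(
        prob * (r + gamma * v[s_next]) for (s_next, r), prob in dynamics[(s, a)].items()
    )


def greedy_from_v(dynamics, v, s, gamma):
    """Pick an action that maximizes the one-step lookahead (requires the dynamics)."""
    actions = [a for (state, a) in dynamics if state == s]
    return max(actions, key=lambda a: q_from_v(dynamics, v, s, a, gamma))


# q* caches the result of every one-step lookahead.
Q_STAR = {(s, a): q_from_v(DYNAMICS, V_STAR, s, a, GAMMA) for (s, a) in DYNAMICS}


def greedy_from_q(q, s):
    """Pick an action that maximizes q(s, a); no model of the environment is needed."""
    actions = [a for (state, a) in q if state == s]
    return max(actions, key=lambda a: q[(s, a)])
```

In `s0`, action `b` has the larger immediate reward, yet both lookups select `a`, because the optimal values already account for the reward of 5 available one step later.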

Solving the Bellman optimality equations is one way of finding an optimal policy. However, this solution is rarely directly useful, as it is akin to an exhaustive search: looking ahead at all possibilities, computing their probabilities of occurrence and their desirabilities in terms of expected rewards. It also relies on three assumptions that are rarely true in practice: (1) we accurately know the dynamics of the environment; (2) we have enough computational resources to complete the computation of the solution; (3) the states have the Markov property. Instead, many reinforcement learning methods can be understood as approximately solving the Bellman optimality equation, using actual experienced transitions in place of knowledge of the expected transitions.
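When all three assumptions do hold and the MDP is small, the Bellman optimality equation for $v_*$ can be solved numerically by applying it repeatedly as an update, a procedure usually called value iteration. The sketch below is an illustration under those assumptions, on a hypothetical three-state episodic MDP with made-up numbers; it is not how larger problems are handled in practice.

```python
# Hypothetical deterministic MDP with known dynamics:
# DYNAMICS[(s, a)] maps (s_next, r) to p(s_next, r | s, a); "end" is terminal.
DYNAMICS = {
    ("s0", "a"): {("s1", 0.0): 1.0},
    ("s0", "b"): {("end", 1.0): 1.0},
    ("s1", "c"): {("end", 5.0): 1.0},
}


def value_iteration(dynamics, gamma, tol=1e-10, max_sweeps=100_000):
    """Approximate v* by repeatedly applying the Bellman optimality backup.

    Each sweep sets v(s) to the best one-step lookahead value over the actions
    available in s. States without any listed action are treated as terminal
    and keep value 0.
    """
    states = sorted({s for (s, _a) in dynamics})
    v = {s: 0.0 for s in states}
    for _ in range(max_sweeps):
        delta = 0.0
        for s in states:
            best = max(
                sum(
                    prob * (r + gamma * v.get(s_next, 0.0))
                    for (s_next, r), prob in outcomes.items()
                )
                for (state, _a), outcomes in dynamics.items()
                if state == s
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            break
    return v


v_star = value_iteration(DYNAMICS, gamma=0.5)
```

Every sweep touches every state, action, and outcome, which is the exhaustive look-ahead described above and the reason this approach does not scale to large state spaces.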