# Markov decision processes

In the previous chapter on bandits, taking an action did not change the environment. In this chapter we consider environments where the state of the world changes after each action. We call these environments Markov decision processes (MDPs). We also consider the problem of finding the best policy for an MDP, which is called the policy optimization problem.

![](media/chapter-3/MDP-schema.png)

The figure above shows the schema of a general RL process:

* An agent takes an action

* The action perturbs the environment and the environment returns a reward and a new state

* The agent uses the reward and the new state to update its policy

* The cycle continues. 

Compared to the k-armed bandit examples, in MDP scenarios each action alters the state of the environment that the agent operates in. Every MDP therefore produces a sequence of states, actions and rewards:

$$S_{0}, A_{0}, R_{1}, S_{1}, A_{1}, R_{2}, ...$$
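The sequence above can be generated with a short interaction loop. The following is a minimal sketch, not a definitive implementation: the two-state environment in `step`, its 0.8 flip probability, its reward rule, and the random policy are all hypothetical and chosen only to make the indexing of $S_{t}, A_{t}, R_{t+1}, S_{t+1}$ concrete.

```python
import random


def step(state, action, rng):
    """Hypothetical toy environment with states 0 and 1.

    Action 1 flips the state with probability 0.8; action 0 keeps it.
    The reward is 1.0 when the new state is 1 and 0.0 otherwise.
    """
    if action == 1 and rng.random() < 0.8:
        new_state = 1 - state
    else:
        new_state = state
    reward = 1.0 if new_state == 1 else 0.0
    return reward, new_state


def rollout(n_steps, seed=0):
    """Return a list of (S_t, A_t, R_{t+1}, S_{t+1}) tuples."""
    rng = random.Random(seed)
    state = 0  # S_0
    trajectory = []
    for _ in range(n_steps):
        action = rng.choice([0, 1])  # A_t, drawn from a random policy
        reward, new_state = step(state, action, rng)  # R_{t+1}, S_{t+1}
        trajectory.append((state, action, reward, new_state))
        state = new_state  # the new state becomes the current state
    return trajectory


if __name__ == "__main__":
    for transition in rollout(5):
        print(transition)
```

Note that the new state of one step is the starting state of the next, which is exactly what distinguishes this setting from the bandit case.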

In an MDP, we can define three sets: 

$\mathbb{S}$ - the set of all possible states 

$\mathbb{A}$ - the set of all possible actions

$\mathbb{R}$ - the set of all possible rewards

The dynamics of an MDP can then be defined as the probability of reaching state $s^{*}$ and receiving reward $r$, given the previous state $s$ and action $a$:

$p(s^{*}, r| s, a) = P(S_{t} = s^{*}, R_{t} = r | S_{t-1} = s, A_{t-1}=a)$

$S_{t} \in \mathbb{S}, R_{t} \in \mathbb{R}, A_{t} \in \mathbb{A}$  $\forall t$

Because $p$ is a probability distribution over next states and rewards, for every state $s$ and action $a$ it must hold that:

$$ \sum_{s^{*} \in \mathbb{S}} \sum_{r \in \mathbb{R}} p(s^{*}, r| s, a) = 1 $$
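The normalization condition is easy to check in code. The sketch below stores $p(s^{*}, r | s, a)$ as a nested dictionary and verifies that the probabilities sum to 1 for every state-action pair. The state names, action names, rewards and probabilities are hypothetical placeholders, and the dictionary layout is just one possible representation.

```python
import math

# Hypothetical dynamics for a toy MDP.
# Keys are (s, a); values map (s_next, r) to p(s_next, r | s, a).
dynamics = {
    ("s0", "a0"): {("s0", 0.0): 0.7, ("s1", 1.0): 0.3},
    ("s0", "a1"): {("s1", 1.0): 1.0},
    ("s1", "a0"): {("s0", 0.0): 0.4, ("s1", 1.0): 0.5, ("s1", 2.0): 0.1},
}


def is_normalized(dynamics, tol=1e-9):
    """Check that the outcome probabilities sum to 1 for every (s, a) pair."""
    return all(
        math.isclose(sum(outcomes.values()), 1.0, abs_tol=tol)
        for outcomes in dynamics.values()
    )


if __name__ == "__main__":
    print(is_normalized(dynamics))
```

The sum runs over both the next state and the reward, so two entries with the same next state but different rewards, as in `("s1", "a0")`, count separately.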



# Full MDP example - vacuum robot 

One way to visualize an MDP is to use `transition graphs`. A transition graph is a graph where the nodes are states and the edges are actions. The edges are labeled with the probability of transitioning to a new state and the reward received. 

Let us assume that our robot has two states: low and high battery. The robot has three actions: recharge, wait and vacuum. It can recharge only when it is in the low battery state and vacuum only when it is in the high battery state, while it can wait in either state. The total reward is the amount of vacuumed dust, measured in grams.

$$ \mathbb{S} = \{low, high\} $$

The full action space is:

$$ \mathbb{A} = \{recharge, wait, vacuum\} $$
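Not every action is available in every state. As a small sketch, the availability rules described above can be written as a lookup table; the names `STATES`, `ACTIONS`, `AVAILABLE` and `available_actions` are illustrative and not part of any library.

```python
STATES = ("low", "high")
ACTIONS = ("recharge", "wait", "vacuum")

# Recharging is only possible on low battery, vacuuming only on high battery,
# and waiting is possible in both states.
AVAILABLE = {
    "low": ("recharge", "wait"),
    "high": ("vacuum", "wait"),
}


def available_actions(state):
    """Return the actions the robot may take in the given state."""
    return AVAILABLE[state]


if __name__ == "__main__":
    for state in STATES:
        print(state, available_actions(state))
```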

![](media/chapter-3/graph.png)

The probabilities shown in the graph are estimated from many trial runs. 

In our example, a trial ends when the robot runs out of battery. The reward is the amount of dust collected.
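Estimating transition probabilities from trial runs amounts to counting how often each next state follows a given state-action pair. The sketch below illustrates the idea on simulated data. The value `TRUE_P_STAY_HIGH = 0.7` is a made-up probability used only to generate observations; it is not taken from the graph, and `simulate_observations` stands in for logs that a real robot would produce.

```python
import random
from collections import Counter, defaultdict

# Hypothetical probability that the battery stays high after vacuuming.
TRUE_P_STAY_HIGH = 0.7


def simulate_observations(n, seed=0):
    """Generate n hypothetical (state, action, next_state) records."""
    rng = random.Random(seed)
    observations = []
    for _ in range(n):
        next_state = "high" if rng.random() < TRUE_P_STAY_HIGH else "low"
        observations.append(("high", "vacuum", next_state))
    return observations


def estimate_transitions(observations):
    """Estimate P(next_state | state, action) by relative frequency."""
    counts = defaultdict(Counter)
    for state, action, next_state in observations:
        counts[(state, action)][next_state] += 1
    estimates = {}
    for state_action, counter in counts.items():
        total = sum(counter.values())
        estimates[state_action] = {
            next_state: n / total for next_state, n in counter.items()
        }
    return estimates


if __name__ == "__main__":
    estimates = estimate_transitions(simulate_observations(10_000))
    print(estimates[("high", "vacuum")])
```

With more trial runs the relative frequencies get closer to the underlying probabilities, which is why many runs are needed before the numbers on the graph can be trusted.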