## Markov Decision Process

A Markov Decision Process (MDP) is the framework we use to model sequential decision-making problems. Building on that theory, Dynamic Programming (DP) is the field that proposes solution methods for MDPs. Reinforcement Learning (RL), in some sense, is a collection of approximate DP approaches that enable us to obtain good (but not necessarily optimal) solutions to very complex problems that are infeasible to solve with exact DP methods.

In an MDP, the actions an agent takes have long-term consequences, which is what differentiates MDPs from Multi-Armed Bandit (MAB) problems.

### Markov Chains

Markov chains involve no decision-making. They only model a special type of stochastic process, one that is governed by some internal transition dynamics.

A Markov chain is usually depicted using a directed graph. A Markov chain diagram for the robot example in a 2x2 grid world:
![](img/markov_chain.png)
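
To make the directed-graph view concrete, here is a minimal sketch that stores a chain as a dictionary of outgoing edges and samples a trajectory from it. The dynamics are an assumption made for illustration (a uniformly random walk in which a move into a wall leaves the robot where it is); they are not necessarily the probabilities shown in the diagram, and the names `build_transitions` and `simulate` are hypothetical.

```python
import random

# Hypothetical 2x2 grid world: cells are (row, col) pairs.
STATES = [(0, 0), (0, 1), (1, 0), (1, 1)]
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def build_transitions(states, moves):
    """Return {state: {next_state: probability}} for a uniformly random walk.

    Assumption: each move is chosen with probability 1/4, and a move that
    would leave the grid keeps the robot in its current cell.
    """
    transitions = {}
    for (r, c) in states:
        probs = {}
        for dr, dc in moves:
            nxt = (r + dr, c + dc)
            if nxt not in states:
                nxt = (r, c)  # bumped into a wall, stay put
            probs[nxt] = probs.get(nxt, 0.0) + 1.0 / len(moves)
        transitions[(r, c)] = probs
    return transitions


def simulate(transitions, start, n_steps, rng):
    """Sample a trajectory; the next state depends only on the current one."""
    path = [start]
    for _ in range(n_steps):
        edges = transitions[path[-1]]
        next_states = list(edges.keys())
        weights = list(edges.values())
        path.append(rng.choices(next_states, weights=weights, k=1)[0])
    return path


P = build_transitions(STATES, MOVES)
trajectory = simulate(P, start=(0, 0), n_steps=5, rng=random.Random(0))
```

Each key of `P` is a node of the graph and each inner entry is a weighted edge, so the outgoing weights of every node sum to 1.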

**Tip**: Many systems can be made Markovian by including historical information in the state. Consider a modified robot example where the robot is more likely to continue in the direction it moved in the previous time step. Although such a system seemingly does not satisfy the Markov property, we can simply redefine the state to include the cells visited over the last two time steps, such as *x_t = (s_t, s_{t-1}) = ((0,1), (0,0))*. Under this new state definition, the transition probabilities are independent of the earlier states and the Markov property is satisfied.
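
The state-augmentation idea in the tip can be sketched as follows. The momentum rule is an assumption chosen for illustration: with a hypothetical probability `p_repeat` the robot repeats its last move, otherwise it moves uniformly at random. The point is only that `step` needs nothing beyond the augmented state *(s_t, s_{t-1})*.

```python
import random

CELLS = [(0, 0), (0, 1), (1, 0), (1, 1)]
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def clip(cell):
    """Keep the robot inside the 2x2 grid."""
    return (min(max(cell[0], 0), 1), min(max(cell[1], 0), 1))


def step(x, rng, p_repeat=0.7):
    """One transition of the augmented chain x_t = (s_t, s_{t-1}).

    Assumption: with probability p_repeat the robot repeats its last move,
    otherwise it picks a move uniformly at random.
    """
    s_t, s_prev = x
    last_move = (s_t[0] - s_prev[0], s_t[1] - s_prev[1])
    if last_move in MOVES and rng.random() < p_repeat:
        move = last_move
    else:
        move = rng.choice(MOVES)
    s_next = clip((s_t[0] + move[0], s_t[1] + move[1]))
    return (s_next, s_t)  # new augmented state x_{t+1} = (s_{t+1}, s_t)


rng = random.Random(0)
x = ((0, 1), (0, 0))  # the example state from the text
history = [x]
for _ in range(5):
    x = step(x, rng)
    history.append(x)
```

The price of this trick is a larger state space: the augmented chain has up to 16 states instead of 4.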


#### Classification of states in a Markov chain

If the environment can transition from state *i* to state *j* after some number of steps with a positive probability, we say *j* is **reachable** from *i*. If *i* is also reachable from *j*, those states are said to **communicate**. If all the states in a Markov chain **communicate** with each other, we say that the Markov chain is **irreducible**, which is what we had in our robot example.
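
These definitions translate directly into a graph search over the positive entries of the transition matrix. The sketch below is one possible way to check them; the matrices `P_grid` and `P_reducible` are made-up examples, and the function names are hypothetical.

```python
def reachable_from(P, i):
    """Return the set of states reachable from i in one or more steps."""
    seen, stack = set(), [i]
    while stack:
        u = stack.pop()
        for v, p in enumerate(P[u]):
            if p > 0 and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def communicate(P, i, j):
    """States i and j communicate if each is reachable from the other."""
    return j in reachable_from(P, i) and i in reachable_from(P, j)


def is_irreducible(P):
    """A chain is irreducible if every state is reachable from every state."""
    n = len(P)
    return all(reachable_from(P, i) == set(range(n)) for i in range(n))


# Assumed example: a random walk on four cells where every cell can be reached.
P_grid = [
    [0.50, 0.25, 0.25, 0.00],
    [0.25, 0.50, 0.00, 0.25],
    [0.25, 0.00, 0.50, 0.25],
    [0.00, 0.25, 0.25, 0.50],
]

# Assumed counterexample: state 1 can never return to state 0.
P_reducible = [
    [0.5, 0.5],
    [0.0, 1.0],
]

grid_is_irreducible = is_irreducible(P_grid)
reducible_is_irreducible = is_irreducible(P_reducible)
```

Only the pattern of zero and non-zero entries matters here; the exact probabilities do not change which states are reachable.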

A state *s* is an **absorbing** state if the only possible transition is to itself, that is, *P(s_{t+1} = s | s_t = s) = 1*. An absorbing state is equivalent to a **terminal** state that marks the end of an **episode**. In addition to terminal states, an episode can also terminate after a time limit *T*.
![](img/absorbing.png)
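
As a rough illustration, absorbing states can be read off the diagonal of the transition matrix, and an episode can be simulated until it hits one or runs out of time. The three-state matrix and the names `absorbing_states`, `run_episode`, and `max_steps` (playing the role of the time limit *T*) are assumptions for this sketch.

```python
import random


def absorbing_states(P, tol=1e-12):
    """States whose only possible transition is back to themselves."""
    return [s for s, row in enumerate(P) if abs(row[s] - 1.0) < tol]


def run_episode(P, start, max_steps, rng):
    """Simulate until an absorbing (terminal) state or the time limit."""
    terminal = set(absorbing_states(P))
    path = [start]
    while path[-1] not in terminal and len(path) - 1 < max_steps:
        row = P[path[-1]]
        path.append(rng.choices(range(len(row)), weights=row, k=1)[0])
    return path


# Assumed example: state 2 is absorbing.
P = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]

terminal = absorbing_states(P)
episode = run_episode(P, start=0, max_steps=100, rng=random.Random(0))
```

The episode ends either because the last state is absorbing or because `max_steps` transitions have been taken, mirroring the two termination conditions above.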

#### Transient and recurrent states

A state *s* is called a **transient** state if there is another state *s′* that is reachable from *s*,
but not vice versa. Provided enough time, an environment will eventually move away from
transient states and never come back.
![](img/transient.png)
So, wherever the robot is on the light side, it will eventually transition into the dark side and won't be able to come back. All the states on the light side are transient.
Finally, a state that is not transient is called a **recurrent** state. The states on the dark side are recurrent in this example.
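
The definition above can be applied mechanically to a finite chain: a state is labeled transient if it can reach some state that cannot reach it back. The sketch below assumes a made-up four-state chain in which states 0 and 1 stand in for the light side and states 2 and 3 for the dark side; the probabilities are not taken from the figure.

```python
def reachable_from(P, i):
    """Return the set of states reachable from i in one or more steps."""
    seen, stack = set(), [i]
    while stack:
        u = stack.pop()
        for v, p in enumerate(P[u]):
            if p > 0 and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def classify_states(P):
    """Label each state of a finite chain as 'transient' or 'recurrent'."""
    n = len(P)
    reach = [reachable_from(P, i) for i in range(n)]
    labels = {}
    for s in range(n):
        # s is transient if some state t is reachable from s but not vice versa.
        escapes = any(s not in reach[t] for t in reach[s])
        labels[s] = "transient" if escapes else "recurrent"
    return labels


# Assumed example: states 0-1 can leak into states 2-3, but never the reverse.
P = [
    [0.5, 0.4, 0.1, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
]

labels = classify_states(P)
```

Note that state 1 is transient even though it has no direct edge to the dark side, because it can get there through state 0.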

#### Periodic and aperiodic states

We call a state *s* **periodic** if all of the paths leaving *s* come back to it after some multiple
of *k > 1* steps. Consider the example in Figure 4.5, where all the states have a period
of *k = 4*:

![](img/periodic.png)
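
One way to compute the period of a state is to take the greatest common divisor of all step counts at which the chain can be back in that state. The sketch below does this by brute force; the four-state cycle is an assumed stand-in for the figure, and the function name `period` and the default search horizon of `3 * n` steps are choices made for this example.

```python
from math import gcd


def period(P, s, max_steps=None):
    """Return the gcd of the step counts at which the chain can return to s.

    A result of 1 means s is aperiodic; 0 means s was never revisited
    within the search horizon.
    """
    n = len(P)
    if max_steps is None:
        # For a finite chain, return times up to 3 * n already determine
        # the gcd (two paths of length < n plus one simple cycle of length <= n).
        max_steps = 3 * n
    current = {s}  # states the chain can occupy after `step` transitions
    d = 0
    for step in range(1, max_steps + 1):
        current = {v for u in current for v, p in enumerate(P[u]) if p > 0}
        if s in current:
            d = gcd(d, step)
    return d


# Assumed example: a deterministic cycle 0 -> 1 -> 2 -> 3 -> 0.
P_cycle = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
]

# Same cycle, but state 0 may also stay where it is.
P_lazy = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
]

cycle_periods = [period(P_cycle, s) for s in range(4)]
lazy_periods = [period(P_lazy, s) for s in range(4)]
```

Adding a single self-loop is enough to make every state in the cycle aperiodic, since return times of 4 and 5 steps then both become possible.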

### Ergodicity

## Transitionary and steady state behavior

### Semi-Markov processes and continuous-time Markov chains

All of the examples and formulas we have provided so far are related to discrete-time
Markov chains, which are environments where transitions occur at discrete time steps,
such as every minute or every 10 seconds. In many real-world scenarios, however, the time at
which the next transition happens is also random, which makes the system a semi-Markov process.
In those cases, we are usually interested in predicting the state after a given amount of time
(rather than after a given number of steps).

One example of a scenario where a time component is important is queuing systems – for
instance, the number of customers waiting in a customer service line. A customer could
join the queue anytime and a representative could complete the service with a customer at
any time – not just at discrete time steps. Another example is a work-in-process inventory
waiting in front of an assembly station to be processed in a factory. In all these cases,
analyzing the behavior of the system over time is very important to be able to improve the
system and take action accordingly.
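
The customer service line can be sketched as a small continuous-time simulation. The model here is an assumption made for illustration: a single representative, with exponentially distributed times between arrivals and exponentially distributed service times, which makes it a continuous-time Markov chain. The rates and the function name `queue_length_at` are hypothetical.

```python
import random


def queue_length_at(t_end, arrival_rate, service_rate, rng, start=0):
    """Simulate the queue and return the number of customers at time t_end."""
    t, n = 0.0, start
    while True:
        # Service can only complete when at least one customer is present.
        rate = arrival_rate + (service_rate if n > 0 else 0.0)
        t += rng.expovariate(rate)  # random time until the next transition
        if t > t_end:
            return n  # no further transition happens before t_end
        if rng.random() < arrival_rate / rate:
            n += 1  # a customer joins the queue
        else:
            n -= 1  # the representative finishes with a customer


rng = random.Random(0)
# Assumed rates: 1 arrival and 2 service completions per unit of time.
samples = [queue_length_at(30.0, 1.0, 2.0, rng) for _ in range(1000)]
average_length = sum(samples) / len(samples)
```

Unlike the discrete-time examples, the question answered here is "what is the state at time 30?" rather than "what is the state after 30 transitions?"; the number of transitions differs from run to run.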