# Markov Decision Processes

Notes for the following lecture:    
[Lecture Notes](https://www.youtube.com/watch?v=lfHX2hHRMVQ)     
[Lecture Slides](https://davidstarsilver.wordpress.com/wp-content/uploads/2025/04/lecture-2-mdp.pdf)


## What is an MDP?

A Markov decision process (MDP) formally describes an environment in which an agent can act. In an MDP the environment is **fully observable**, meaning that the agent knows everything about the state of the environment.

Almost all RL problems can be converted into some form of MDP (even partially observable ones).


## The Markov Property

As described in the previous lecture, a state $S_t$ has the Markov property if:

$$
P[S_{t+1}|S_t] = P[S_{t+1}|S_1, ..., S_t]
$$

This means that the future depends only on the present and is independent of the past. The present state contains all the relevant information needed to determine the future, so once it is known the entire history can be thrown away (i.e. the current state is a sufficient statistic of the future).

## The Markov Process

A Markov process is a memoryless random process with some transition dynamics: a sequence of random states $S_1 \rightarrow S_2 \rightarrow \dots \rightarrow S_n$ in which every transition obeys the Markov property.

A Markov process is defined by the following:
1. $S$ is a finite set of states
2. $P$ is a state transition probability matrix: $P_{ss'} = P[S_{t+1} = s' | S_t = s]$


Below is an example of a Markov process. The state is the location of the agent on the grid (x, y indices).

In [2]:
# Let's create an example of a Markov process: a random walk on a grid
import numpy as np
import random

N = 10 # grid size nxn

actions = [ np.array([1, 0]), # right
            np.array([0, 1]), # up
            np.array([-1, 0]), # left
            np.array([0, -1])] # down

# Random policy that equally balances between actions
action_probabilities = [0.25, 0.25, 0.25, 0.25]


def markov_process(s, N, actions, action_probabilities):
    index = np.random.choice(len(actions), p=action_probabilities)
    action = actions[index]
    return np.clip(s + action, 0, N - 1)


# Initialize state randomly
state = np.array([random.randint(0, N - 1), random.randint(0, N - 1)])
print(state)
for _ in range(3):
    # Transition from state s to state s'
    state = markov_process(state, N, actions, action_probabilities)
    print(state)

[3 2]
[4 2]
[4 3]
[4 4]
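
The cell above samples transitions directly and never builds the matrix $P$ from the definition. As a minimal sketch, the same clipped random walk can be written as an explicit transition matrix. The helper name `grid_transition_matrix` and the row-major state indexing are assumptions made for this illustration, and a small 3 x 3 grid is used to keep the matrix readable.

```python
import numpy as np

def grid_transition_matrix(n, moves, move_probs):
    """Build the (n*n, n*n) matrix with P[s, s'] = P[S_{t+1} = s' | S_t = s]."""
    num_states = n * n
    P = np.zeros((num_states, num_states))
    for x in range(n):
        for y in range(n):
            s = x * n + y  # flatten (x, y) into a single state index
            for (dx, dy), p in zip(moves, move_probs):
                # Moves that would leave the grid are clipped to the border
                nx = min(max(x + dx, 0), n - 1)
                ny = min(max(y + dy, 0), n - 1)
                P[s, nx * n + ny] += p
    return P

moves = [(1, 0), (0, 1), (-1, 0), (0, -1)]  # right, up, left, down
P_grid = grid_transition_matrix(3, moves, [0.25] * 4)

# Every row is a probability distribution over next states
print(P_grid.sum(axis=1))
# In the corner (0, 0) two of the four moves are clipped,
# so the agent stays in place with probability 0.5
print(P_grid[0, 0])
```

Each row of `P_grid` sums to 1, which is what makes it a valid state transition probability matrix.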


## Markov Reward Process

A Markov reward process (MRP) is a Markov process with rewards attached. It is defined by the following:
1. $S$ is a finite set of states
2. $P$ is a state transition probability matrix: $P_{ss'} = P[S_{t+1} = s' | S_t = s]$
3. $R$ is a reward function where $R_s = E[R_{t+1} | S_t = s]$
4. $\gamma$ is a discount factor, $\gamma \in [0,1]$


One could ask why the discount factor is needed:
1. It keeps the mathematics tractable: the return does not explode to infinity for non-terminating or cyclic processes.
2. It represents uncertainty about future rewards, which also mimics natural human behavior.
3. If all sequences are guaranteed to terminate, a discount factor equal to 1 can be used.
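
The first point can be checked numerically. The sketch below assumes a hypothetical constant reward of -1 per step and compares the running sum of rewards with and without discounting. The function name `partial_return` is made up for this illustration.

```python
gamma = 0.9
reward = -1.0  # hypothetical constant reward per step

def partial_return(gamma, reward, steps):
    """Sum of gamma^k * reward for k = 0 .. steps - 1."""
    return sum(gamma ** k * reward for k in range(steps))

for steps in (10, 100, 1000):
    undiscounted = partial_return(1.0, reward, steps)
    discounted = partial_return(gamma, reward, steps)
    print(steps, undiscounted, discounted)

# Limit of the geometric series for gamma < 1
print(reward / (1 - gamma))
```

Without discounting the sum grows linearly with the number of steps, while the discounted sum approaches the geometric-series limit $r / (1 - \gamma)$.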

Below is an example of a Markov reward process.

In [3]:
rewards = [-1, # reward for moving right
           -1, # reward for moving up
           -1, # reward for moving left
           -1] # reward for moving down

def markov_reward_process(s, N, actions, action_probabilities, rewards):
    index = np.random.choice(len(actions), p=action_probabilities)
    action = actions[index]
    reward = rewards[index]
    return np.clip(s + action, 0, N - 1), reward


# Initialize state randomly
state = np.array([random.randint(0, N - 1), random.randint(0, N - 1)])
print(state)
for _ in range(3):
    # Transition from state s to state s'
    state, reward = markov_reward_process(state, N, actions, action_probabilities, rewards)
    print(state, "reward for just taking this action", reward)

[9 6]
[9 5] reward for just taking this action -1
[9 4] reward for just taking this action -1
[9 4] reward for just taking this action -1


Now let's define the return $G_t$, the total discounted reward from time step $t$ onward:

$$
G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
$$

If the value of $\gamma$ is close to 0, this leads to "myopic" evaluation.    
If the value of $\gamma$ is close to 1, this leads to "far-sighted" evaluation.
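
To make the effect of $\gamma$ concrete, here is a small sketch that computes the return of a finite reward sequence. The reward sequence and the function name `discounted_return` are hypothetical, chosen so that a large reward arrives only at the end.

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    # Work backwards through the episode: G = R + gamma * G_next
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical episode: small costs followed by a large final reward
episode_rewards = [-1, -1, -1, 10]

print(discounted_return(episode_rewards, 0.0))  # -1.0, only the immediate reward counts
print(discounted_return(episode_rewards, 0.5))  # -0.5
print(discounted_return(episode_rewards, 1.0))  # 7.0, plain sum of rewards
```

With a small $\gamma$ the final reward of 10 barely registers, so the episode looks bad. With $\gamma = 1$ the same episode looks good.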

We can now define the value function.

## Value Function
The value function is defined as

$$
v(s) = E[G_t|S_t = s]
$$

The above is called the state value function.

The value function represents how good it is to be in a specific state: it is the expected sum of discounted rewards collected from that state onward.
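
Because $v(s)$ is an expectation over returns, one way to approximate it is to average many sampled returns. The sketch below does this for a hypothetical 3-state MRP. The transition matrix, rewards, and the helper `sample_return` are assumptions for illustration, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state MRP; state 2 is terminal
P = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([-1.0, -1.0, 0.0])  # R[s] = reward received on leaving state s
gamma = 0.9
TERMINAL = 2

def sample_return(start):
    """Sample one episode from `start` and return its discounted return G_t."""
    s, g, discount = start, 0.0, 1.0
    while s != TERMINAL:
        g += discount * R[s]
        discount *= gamma
        s = rng.choice(3, p=P[s])
    return g

# Average the sampled returns to estimate v(0) = E[G_t | S_t = 0]
returns = [sample_return(0) for _ in range(5000)]
v0_estimate = np.mean(returns)
print(v0_estimate)
```

The estimate fluctuates from run to run when the seed is changed, which is why exact methods are preferable when $P$ and $R$ are known.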

A recursive relationship can be defined using the value function, forming the Bellman equation.

## Bellman Equation
The Bellman equation defines a recursive relationship between the value function and itself: the value of a state is the immediate reward plus the discounted value of the successor state.

$$
v(s) = E[R_{t+1} + \gamma v(S_{t+1}) | S_{t} = s]
$$

You could think of it using the following backup diagram:


![mdp_branching](../images/mdp_branching.png)
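
One way to read the Bellman equation is as an update rule: start from an arbitrary guess for $v$ and repeatedly replace it with the right-hand side until it stops changing. The sketch below applies this to a hypothetical 3-state MRP, storing $v$ as a vector with one entry per state. The numbers and the stopping tolerance are assumptions for illustration.

```python
import numpy as np

# Hypothetical 3-state MRP (state 2 is absorbing with zero reward)
P = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([-1.0, -1.0, 0.0])
gamma = 0.9

v = np.zeros(3)
for sweep in range(1000):
    # Bellman backup: immediate reward plus discounted expected value of the successor
    v_new = R + gamma * P @ v
    delta = np.max(np.abs(v_new - v))
    v = v_new
    if delta < 1e-10:
        break

print(v)
```

For $\gamma < 1$ each backup shrinks the error, so the loop settles on a vector that satisfies the Bellman equation up to the chosen tolerance.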

The Bellman equation can also be defined in matrix form:

$$
v = R + \gamma P v
$$

where $P$ is the state transition probability matrix, and $v$ and $R$ are column vectors with one entry per state.

Because this equation is linear, it can be solved directly. Rearranging gives $(I - \gamma P) v = R$, where $I$ is the identity matrix, so:

$$
v = (I - \gamma P)^{-1} R
$$

In [32]:
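# Let's attempt solving for the value function in matrix form
# Consider the below example where you have 3 states and S3 is absorbing
#  S1 -> S2 -> S3


# Let's build the reward vector: -10 in S1 and S2, 0 in S3

R = np.array([-10, -10, 0])
R = R.reshape((3,1))

# With gamma = 1, (I - gamma * P) would be singular because S3 is absorbing
gamma = 0.999999999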

P = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 1]])


v = np.linalg.inv(np.eye(3,3) - gamma * P).dot(R)

v

array([[-19.99999999],
       [-10.        ],
       [  0.        ]])
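
As a sanity check on the closed-form solution, the sketch below solves the same 3-state chain with `np.linalg.solve`, which avoids forming the inverse explicitly, and verifies that the result satisfies $v = R + \gamma P v$. The value `gamma = 0.9` is a hypothetical choice that makes the numbers easy to verify by hand.

```python
import numpy as np

R = np.array([-10.0, -10.0, 0.0])
P = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 1]], dtype=float)
gamma = 0.9  # hypothetical choice, smaller than in the cell above

# Solve the linear system (I - gamma * P) v = R
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)

# The solution should satisfy the Bellman equation v = R + gamma * P v
print(np.allclose(v, R + gamma * P @ v))
```

By hand: $v(S_3) = 0$, $v(S_2) = -10 + 0.9 \cdot 0 = -10$, and $v(S_1) = -10 + 0.9 \cdot (-10) = -19$.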

## Markov Decision Process

The Markov decision process is the same as the Markov reward process but with the introduction of actions into the picture. You have the following elements in a Markov decision process:
1. $S$ is a finite set of states
2. $A$ is a finite set of actions
3. $P$ is a state transition probability matrix $P_{ss'}^a = P[S_{t+1} = s' | S_t = s, A_t = a]$
4. $R$ is a reward function, $R_s^a = E[R_{t+1} | S_t = s, A_t = a]$
5. $\gamma$ is a discount factor $\gamma \in [0,1]$
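
The elements above map naturally onto arrays. The following is a minimal sketch of one possible layout, assuming a hypothetical MDP with 2 states and 2 actions. The index order `P[a, s, s']` and `R[s, a]`, the numbers, and the helper `step` are all choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical MDP with 2 states and 2 actions
# P[a, s, s'] = P[S_{t+1} = s' | S_t = s, A_t = a]
P = np.array([
    [[0.9, 0.1],   # action 0, from state 0
     [0.0, 1.0]],  # action 0, from state 1
    [[0.2, 0.8],   # action 1, from state 0
     [0.0, 1.0]],  # action 1, from state 1
])
# R[s, a] = E[R_{t+1} | S_t = s, A_t = a]
R = np.array([[-1.0, -2.0],
              [ 0.0,  0.0]])
gamma = 0.9

def step(s, a):
    """Sample the next state and return it with the expected reward for (s, a)."""
    s_next = rng.choice(P.shape[2], p=P[a, s])
    return s_next, R[s, a]

print(P.sum(axis=2))  # each row of each P[a] sums to 1
print(step(0, 1))
```

Unlike the Markov reward process, the transition distribution and the reward now depend on which action is taken.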


### Policy
In a Markov decision process we're trying to determine the policy $\pi$, which is a distribution over actions given the current state:

$$
\pi (a|s) = P[A_t = a | S_t = s]
$$
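
Once a policy is fixed, the MDP reduces to a Markov reward process whose dynamics and rewards are averaged over the policy's action choices: $P^{\pi}_{ss'} = \sum_a \pi(a|s) P^a_{ss'}$ and $R^{\pi}_s = \sum_a \pi(a|s) R^a_s$. The sketch below illustrates this on a hypothetical 2-state, 2-action MDP with a uniform random policy. The numbers and array layout are assumptions for illustration.

```python
import numpy as np

# Hypothetical MDP: P[a, s, s'] and R[s, a]
P = np.array([[[0.9, 0.1], [0.0, 1.0]],
              [[0.2, 0.8], [0.0, 1.0]]])
R = np.array([[-1.0, -2.0],
              [ 0.0,  0.0]])
gamma = 0.9

# Uniform random policy: pi[s, a] = pi(a | s)
pi = np.full((2, 2), 0.5)

# Average the dynamics and rewards over the policy's action choices
P_pi = np.einsum("sa,ast->st", pi, P)
R_pi = np.sum(pi * R, axis=1)

# The induced MRP can be solved with the matrix form of the Bellman equation
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

print(P_pi)
print(R_pi)
print(v_pi)
```

Different policies induce different Markov reward processes, and therefore different value functions, on the same MDP.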


### The Value Function and Action-Value Function

![mdp_value_function](../images/mdp_value_function.png)
![mdp_action_value_function](../images/mdp_action_value_function.png)
![mdp_optimal_policy](../images/mdp_optimal_policy.png)
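
Assuming the figures above show the standard definitions, the state-value function is $v_\pi(s) = E_\pi[G_t | S_t = s]$ and the action-value function is $q_\pi(s, a) = E_\pi[G_t | S_t = s, A_t = a]$. Under that assumption, the sketch below computes both for a hypothetical 2-state, 2-action MDP under a uniform random policy and checks that they are consistent with each other. All numbers and the array layout are made up for this illustration.

```python
import numpy as np

# Hypothetical MDP: P[a, s, s'] and R[s, a]
P = np.array([[[0.9, 0.1], [0.0, 1.0]],
              [[0.2, 0.8], [0.0, 1.0]]])
R = np.array([[-1.0, -2.0],
              [ 0.0,  0.0]])
gamma = 0.9
pi = np.full((2, 2), 0.5)  # pi[s, a] = pi(a | s)

# State values under pi, from the MRP induced by the policy
P_pi = np.einsum("sa,ast->st", pi, P)
R_pi = np.sum(pi * R, axis=1)
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# q_pi(s, a) = R[s, a] + gamma * sum_s' P[a, s, s'] * v_pi(s')
q_pi = R + gamma * np.einsum("ast,t->sa", P, v_pi)

print(q_pi)
# v_pi(s) should equal the policy-weighted average of q_pi(s, a)
print(np.allclose(v_pi, np.sum(pi * q_pi, axis=1)))
```

The final check reflects the relationship $v_\pi(s) = \sum_a \pi(a|s) \, q_\pi(s, a)$.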

