# How to use the MDP class

This notebook briefly describes how to use the MDP class that is provided with the code.

**The notebook format is used only to give short examples; do not implement the algorithms inside a Jupyter notebook.**

In [1]:
import numpy as np
from rl_mdp.mdp.reward_function import RewardFunction
from rl_mdp.mdp.transition_function import TransitionFunction
from rl_mdp.policy.policy import Policy
from rl_mdp.mdp.mdp import MDP

Recall that environments can be modelled as a Markov Decision Process (MDP), which is defined as a tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$ where
* $\mathcal{S}$ is the set of states.
* $\mathcal{A}$ is the set of actions.
* $p : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0,1]$  is a transition function.
* $r : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is a reward function.
* $\gamma \in [0,1]$ is a discount factor.
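
To make the role of the discount factor $\gamma$ concrete, the following minimal sketch computes the discounted return $\sum_t \gamma^t r_t$ of a short, finite reward sequence. The helper `discounted_return` and the reward sequence are hypothetical illustrations and are not part of the `rl_mdp` package.

```python
import math


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence (illustrative helper)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Hypothetical rewards observed at time steps t = 0, 1, 2.
example_rewards = [-1.0, 5.0, 10.0]
g = discounted_return(example_rewards, gamma=0.9)

# -1 + 0.9 * 5 + 0.81 * 10 = 11.6 (up to floating-point rounding)
print(round(g, 6))
```

A smaller $\gamma$ shrinks the contribution of later rewards; with $\gamma = 0$ only the first reward counts.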

The `MDP` class can be used to implement a simple MDP with discrete state and action spaces.

For example, suppose we wanted to implement a simple MDP with 3 states and 2 actions:

In [2]:
states = [0, 1, 2]    # States and actions are each represented as a list of integers.
actions = [0, 1]

Now we must specify the reward function, which should give a reward for every state-action pair.

In [3]:
# Define rewards using a dictionary
rewards = {
    (0, 0): -1.0,           # state 0, action 0 gets reward -1.
    (0, 1): -1.0,
    (1, 0): 5.0,
    (1, 1): -1.0,
    (2, 0): -1.0,
    (2, 1): 10.0
}

# Create the RewardFunction object
reward_function = RewardFunction(rewards)

# Now call the reward function
print(reward_function(2, 1))           # This should return 10.0.

10.0
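
Because the reward function should cover every state-action pair, it can help to check the dictionary before wrapping it. The sketch below is one possible check using plain Python; `missing_pairs` is a hypothetical helper, not something provided by `rl_mdp`.

```python
from itertools import product

states = [0, 1, 2]
actions = [0, 1]
rewards = {
    (0, 0): -1.0, (0, 1): -1.0,
    (1, 0): 5.0,  (1, 1): -1.0,
    (2, 0): -1.0, (2, 1): 10.0,
}


def missing_pairs(states, actions, rewards):
    """Return the (state, action) pairs that have no entry in the rewards dict."""
    return [pair for pair in product(states, actions) if pair not in rewards]


# An empty list means every state-action pair has a reward.
print(missing_pairs(states, actions, rewards))
```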


Next, let's specify the transition function:

In [4]:
# Define transition probabilities using a dictionary
transitions = {
    (0, 0): np.array([0.7, 0.2, 0.1]),      # For state 0, action 0 we get the probability vector (0.7, 0.2, 0.1), giving the probability of transitioning to state 0, 1, 2 respectively.
    (0, 1): np.array([0.1, 0.8, 0.1]),
    (1, 0): np.array([0.4, 0.5, 0.1]),
    (1, 1): np.array([0.3, 0.3, 0.4]),
    (2, 0): np.array([0.9, 0.05, 0.05]),
    (2, 1): np.array([0.2, 0.2, 0.6])
}

# Create the TransitionFunction object
transition_function = TransitionFunction(transitions)

# Example usage: Get the full transition probabilities array for state 0 and action 1
print(transition_function(0, 1))  # Output: [0.1 0.8 0.1]
print(transition_function(0, 1)[2])  # Probability P(S'=2|S=0, A=1)

[0.1 0.8 0.1]
0.1
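
Each entry of the transition dictionary is a probability distribution over next states, so it should have one entry per state, contain no negative values, and sum to 1. The sketch below checks these properties. It uses plain lists to stay dependency-free, and `is_valid_distribution` is a hypothetical helper rather than part of `rl_mdp`.

```python
import math

num_states = 3
transitions = {
    (0, 0): [0.7, 0.2, 0.1],
    (0, 1): [0.1, 0.8, 0.1],
    (1, 0): [0.4, 0.5, 0.1],
    (1, 1): [0.3, 0.3, 0.4],
    (2, 0): [0.9, 0.05, 0.05],
    (2, 1): [0.2, 0.2, 0.6],
}


def is_valid_distribution(probs, num_states, tol=1e-9):
    """Check length, non-negativity, and that the probabilities sum to 1."""
    return (
        len(probs) == num_states
        and all(p >= 0.0 for p in probs)
        and math.isclose(sum(probs), 1.0, abs_tol=tol)
    )


# Collect the (state, action) keys whose vector is not a valid distribution.
invalid = [key for key, probs in transitions.items()
           if not is_valid_distribution(probs, num_states)]
print(invalid)  # An empty list means all vectors passed the check.
```

The tolerance is needed because sums such as `0.7 + 0.2 + 0.1` are not always exactly `1.0` in floating-point arithmetic.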


Now all you need to do is pass each component to the MDP class:

In [5]:
# Create the MDP object
mdp = MDP(states, actions, transition_function, reward_function, discount_factor=0.9)

In [6]:
# Example usage of MDP functionalities:

# Get transition probability of next state S'=1, given state S=0 and action A=1
print(f"Transition probability from state 0 to state 1 with action 1: {mdp.transition_prob(1, 0, 1)}")

# Get reward for taking action 1 in state 2
print(f"Reward for taking action 1 in state 2: {mdp.reward(2, 1)}")

# Get list of states
print(f"States: {mdp.states}")

# Get list of actions
print(f"Actions: {mdp.actions}")

# Get discount factor
print(f"Discount Factor: {mdp.discount_factor}")

Transition probability from state 0 to state 1 with action 1: 0.8
Reward for taking action 1 in state 2: 10.0
States: [0, 1, 2]
Actions: [0, 1]
Discount Factor: 0.9


For the assignment you will need to implement the `BellmanEquationSolver` and `DynamicProgrammingSolver`, which should take an MDP as an argument to their constructors and offer methods that return a policy.

## The Policy class

The `Policy` class implements a basic stochastic policy which can give probabilities for each action given a state. You can also use it to implement a deterministic policy by passing an array which maps each state to an action.

In [7]:
num_actions = 3
policy_mapping = np.array([0, 2, 1])        # S=0 -> A=0, S=1 -> A=2, S=2 -> A=1

policy = Policy(policy_mapping=policy_mapping, num_actions=num_actions)

# Policy mapping implies deterministic policy so only one action gets probability 1.
print(f"Probability of action 0 given state 0: {policy.action_prob(0, 0)}")
print(f"Probability of action 1 given state 0: {policy.action_prob(0, 1)}")
print(f"Probability of action 2 given state 0: {policy.action_prob(0, 2)}")
print("____")

# You can also manually set the action probabilities for each state.
policy.set_action_probabilities(1, [0.1, 0.1, 0.8])
print(f"Probability of action 1 given state 1: {policy.action_prob(1, 1)}")
# Output: Probability of action 1 given state 1: 0.1

Probability of action 0 given state 0: 1.0
Probability of action 1 given state 0: 0.0
Probability of action 2 given state 0: 0.0
____
Probability of action 1 given state 1: 0.1
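
One way to picture the relationship between a deterministic mapping and action probabilities is as a table with one row per state, where a deterministic policy puts probability 1 on a single action. The sketch below illustrates that idea with plain lists. It is an assumption about the concept only, not the actual implementation of `Policy`, and `one_hot_policy` is a hypothetical helper.

```python
def one_hot_policy(policy_mapping, num_actions):
    """Expand a state -> action mapping into per-state probability rows."""
    table = []
    for action in policy_mapping:
        row = [0.0] * num_actions
        row[action] = 1.0  # The chosen action gets all the probability mass.
        table.append(row)
    return table


table = one_hot_policy([0, 2, 1], num_actions=3)
print(table[0])     # [1.0, 0.0, 0.0]

# Overwriting a row makes the policy stochastic in that state.
table[1] = [0.1, 0.1, 0.8]
print(table[1][1])  # 0.1
```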
