# <FONT COLOR="red">***MARKOV DECISION PROCESS (MDP)***</FONT>
---
---

**R**einforcement **L**earning (**RL**) relies on mathematical models of the environment. The **M**arkov **D**ecision **P**rocess (**MDP**) is a mathematical model widely used in **RL**. Essentially, an **MDP** describes an environment in which an agent makes sequential decisions to maximize a cumulative reward over time.
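
A common way to make "cumulative reward" precise is the discounted return. The formula below is a standard formulation rather than something defined in this notebook; the symbols $G_t$, $r_{t+k+1}$ and the discount factor $\gamma \in [0, 1)$ are assumed notation:

$$G_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1}$$

With $\gamma = 0.9$ and a constant reward of 1 at every step, for example, $G_t = 1 / (1 - 0.9) = 10$.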

## <FONT COLOR="orange">**EXERCISE 1**</FONT>
---
---

Define a basic MDP with three states (A, B, C), two possible actions in each state (Up, Down), a random transition function, and random rewards associated with each action in each state.

In [1]:
# IMPORT COMMON LIBRARIES
import numpy as np

In [4]:
# STATE, ACTIONS, AND REWARD DEFINITION
states = ['A','B','C']
actions = ['Up','Down']
rewards = np.random.randint(0,10,size=(len(states),len(actions)))

In [3]:
# RANDOM TRANSITION FUNCTION
def random_transition():
  return np.random.choice(states)

In [7]:
# TEST FUNCTION
actual_state = np.random.choice(states)
action = np.random.choice(actions)
new_state = random_transition()
reward = rewards[states.index(actual_state),actions.index(action)]

# DISPLAY DATA
print(f'Actual state: {actual_state}')
print(f'Action: {action}')
print(f'New state: {new_state}')
print(f'Reward: {reward}')

Actual state: A
Action: Up
New state: C
Reward: 9
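
The single step above extends naturally to a multi-step rollout. The sketch below is one possible way to do it and is not part of the original exercise: `run_episode`, `n_steps`, and `gamma` are hypothetical names, the discount factor is an assumption, and a seeded `np.random.default_rng` is used only to make runs repeatable.

```python
import numpy as np

states = ['A', 'B', 'C']
actions = ['Up', 'Down']

def run_episode(rewards, n_steps=5, gamma=0.9, rng=None):
    """Hypothetical helper: take n_steps random steps, return (trajectory, discounted return)."""
    rng = np.random.default_rng() if rng is None else rng
    state = str(rng.choice(states))
    trajectory, total = [], 0.0
    for t in range(n_steps):
        action = str(rng.choice(actions))
        reward = int(rewards[states.index(state), actions.index(action)])
        next_state = str(rng.choice(states))  # uniform random transition, as in Exercise 1
        trajectory.append((state, action, reward, next_state))
        total += (gamma ** t) * reward  # discount later rewards (assumed gamma)
        state = next_state
    return trajectory, total

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
rewards = rng.integers(0, 10, size=(len(states), len(actions)))
trajectory, total = run_episode(rewards, rng=rng)
for step in trajectory:
    print(step)
print(f'Discounted return: {total:.2f}')
```

Because the transitions here ignore both the state and the action, the rollout is purely random; it only illustrates how per-step rewards accumulate.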


## <FONT COLOR="orange">**EXERCISE 2**</FONT>
---
---

Write a function to calculate the value function of a state given an MDP, using the value iteration algorithm.
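
Value iteration is usually stated as a repeated Bellman optimality update. As a sketch in assumed notation ($P$ for the transition probabilities, $R$ for the rewards, $V_k$ for the value estimate after $k$ sweeps, and $\gamma$ for the discount factor):

$$V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma V_k(s')\bigr]$$

The sweeps stop once $\max_s |V_{k+1}(s) - V_k(s)| < \theta$ for a small threshold $\theta$.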

In [8]:
# VALUE FUNCTION (VALUE ITERATION)
def state_value(mdp, gamma=0.9, theta=0.01):
  # START FROM A VALUE OF 0 FOR EVERY STATE
  values = {state: 0.0 for state in mdp.states}
  while True:
    delta = 0.0
    for state in mdp.states:
      prev_value = values[state]
      # BELLMAN OPTIMALITY UPDATE: BEST EXPECTED RETURN OVER ALL ACTIONS
      values[state] = max(
        sum(
          mdp.transitions[state][action][new_state] * (
            mdp.rewards[state][action][new_state] + gamma * values[new_state]
          ) for new_state in mdp.states
        ) for action in mdp.actions
      )
      delta = max(delta, abs(prev_value - values[state]))
    # STOP ONCE THE LARGEST CHANGE IN A FULL SWEEP IS BELOW THETA
    if delta < theta:
      break
  return values

In [26]:
# MDP CLASS
class MDP:
  # CONSTRUCTOR CLASS
  def __init__(self, states, actions, transitions, rewards):
    self.states = states
    self.actions = actions
    self.transitions = transitions
    self.rewards = {}

    # NOTE: THE `rewards` ARGUMENT IS NOT STORED; EVERY TRANSITION GETS A DEFAULT REWARD OF 0
    for state in states:
      self.rewards[state] = {}
      for action in actions:
        self.rewards[state][action] = {}
        for next_state in states:
          # Replace 0 with rewards[state][action][next_state] to use the supplied rewards
          self.rewards[state][action][next_state] = 0  # Default reward of 0

In [27]:
# DEFINE STATES, ACTIONS, TRANSITIONS, AND REWARDS
states = ['A', 'B', 'C']
actions = ['Up', 'Down']
transitions = {
  'A': {'Up': {'A': 0.5, 'B': 0.5, 'C': 0.0}, 'Down': {'A': 0.0, 'B': 0.8, 'C': 0.2}},
  'B': {'Up': {'A': 0.2, 'B': 0.3, 'C': 0.5}, 'Down': {'A': 0.1, 'B': 0.6, 'C': 0.3}},
  'C': {'Up': {'A': 0.7, 'B': 0.1, 'C': 0.2}, 'Down': {'A': 0.3, 'B': 0.3, 'C': 0.4}}
}
rewards = {
  'A': {'Up': {'A': 1, 'B': 2, 'C': 3}, 'Down': {'A': 4, 'B': 5, 'C': 6}},
  'B': {'Up': {'A': 7, 'B': 8, 'C': 9}, 'Down': {'A': 10, 'B': 11, 'C': 12}},
  'C': {'Up': {'A': 13, 'B': 14, 'C': 15}, 'Down': {'A': 16, 'B': 17, 'C': 18}}
}

# CREATE AN INSTANCE OF THE MDP CLASS
mdp = MDP(states, actions, transitions, rewards)

In [18]:
# TEST STATE VALUES
state_values = state_value(mdp)
print(f'Value of the states: {state_values}')

Value of the states: {'A': 0.0, 'B': 0.0, 'C': 0.0}
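
To sanity-check a value iteration routine, it can help to run it on an MDP small enough to solve by hand. The sketch below is self-contained and uses hypothetical names (`value_iteration`, `toy_states`, `toy_actions`, `toy_transitions`, `toy_rewards`) that are not part of the notebook; it assumes the same nested-dictionary layout as `transitions` and `rewards` above.

```python
def value_iteration(states, actions, transitions, rewards, gamma=0.9, theta=1e-6):
    """Hypothetical helper: in-place value iteration over nested-dict tables."""
    values = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            old = values[s]
            # Best expected one-step return over all actions
            values[s] = max(
                sum(transitions[s][a][s2] * (rewards[s][a][s2] + gamma * values[s2])
                    for s2 in states)
                for a in actions
            )
            delta = max(delta, abs(old - values[s]))
        # Stop after a full sweep in which no value moved by theta or more
        if delta < theta:
            return values

# One state that loops back to itself; 'high' pays 2 per step, 'low' pays 1.
toy_states = ['S']
toy_actions = ['low', 'high']
toy_transitions = {'S': {'low': {'S': 1.0}, 'high': {'S': 1.0}}}
toy_rewards = {'S': {'low': {'S': 1}, 'high': {'S': 2}}}

toy_values = value_iteration(toy_states, toy_actions, toy_transitions, toy_rewards)
print(toy_values)  # close to {'S': 20.0}, since 2 / (1 - 0.9) = 20
```

Always choosing `high` earns 2 per step forever, so the optimal value solves $V = 2 + 0.9V$, which gives 20. A routine that sums over actions instead of taking the maximum would not reproduce this number.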


## <FONT COLOR="orange">**EXERCISE 3**</FONT>
---
---

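Write a function to check whether a given MDP satisfies the Markov property. Because the transition probabilities here depend only on the current state and action, the check reduces to verifying that the probabilities for each state-action pair sum to 1.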

In [19]:
# VERIFY MDP
def verify_MDP_property(mdp):
  for state in mdp.states:
    for action in mdp.actions:
      # PROBABILITIES OF ALL NEXT STATES MUST SUM TO 1
      sum_probability = sum(mdp.transitions[state][action].values())
      if not np.isclose(sum_probability, 1):
        return False
  # EVERY STATE-ACTION PAIR PASSED THE CHECK
  return True

In [20]:
# TEST IMPLEMENTATION
print(f'Satisfies the Markov property: {verify_MDP_property(mdp)}')

Satisfies the Markov property: True
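
A sum-to-one test alone would accept a row such as `{'A': 1.5, 'B': -0.5}`. The sketch below is a slightly stricter variant that also rejects negative probabilities. `is_valid_transition_model` and the three example tables are hypothetical, and the function takes the nested transition dictionary directly instead of an `MDP` instance so that it stays self-contained.

```python
import numpy as np

def is_valid_transition_model(transitions):
    """Hypothetical helper: True if every P(. | state, action) is a distribution."""
    for state, by_action in transitions.items():
        for action, dist in by_action.items():
            probs = np.array(list(dist.values()), dtype=float)
            # Each probability must be non-negative and the row must sum to 1
            if np.any(probs < 0) or not np.isclose(probs.sum(), 1.0):
                return False
    return True

good = {'A': {'Up': {'A': 0.5, 'B': 0.5}}, 'B': {'Up': {'A': 1.0, 'B': 0.0}}}
bad_sum = {'A': {'Up': {'A': 0.5, 'B': 0.4}}}
bad_sign = {'A': {'Up': {'A': 1.5, 'B': -0.5}}}

print(is_valid_transition_model(good))      # True
print(is_valid_transition_model(bad_sum))   # False
print(is_valid_transition_model(bad_sign))  # False
```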


## <FONT COLOR="orange">**EXERCISE 4**</FONT>
---
---

Write a function to calculate the average reward per action in an MDP.

In [21]:
# REWARD
def average_reward(mdp):
  total_reward = 0
  total_actions = 0
  for state in mdp.states:
    for action in mdp.actions:
      for new_state in mdp.states:
        # ADD UP EVERY (STATE, ACTION, NEW STATE) REWARD ENTRY
        total_reward += mdp.rewards[state][action][new_state]
        total_actions += 1
  # DIVIDE ONLY AFTER ALL ENTRIES HAVE BEEN VISITED
  return total_reward / total_actions

In [28]:
# TEST IMPLEMENTATION
print(f'Average reward per action {average_reward(mdp)}')

Average reward per action 0.0
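
The function above weights every (state, action, next state) entry equally. An alternative, sketched here under the assumption that rewards should be weighted by how likely each transition is, computes the expected immediate reward for each state-action pair. `expected_rewards` is a hypothetical helper, and the two tables are copied from Exercise 2.

```python
transitions = {
    'A': {'Up': {'A': 0.5, 'B': 0.5, 'C': 0.0}, 'Down': {'A': 0.0, 'B': 0.8, 'C': 0.2}},
    'B': {'Up': {'A': 0.2, 'B': 0.3, 'C': 0.5}, 'Down': {'A': 0.1, 'B': 0.6, 'C': 0.3}},
    'C': {'Up': {'A': 0.7, 'B': 0.1, 'C': 0.2}, 'Down': {'A': 0.3, 'B': 0.3, 'C': 0.4}},
}
rewards = {
    'A': {'Up': {'A': 1, 'B': 2, 'C': 3}, 'Down': {'A': 4, 'B': 5, 'C': 6}},
    'B': {'Up': {'A': 7, 'B': 8, 'C': 9}, 'Down': {'A': 10, 'B': 11, 'C': 12}},
    'C': {'Up': {'A': 13, 'B': 14, 'C': 15}, 'Down': {'A': 16, 'B': 17, 'C': 18}},
}

def expected_rewards(transitions, rewards):
    """Hypothetical helper: expected immediate reward for each (state, action)."""
    return {
        (s, a): sum(p * rewards[s][a][s2] for s2, p in transitions[s][a].items())
        for s in transitions for a in transitions[s]
    }

expected = expected_rewards(transitions, rewards)
print(round(expected[('A', 'Up')], 2))    # 1.5
print(round(expected[('A', 'Down')], 2))  # 5.2
```

For `('A', 'Up')` the calculation is 0.5 * 1 + 0.5 * 2 + 0.0 * 3 = 1.5, so the reward of 3 for reaching `C` contributes nothing because that transition never happens.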
