### Markov Decision Process
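An MDP is a Markov reward process with decisions.
- It is a tuple $(S, A, P, R, \gamma)$.
- $S$ is a finite set of states.
- $A$ is a finite set of actions.
- $P$ is the state transition probability, $P^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$.
- $R$ is the reward function, which now depends on the action taken: $R^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$.
- $\gamma \in [0, 1]$ is a discount factor.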

In [1]:
import numpy as np 

In [None]:
# Action labels (index -> name); the cells below collapse these into "study" and "play"
actions = {0: 'study', 1: 'pub', 2: 'facebook', 3: 'quit', 4: 'sleep'} 

### Policy 
A policy defines the behavior of an agent within an MDP.
Definition: A policy $\pi$ is a distribution over actions given states, $\pi(a|s) = \mathbb{P}[A_t = a \mid S_t = s]$. For any given state $s$, the policy specifies the probability of taking each possible action $a$.

Characteristics of MDP policies:
- They depend only on the current state, not the entire history, which is consistent with the Markov property of the environment.
- They are stationary (time-independent): $A_t \sim \pi(\cdot|S_t)$ for all time steps $t > 0$.
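
As a minimal sketch of the definition above, a stationary policy over a finite MDP can be stored as a matrix whose row `s` is the distribution $\pi(\cdot|s)$. The names `pi`, `n_states`, `n_actions`, and `rng`, the two-action layout, and the seed are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np

# Assumed layout: 7 states (C1, C2, C3, Pass, Pub, FB, Sleep) and
# 2 aggregated actions (0 = "study", 1 = "play").
n_states, n_actions = 7, 2

# pi[s, a] = probability of taking action a in state s (uniform 50/50 here).
pi = np.full((n_states, n_actions), 1.0 / n_actions)

# Each row is a probability distribution over actions, so it must sum to 1.
assert np.allclose(pi.sum(axis=1), 1.0)

# Sample A_t ~ pi(.|S_t) for state C1 (index 0); the seed is arbitrary.
rng = np.random.default_rng(0)
state = 0
action = rng.choice(n_actions, p=pi[state])
print(f"sampled action index in state {state}: {action}")
```

Because the matrix does not depend on `t`, the same row is used every time the agent visits a state, which is what stationarity means here.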

In this example, we will use the 50/50 random policy from the slides, where $\pi(a|s) = 0.5$ for each of the two actions in every state.

Let's define the rewards and transitions for the two aggregated actions: "study" and "play" (pub/facebook).
We assume a simplified MDP based on the diagrams, with states indexed in the order C1, C2, C3, Pass, Pub, FB, Sleep.

In [2]:
# R_study[s]: reward for choosing "study" in state s
R_study = np.array([-2, -2, -2, 10, 0, 0, 0])  # R=-2 in each class, R=+10 in Pass

In [3]:
# P_study[s, s'] is the transition prob from s to s' if you "study"
P_study = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # C1 -> C2
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],  # C2 -> C3
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],  # C3 -> Pass
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Pass -> Sleep
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],  # Pub -> Pub (no study)
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],  # FB -> FB (no study)
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]   # Sleep -> Sleep
]) 

In [4]:
# R_other[s] is reward for playing/quitting
R_other = np.array([-1, -1, 1, 0, 1, -1, 0]) # R=-1 for FB, R=+1 for Pub 

# P_other[s, s'] is transition if you "play" (go to pub/facebook)
P_other = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],  # C1 -> FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # C2 -> Sleep
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],  # C3 -> Pub
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Pass -> Sleep
    [0.5, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0],  # Pub -> C1 or Pub (simplified)
    [0.5, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0],  # FB -> C1 or FB (simplified)
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]   # Sleep -> Sleep
]) 
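
Hand-typed transition matrices are easy to get wrong, so a quick sanity check can help: every row of $P$ should be a probability distribution over next states. The helper `is_row_stochastic` and its `tol` parameter are hypothetical names introduced for this sketch; `P_other` is copied from the cell above so the snippet runs on its own.

```python
import numpy as np

def is_row_stochastic(P, tol=1e-9):
    """Return True if P is square, non-negative, and every row sums to 1."""
    P = np.asarray(P, dtype=float)
    return (
        P.ndim == 2
        and P.shape[0] == P.shape[1]
        and bool(np.all(P >= 0.0))
        and bool(np.allclose(P.sum(axis=1), 1.0, atol=tol))
    )

# Copied from the notebook cell above (state order: C1, C2, C3, Pass, Pub, FB, Sleep).
P_other = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])

print(is_row_stochastic(P_other))
```

The same check can be applied to `P_study` and, later, to the policy-averaged matrix, since a convex combination of row-stochastic matrices is again row-stochastic.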

In every state, the policy chooses "study" with probability 0.5 and "play" with probability 0.5.

In [5]:
pi_prob = 0.5  # pi(study|s); pi(play|s) = 1 - pi_prob

In [6]:
# P_pi[s, s'] = pi(study|s)*P(s'|s, study) + pi(play|s)*P(s'|s, play)
P_pi = pi_prob * P_study + (1 - pi_prob) * P_other #policy-averaged transitions

# R_pi[s] = pi(study|s)*R(s, study) + pi(play|s)*R(s, play)
R_pi = pi_prob * R_study + (1 - pi_prob) * R_other #policy-averaged rewards
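
The averaging above uses a single scalar because the policy is the same in every state. A possible generalization, sketched here under the assumption that the policy is stored as a matrix `pi[s, a]`, the transitions as a stacked array `P[a, s, s']`, and the rewards as `R[s, a]`, computes $P_\pi$ and $R_\pi$ for any state-dependent policy. The function name `policy_average` and the tiny two-state example are hypothetical and not taken from the notebook.

```python
import numpy as np

def policy_average(pi, P, R):
    """Collapse an MDP into the MRP induced by policy pi.

    pi: (S, A) array, pi[s, a] = probability of action a in state s
    P:  (A, S, S) array, P[a, s, t] = probability of s -> t under action a
    R:  (S, A) array, R[s, a] = expected reward for action a in state s
    """
    P_pi = np.einsum("sa,ast->st", pi, P)  # sum_a pi(a|s) * P(t|s, a)
    R_pi = np.einsum("sa,sa->s", pi, R)    # sum_a pi(a|s) * R(s, a)
    return P_pi, R_pi

# Toy example (made up): action 0 stays put, action 1 switches state.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
])
R = np.array([[0.0, 1.0], [2.0, 3.0]])
pi = np.array([[0.5, 0.5], [0.25, 0.75]])

P_pi_toy, R_pi_toy = policy_average(pi, P, R)
print(P_pi_toy)
print(R_pi_toy)
```

With the notebook's arrays, `P = np.stack([P_study, P_other])`, `R = np.stack([R_study, R_other], axis=1)`, and `pi = np.full((7, 2), 0.5)` should give the same `P_pi` and `R_pi` as the scalar version.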

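Solve the resulting MRP $(S, P_{\pi}, R_{\pi}, \gamma)$ using the Bellman expectation equation in matrix form, $v_{\pi} = R_{\pi} + \gamma P_{\pi} v_{\pi}$, which has the direct solution $v_{\pi} = (I - \gamma P_{\pi})^{-1} R_{\pi}$.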

In [7]:
states = {0: 'C1', 1: 'C2', 2: 'C3', 3: 'Pass', 4: 'Pub', 5: 'FB', 6: 'Sleep'} 
gamma = 0.9  
# Solve (I - gamma * P_pi) v = R_pi directly instead of forming the inverse explicitly
v_pi = np.linalg.solve(np.eye(len(states)) - gamma * P_pi, R_pi)

print("\nCalculated State-Value Function for the 50/50 Policy (v_pi):")
for i in range(len(states)):
    print(f"  v_pi({states[i]}) = {v_pi[i]:.1f}")


Calculated State-Value Function for the 50/50 Policy (v_pi):
  v_pi(C1) = -3.8
  v_pi(C2) = -0.9
  v_pi(C3) = 1.3
  v_pi(Pass) = 5.0
  v_pi(Pub) = -1.1
  v_pi(FB) = -4.2
  v_pi(Sleep) = 0.0
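
As a cross-check on the direct solution, the same values can be approximated by iterative policy evaluation, which repeatedly applies the Bellman backup $v \leftarrow R_{\pi} + \gamma P_{\pi} v$ until the values stop changing. This is a sketch, not the notebook's method: the function name, `tol`, and `max_iters` are assumptions, and `P_pi` and `R_pi` are re-typed by hand as the 50/50 averages of the arrays defined earlier so that the snippet is self-contained.

```python
import numpy as np

def iterative_policy_evaluation(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Repeat v <- R + gamma * P @ v until the largest update is below tol."""
    v = np.zeros(len(R))
    for _ in range(max_iters):
        v_new = R + gamma * P @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v  # not converged within max_iters; returns the last iterate

# 50/50 averages of P_study/P_other and R_study/R_other, re-typed by hand.
# State order: C1, C2, C3, Pass, Pub, FB, Sleep.
P_pi = np.array([
    [0.00, 0.50, 0.00, 0.00, 0.00, 0.50, 0.00],
    [0.00, 0.00, 0.50, 0.00, 0.00, 0.00, 0.50],
    [0.00, 0.00, 0.00, 0.50, 0.50, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],
    [0.25, 0.00, 0.00, 0.00, 0.75, 0.00, 0.00],
    [0.25, 0.00, 0.00, 0.00, 0.00, 0.75, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],
])
R_pi = np.array([-1.5, -1.5, -0.5, 5.0, 0.5, -0.5, 0.0])
gamma = 0.9

v_iter = iterative_policy_evaluation(P_pi, R_pi, gamma)
v_direct = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)
print("max |v_iter - v_direct| =", np.max(np.abs(v_iter - v_direct)))
```

The iteration converges because the backup is a contraction for $\gamma < 1$. For a seven-state problem the direct solve is simpler, while the iterative form avoids solving a linear system and so tends to scale better to large state spaces.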
