#### Understanding Optimality
**Bellman Expectation Equation**:
Relates the value of a state (or state-action pair) to the expected value of its successor state (or state-action pair) under a given policy $\pi$. It measures how good it is to be in a state, or to take an action in a state, assuming the agent follows $\pi$ from then on.
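
Written out, the expectation equations take the following form. The symbols $\mathcal{R}_s^a$ (expected immediate reward for taking action $a$ in state $s$), $\mathcal{P}_{ss'}^a$ (probability of moving from $s$ to $s'$ under $a$), and $\gamma$ (discount factor) are notation assumed here for illustration:

$$
v_{\pi}(s) = \sum_{a} \pi(a \mid s)\, q_{\pi}(s, a), \qquad
q_{\pi}(s, a) = \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a\, v_{\pi}(s')
$$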

**Optimal Value Functions ($v_{*}$ and $q_{*}$)**
- The optimal state-value function, $v_{*}(s)$, is the maximum value achievable from state $s$ over all possible policies.
- The optimal action-value function, $q_{*}(s,a)$, is the maximum value achievable by starting in state $s$, taking action $a$, and following the optimal policy thereafter.
- Once the optimal value function is known, the MDP is considered "solved".
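
In symbols, with $v_{\pi}$ and $q_{\pi}$ denoting the value functions under a policy $\pi$, the two definitions above can be written as:

$$
v_{*}(s) = \max_{\pi} v_{\pi}(s), \qquad q_{*}(s, a) = \max_{\pi} q_{\pi}(s, a)
$$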

**Optimal Policy ($\pi_{*}$)**
- An optimal policy is at least as good as every other policy in every state: $v_{\pi_{*}}(s) \ge v_{\pi}(s)$ for all states $s$ and all policies $\pi$.
- If you know the optimal action-value function $q_{*}(s,a)$, you can read off an optimal policy by choosing, in each state, the action that maximizes $q_{*}(s,a)$.
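
As a minimal sketch of that last point, the snippet below extracts a greedy policy from an action-value table with `np.argmax`. The table `q_star` is a made-up example, not the student MDP used later in this notebook:

```python
import numpy as np

# Hypothetical q* table for illustration: rows are states, columns are actions.
q_star = np.array([
    [1.0, 0.5],   # state 0: action 0 has the higher value
    [0.2, 0.9],   # state 1: action 1 has the higher value
    [0.0, 0.0],   # state 2: tie, np.argmax returns the first maximizing index
])

# Greedy policy: in each state, pick the action with the largest q-value.
greedy_policy = np.argmax(q_star, axis=1)
print(greedy_policy)  # [0 1 0]
```

When several actions tie, any of them is optimal; `np.argmax` simply picks the lowest index.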

**Bellman Optimality Equation**
- This equation gives the recursive definition of the optimal value function. Unlike the expectation equation, it does not average over the actions a policy would choose. Instead, it uses a **max** operator to select the best action at each step.

- Because of the max, the equation is non-linear and has no general closed-form solution, unlike the linear Bellman equation for an MRP, which can be solved directly. It is solved with iterative methods such as Value Iteration or Policy Iteration.
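
For reference, the Bellman Optimality Equation described above can be written as follows, where $\mathcal{R}_s^a$ is the expected immediate reward, $\mathcal{P}_{ss'}^a$ the transition probability, and $\gamma$ the discount factor (notation assumed here for illustration):

$$
v_{*}(s) = \max_{a} q_{*}(s, a), \qquad
q_{*}(s, a) = \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a\, v_{*}(s')
$$

Substituting the second expression into the first gives the recursive form in $v_{*}$ alone:

$$
v_{*}(s) = \max_{a} \left[ \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a\, v_{*}(s') \right]
$$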

#### Implement Value Iteration 
Use the Bellman Optimality Equation to find $v_{*}$. Value Iteration starts from an arbitrary estimate of the value function (zeros here) and repeatedly applies the equation as an update rule until the estimate stops changing, up to a small threshold.

In [2]:
import numpy as np 
states = {0: 'C1', 1: 'C2', 2: 'C3', 3: 'Pass', 4: 'Pub', 5: 'FB', 6: 'Sleep'} 
# Initialize the value function to zeros
v = np.zeros(len(states))

# Set a small threshold for convergence
threshold = 1e-6

#### Value iteration loop
- Calculate the action-values $q(s,a)$ for every state $s$ and every available action $a$. The formula for $q(s,a)$ is the expression inside the max of the Bellman Optimality Equation: the immediate reward plus the discounted expected value of the next state.
- Set the new value of each state, `v[s]`, to the maximum of its action-values.
- Check whether the value function has converged, i.e. whether the largest change in any state's value is below the threshold. If it has, break the loop; otherwise, continue with the updated `v`.
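
The three steps above can also be sketched in vectorized form. The helper `bellman_backup`, the stacked array layout, and the two-state toy MDP below are assumptions made for illustration; they are not the student MDP defined in the following cells:

```python
import numpy as np

def bellman_backup(v, R, P, gamma):
    """One synchronous Bellman optimality backup.

    R has shape (n_actions, n_states); P has shape (n_actions, n_states, n_states).
    Returns the updated state values and the action-value table q[a, s].
    """
    q = R + gamma * (P @ v)   # step 1: action-values for every (a, s)
    return q.max(axis=0), q   # step 2: max over actions for each state

# Toy MDP (hypothetical): state 1 is absorbing with zero reward.
R = np.array([[0.0, 0.0],     # action 0: stay, no reward
              [1.0, 0.0]])    # action 1: reward 1 when taken in state 0
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: remain in the current state
    [[0.0, 1.0], [0.0, 1.0]],  # action 1: move to state 1
])

v = np.zeros(2)
for _ in range(1000):          # iteration cap as a safety net
    v_new, q = bellman_backup(v, R, P, gamma=0.9)
    converged = np.max(np.abs(v_new - v)) < 1e-6   # step 3: convergence check
    v = v_new
    if converged:
        break

print(v)
```

Here every state is updated from the same previous estimate, and the iteration cap guards against a loop that never meets the threshold.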

In [3]:
actions = {
    0: 'study', 1: 'other' # 'other' represents pub/facebook/quit etc.
} 

In [4]:
gamma = 0.9 

In [5]:
# Immediate rewards for entering each state in the MRP 
R_mrp = np.array([-2, -2, -2, 10, 1, -1, 0]) 

In [6]:
# Transition matrix for the MRP, P[i, j] is P(S'=j | S=i) 
P_mrp = np.array([
    # C1   C2   C3   Pass  Pub   FB    Sleep
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],  # From C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],  # From C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],  # From C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # From Pass (Terminal -> Sleep)
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],  # From Pub
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],  # From FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]   # From Sleep (Terminal)
]) 
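
Because the MRP Bellman equation is linear, $v = R + \gamma P v$, it can be solved directly as $(I - \gamma P)\,v = R$. The sketch below does this with `np.linalg.solve`, under the assumption that `R_mrp[s]` is the expected immediate reward collected in state `s`; the arrays are repeated so the block runs on its own:

```python
import numpy as np

gamma = 0.9  # same discount factor as the rest of the notebook

# MRP rewards and transitions, copied from the cells above.
# Assumption: R_mrp[s] is the expected immediate reward collected in state s.
R_mrp = np.array([-2, -2, -2, 10, 1, -1, 0], dtype=float)
P_mrp = np.array([
    # C1   C2   C3   Pass  Pub   FB    Sleep
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],  # From C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],  # From C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],  # From C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # From Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],  # From Pub
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],  # From FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # From Sleep
])

# Solve (I - gamma * P) v = R instead of forming the inverse explicitly.
n_states = len(R_mrp)
v_mrp = np.linalg.solve(np.eye(n_states) - gamma * P_mrp, R_mrp)

# Sanity check: the solution should satisfy v = R + gamma * P v.
assert np.allclose(v_mrp, R_mrp + gamma * P_mrp @ v_mrp)
print(np.round(v_mrp, 2))
```

This direct solve is only available for the MRP; once actions and the max operator enter, the iterative approach below is needed.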

In [7]:
# Rewards for taking the 'study' action in each state
R_study = np.array([-2, -2, -2, 10, 0, 0, 0]) 

# Transition matrix for the 'study' action, P(s'|s, a='study')
P_study = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # C1 -> C2
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],  # C2 -> C3
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],  # C3 -> Pass
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Pass -> Sleep
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],  # Pub (no study) -> Pub
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],  # FB (no study) -> FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]   # Sleep -> Sleep
])
 

In [8]:
# Rewards for taking the 'other' action (play, quit, etc.)
R_other = np.array([-1, -2, 1, 0, 1, -1, 0]) # C2->Sleep reward is -2, not 0

# Transition matrix for the 'other' action, P(s'|s, a='other')
P_other = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],  # C1 -> FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # C2 -> Sleep
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],  # C3 -> Pub
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Pass -> Sleep
    [0.5, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0],  # Pub -> C1 or Pub
    [0.5, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0],  # FB -> C1 or FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]   # Sleep -> Sleep
]) 

In [15]:
while True:
    # Snapshot the current estimate so every update in this sweep reads the same values
    v_old = v.copy()

    # --- Main Bellman Update Loop ---
    for s in range(len(states)):
        # Calculate the value of taking each action from state s
        # We have two main actions: 'study' and 'other' (play)

        # q(s, 'study')
        q_study = R_study[s] + gamma * (P_study[s] @ v_old)

        # q(s, 'other')
        q_other = R_other[s] + gamma * (P_other[s] @ v_old)

        # The new value for state s is the max over the action-values
        v[s] = max(q_study, q_other)

    # Check for convergence
    delta = np.max(np.abs(v - v_old))
    if delta < threshold:
        print("Value function converged!")
        break

print("\nOptimal State-Value Function (v*):")
for i in range(len(states)):
    print(f"  v*({states[i]}) = {v[i]:.1f}")

Value function converged!

Optimal State-Value Function (v*):
  v*(C1) = 1.9
  v*(C2) = 4.3
  v*(C3) = 7.0
  v*(Pass) = 10.0
  v*(Pub) = 3.3
  v*(FB) = 0.0
  v*(Sleep) = 0.0


In [16]:
# Create an empty policy array
optimal_policy = np.zeros(len(states), dtype=int) 
for s in range(len(states)):
    # Recalculate the q-values using the final optimal v*
    q_study = R_study[s] + gamma * (P_study[s] @ v)
    q_other = R_other[s] + gamma * (P_other[s] @ v)

    # The optimal action is the one with the higher q-value.
    # Ties (e.g. in the absorbing Sleep state, where both q-values are 0) fall through to 'other'.
    if q_study > q_other:
        optimal_policy[s] = 0  # 0 corresponds to 'study'
    else:
        optimal_policy[s] = 1  # 1 corresponds to 'other'

print("\nOptimal Policy (pi*):")
for i in range(len(states)):
    action = 'study' if optimal_policy[i] == 0 else 'other'
    print(f"  In state {states[i]}, take action: {action}")



Optimal Policy (pi*):
  In state C1, take action: study
  In state C2, take action: study
  In state C3, take action: study
  In state Pass, take action: study
  In state Pub, take action: other
  In state FB, take action: study
  In state Sleep, take action: other
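
One way to sanity-check the result is to evaluate the printed policy exactly and compare it with the values found by Value Iteration. Fixing a policy turns the MDP into an MRP, so its value function can be obtained with a linear solve. The sketch below is a hedged illustration: the names `P_all`, `R_all`, `P_pi`, and `R_pi` are introduced here, and `policy` is typed in from the output above:

```python
import numpy as np

gamma = 0.9

# Rewards and transitions per action, copied from the cells above.
R_study = np.array([-2, -2, -2, 10, 0, 0, 0], dtype=float)
R_other = np.array([-1, -2, 1, 0, 1, -1, 0], dtype=float)
P_study = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])
P_other = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])

# Policy from the output above: 0 = 'study', 1 = 'other'.
policy = np.array([0, 0, 0, 0, 1, 0, 1])

# Select, for each state, the reward and transition row of the chosen action.
R_all = np.stack([R_study, R_other])   # shape (2, 7)
P_all = np.stack([P_study, P_other])   # shape (2, 7, 7)
idx = np.arange(len(policy))
R_pi = R_all[policy, idx]              # shape (7,)
P_pi = P_all[policy, idx]              # shape (7, 7)

# Exact policy evaluation: solve (I - gamma * P_pi) v = R_pi.
v_pi = np.linalg.solve(np.eye(len(policy)) - gamma * P_pi, R_pi)
print(np.round(v_pi, 1))
```

If the policy is optimal, `v_pi` should agree with the converged `v` from Value Iteration up to the convergence threshold.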
