## Simulating an MDP with Value function and Q-function

In [1]:
import numpy as np

In [2]:
# Define MDP components

states = [0, 1, 2]  # S
actions = [0, 1]    # A

In [3]:
# Transition probabilities: T[s][a][s'] = P(s' | s, a)

T = {
    0: {
        0: [0.8, 0.2, 0.0], # From state 0, action 0 -> mostly to 0 and sometimes to 1
        1: [0.0, 1.0, 0.0] # From state 0, action 1 -> always to 1
    },
    1: {
        0: [0.0, 0.0, 1.0], # From state 1, action 0 -> always to 2 (terminal)
        1: [0.5, 0.5, 0.0], # From state 1, action 1 -> go to 0 or 1
    },
    2: {
        0: [0.0, 0.0, 1.0], # Terminal state: stays in 2
        1: [0.0, 0.0, 1.0]
    }
}

# Rewards: R[s][a] = expected reward for taking action a in state s
R = {
    0: {0: 5, 1:10},
    1: {0: -10, 1: 0},
    2: {0: 0, 1:0}
}

gamma = 0.9 # Discount factor
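
Before running value iteration, it can help to confirm that the transition table is a valid probability model. The sketch below is an optional sanity check, not part of the original notebook: it rewrites the same numbers as an array (the name `T_arr` is an assumption introduced here) and checks that every (s, a) row is non-negative and sums to 1.

```python
import numpy as np

# Same transition numbers as the T dict, as an array of shape (S, A, S').
# T_arr is a hypothetical name used only for this check.
T_arr = np.array([
    [[0.8, 0.2, 0.0], [0.0, 1.0, 0.0]],  # state 0: action 0, action 1
    [[0.0, 0.0, 1.0], [0.5, 0.5, 0.0]],  # state 1: action 0, action 1
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],  # state 2 (terminal): both actions stay
])

# Each (s, a) row must be a probability distribution over next states s'.
row_sums = T_arr.sum(axis=2)
assert np.all(T_arr >= 0.0), "negative transition probability"
assert np.allclose(row_sums, 1.0), f"rows do not sum to 1: {row_sums}"
print(row_sums)
```

If a row were mistyped (for example 0.8 and 0.3), the second assertion would fail and print the offending sums.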

In [7]:

# Initialize value and Q functions

V = np.zeros(len(states))
Q = np.zeros((len(states), len(actions)))

# Value Iteration (for V and Q)
for iteration in range(10): 
    V_new = np.zeros_like(V)
    Q_new = np.zeros_like(Q)

    for s in states:
        for a in actions:
            expected_value = 0
            for s_prime in states:
                expected_value += T[s][a][s_prime] * (R[s][a] + gamma * V[s_prime])
            Q_new[s, a] = expected_value

        # Bellman optimality update: V(s) = max over actions of Q(s, a)
        V_new[s] = np.max(Q_new[s])
    
    V = V_new
    Q = Q_new

    print(f'Iteration {iteration + 1}')
    print('V:', V)
    print('Q:\n', Q)
    print('-' * 40)

Iteration 1
V: [10.  0.  0.]
Q:
 [[  5.  10.]
 [-10.   0.]
 [  0.   0.]]
----------------------------------------
Iteration 2
V: [12.2  4.5  0. ]
Q:
 [[ 12.2  10. ]
 [-10.    4.5]
 [  0.    0. ]]
----------------------------------------
Iteration 3
V: [14.594  7.515  0.   ]
Q:
 [[ 14.594  14.05 ]
 [-10.      7.515]
 [  0.      0.   ]]
----------------------------------------
Iteration 4
V: [16.86038  9.94905  0.     ]
Q:
 [[ 16.86038  16.7635 ]
 [-10.        9.94905]
 [  0.        0.     ]]
----------------------------------------
Iteration 5
V: [18.954145  12.0642435  0.       ]
Q:
 [[ 18.9303026  18.954145 ]
 [-10.         12.0642435]
 [  0.          0.       ]]
----------------------------------------
Iteration 6
V: [20.85781915 13.95827483  0.        ]
Q:
 [[ 20.81854823  20.85781915]
 [-10.          13.95827483]
 [  0.           0.        ]]
----------------------------------------
Iteration 7
V: [22.56244734 15.66724229  0.        ]
Q:
 [[ 22.53011926  22.56244734]
 [-10.        

This notebook simulates a Markov Decision Process (MDP) with:

	•	3 states: s0, s1, s2 (s2 is terminal)
	•	2 actions per state: a0, a1
	
	The model is defined by:
	
	•	Transition probabilities T(s, a, s') = P(s' | s, a)
	•	Expected rewards R(s, a)
	•	Discount factor γ = 0.9
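
To make these components concrete, one episode can be sampled directly from T and R. This is a minimal sketch under stated assumptions: the helper `rollout`, the constant `TERMINAL`, the fixed policy, and the `max_steps` cap are all hypothetical names introduced here, and state 2 is treated as the end of an episode.

```python
import numpy as np

T = {0: {0: [0.8, 0.2, 0.0], 1: [0.0, 1.0, 0.0]},
     1: {0: [0.0, 0.0, 1.0], 1: [0.5, 0.5, 0.0]},
     2: {0: [0.0, 0.0, 1.0], 1: [0.0, 0.0, 1.0]}}
R = {0: {0: 5, 1: 10}, 1: {0: -10, 1: 0}, 2: {0: 0, 1: 0}}
gamma = 0.9
TERMINAL = 2  # assumption: reaching state 2 ends the episode


def rollout(policy, start=0, max_steps=50, seed=0):
    """Sample one episode under a fixed policy; return (path, discounted return)."""
    rng = np.random.default_rng(seed)
    s, G, discount, path = start, 0.0, 1.0, [start]
    for _ in range(max_steps):
        if s == TERMINAL:
            break
        a = policy[s]
        G += discount * R[s][a]   # collect the discounted reward for (s, a)
        discount *= gamma
        # Draw the next state from P(. | s, a).
        s = int(rng.choice(len(T[s][a]), p=T[s][a]))
        path.append(s)
    return path, G


# Policy: action 1 in s0, action 0 in s1. Both transitions are deterministic here.
path, G = rollout({0: 1, 1: 0, 2: 0})
print(path, G)
```

With this particular policy the path is s0, s1, s2 and the return is 10 + 0.9 × (−10) = 1. A policy that uses action 1 in s1 would produce random paths instead, since that action moves to s0 or s1 with equal probability.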

Value Iteration Recap

The algorithm iteratively updates:

	•	V(s) → the value of being in state s
	•	Q(s, a) → the value of taking action a in state s

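Using the update rules (matching the code above, where the reward depends only on s and a):

$$Q_{k+1}(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a) + \gamma V_k(s') \right]$$

$$V_{k+1}(s) = \max_a Q_{k+1}(s, a)$$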
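
One sweep of these updates (an expectation over next states for Q, then a max over actions for V) can also be written without explicit loops. The sketch below assumes T and R are stored as NumPy arrays rather than the nested dicts used in the notebook; `bellman_backup` is a hypothetical helper name.

```python
import numpy as np

# T has shape (S, A, S'); R has shape (S, A). Same numbers as the notebook's dicts.
T = np.array([
    [[0.8, 0.2, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [0.5, 0.5, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
])
R = np.array([[5.0, 10.0], [-10.0, 0.0], [0.0, 0.0]])
gamma = 0.9


def bellman_backup(V):
    """One synchronous value-iteration sweep; returns (V_new, Q_new)."""
    # T @ V contracts the s' axis, giving the expected next-state value per (s, a).
    # Because each row of T sums to 1, R can be added outside the expectation.
    Q = R + gamma * (T @ V)
    return Q.max(axis=1), Q


V = np.zeros(3)
for _ in range(2):
    V, Q = bellman_backup(V)
print(V)
print(Q)
```

After two sweeps this reproduces the values printed under "Iteration 2" in the output: V = [12.2, 4.5, 0] with Q[0] = [12.2, 10].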

What my Output Shows

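	•	V[0] and V[1] increase with every iteration, moving toward the optimal state values; V[2] stays at 0 because s2 is terminal and earns no reward.
	•	Q(s, a) gives the value of taking action a in state s, then continuing with the current value estimate V. Q[1,0] stays at -10 because that action leads straight to the terminal state.
	
	After 10 iterations:

	•	V[0] ≈ 26.7 → the current estimate of the optimal expected return from s0. It is not yet converged: in the output shown, V[0] still rises by about 1.7 between iterations 6 and 7, and the fixed point for this MDP is about 37.9.
	•	Q[0]: comparing Q[0,0] with Q[0,1] gives the greedy action in s0. From iteration 5 onward Q[0,1] is slightly larger, so action 1 is preferred.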
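
To see where the values settle and which actions the Q-table recommends, the loop can be run until the largest change in V falls below a tolerance instead of stopping after a fixed 10 iterations. This is a sketch, not the notebook's original code: the array layout, the tolerance `tol`, and the iteration cap are assumptions chosen for illustration.

```python
import numpy as np

T = np.array([
    [[0.8, 0.2, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [0.5, 0.5, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
])  # shape (S, A, S')
R = np.array([[5.0, 10.0], [-10.0, 0.0], [0.0, 0.0]])  # shape (S, A)
gamma = 0.9
tol = 1e-8  # assumed stopping threshold on the max change in V

V = np.zeros(3)
for sweep in range(1, 10_000):
    Q = R + gamma * (T @ V)            # Q-update from the previous V
    V_new = Q.max(axis=1)              # V-update: best action per state
    delta = np.max(np.abs(V_new - V))  # largest change in this sweep
    V = V_new
    if delta < tol:
        break

# Greedy policy: the action with the highest Q-value in each state.
policy = Q.argmax(axis=1)
print("sweeps:", sweep)
print("V:", V)
print("greedy policy:", policy)
```

For this MDP the greedy policy picks action 1 in both s0 and s1, and solving the two resulting linear equations by hand gives V(s0) = 110/2.9 ≈ 37.9 and V(s1) = 90/2.9 ≈ 31.0, which the loop approaches. In s2 both actions tie at 0, so `argmax` simply returns the first one.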

