### The meaning of the equation

$$
p_{i,j}^{\pi} = \sum_a \Pr[A_t=a \mid S_t=i, \pi] \; \Pr[S_{t+1}=j \mid S_t=i, A_t=a]
$$

This means:
> The probability of going from state **i** to state **j** under policy **π** equals
> the sum over all possible actions **a** of:
>
> * The probability that the **policy** chooses action **a** in state **i**
> * times the probability that the **environment** moves to **j** if you took action **a** from **i**

So:
* **Policy choice:** `π[a|i]` = how likely the agent chooses action `a` in state `i`
* **Environment behavior:** `P[a][i][j]` = how likely the environment moves to `j` after doing action `a` in `i`
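
To make the sum concrete, here is a minimal sketch that computes a single entry $p_{i,j}^{\pi}$ with an explicit loop over actions. The array layout (`pi[i, a]`, `P[a, i, j]`), the function name `transition_prob`, and the numbers are assumptions chosen purely for illustration.

```python
import numpy as np

def transition_prob(pi, P, i, j):
    """Return p_{i,j}^pi = sum_a pi[i, a] * P[a, i, j]."""
    n_actions = P.shape[0]
    return sum(pi[i, a] * P[a, i, j] for a in range(n_actions))

# Hypothetical 2-state, 2-action example (numbers made up for illustration).
P_toy = np.array([
    [[0.9, 0.1], [0.4, 0.6]],   # action 0: rows = current state, cols = next state
    [[0.2, 0.8], [0.5, 0.5]],   # action 1
])
pi_toy = np.array([
    [0.5, 0.5],   # state 0: choose each action with probability 0.5
    [1.0, 0.0],   # state 1: always choose action 0
])

# Entry for i=0, j=1 is the policy-weighted mix 0.5 * 0.1 + 0.5 * 0.8.
p_01 = transition_prob(pi_toy, P_toy, i=0, j=1)
print(p_01)
```

In state 1 the toy policy is deterministic, so the sum collapses to the single term for action 0.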

---

### Step 2. In matrix form

We can collect all the $p_{i,j}^{\pi}$ into a single **matrix $P^\pi$** of size `[n_states × n_states]`.
Each entry represents the **probability of going from i → j** under that policy.
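
One way to read the matrix form is as a row-weighted mixture of the per-action transition matrices, $P^\pi = \sum_a \mathrm{diag}(\pi(a \mid \cdot))\, P^a$. The sketch below builds $P^\pi$ that way and checks that every row is still a probability distribution. The helper name `policy_transition_matrix`, the array layout, and the randomly generated inputs are assumptions for illustration only.

```python
import numpy as np

def policy_transition_matrix(pi, P):
    """Build P^pi[i, j] = sum_a pi[i, a] * P[a, i, j].

    pi: shape [n_states, n_actions], each row sums to 1.
    P:  shape [n_actions, n_states, n_states], each P[a] is row-stochastic.
    """
    n_actions, n_states, _ = P.shape
    P_pi = np.zeros((n_states, n_states))
    for a in range(n_actions):
        # Scale row i of P[a] by the probability of picking action a in state i.
        P_pi += np.diag(pi[:, a]) @ P[a]
    return P_pi

# Random illustrative inputs: Dirichlet samples are valid probability vectors.
rng = np.random.default_rng(0)
n_s, n_a = 4, 3
P_rand = rng.dirichlet(np.ones(n_s), size=(n_a, n_s))   # shape [n_a, n_s, n_s]
pi_rand = rng.dirichlet(np.ones(n_a), size=n_s)         # shape [n_s, n_a]

P_pi_rand = policy_transition_matrix(pi_rand, P_rand)
print(P_pi_rand.shape)
print(np.allclose(P_pi_rand.sum(axis=1), 1.0))
```

The rows sum to 1 because each row is a convex combination of rows that already sum to 1.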

---

### Step 4. Interpretation

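* Row `i` of `P_pi` is the distribution over next states from state `i` once the policy's action choices have been averaged into the environment's stochastic transitions.
* This lets you treat the entire process as a **Markov Reward Process (MRP)**:
  $$
  V^\pi = R^\pi + \gamma P^\pi V^\pi
  $$
  where $R^\pi(i) = \sum_a \pi(a \mid i)\, \mathcal{R}_i^a$ is the expected immediate reward in state `i` under the policy. The equation is linear in $V^\pi$, so it can be solved directly or evaluated iteratively.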
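
Rearranging the MRP equation gives $(I - \gamma P^\pi) V^\pi = R^\pi$, which has a unique solution when $\gamma < 1$. The following is a minimal sketch of that direct solve. `P_pi_example` holds the policy-averaged transition matrix for this notebook's 3-state example, while `R_pi_example` and `gamma_example` are hypothetical values chosen only for illustration.

```python
import numpy as np

# P^pi for the 3-state example used in this notebook.
P_pi_example = np.array([
    [0.83, 0.17, 0.00],
    [0.15, 0.65, 0.20],
    [0.45, 0.21, 0.34],
])
R_pi_example = np.array([0.0, 0.0, 1.0])  # hypothetical expected reward per state
gamma_example = 0.9                       # hypothetical discount factor

# Solve (I - gamma * P^pi) V = R^pi rather than forming the inverse explicitly.
n = P_pi_example.shape[0]
V_pi = np.linalg.solve(np.eye(n) - gamma_example * P_pi_example, R_pi_example)

# Sanity check: V^pi should satisfy V = R^pi + gamma * P^pi V.
residual = np.max(np.abs(V_pi - (R_pi_example + gamma_example * P_pi_example @ V_pi)))
print("V^pi =", V_pi)
print("Bellman residual:", residual)
```

A direct solve costs roughly cubic time in the number of states, so for large state spaces iterating the equation is usually the more practical choice.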

---

### Step 5. Summary

| Symbol | Meaning | Python representation |
| :----- | :------ | :-------------------- |
| $\pi[a \mid i]$ | Probability of choosing action `a` in state `i` | `pi[i, a]` |
| `P[a][i][j]` | Probability of going to `j` from `i` taking action `a` | `P[a, i, j]` |
| $P^{\pi}[i,j]$ | Overall probability of `i→j` under policy `π` | `P_pi[i, j] = Σ_a pi[i, a] * P[a, i, j]` |

---

In [1]:
import numpy as np

# Suppose:
nS, nA = 3, 2  # 3 states, 2 actions

# Environment transition probabilities: P[a][s][s']
P = np.array([
    [[0.8, 0.2, 0.0],   # action 0 transitions
     [0.1, 0.6, 0.3],
     [0.0, 0.3, 0.7]],

    [[0.9, 0.1, 0.0],   # action 1 transitions
     [0.2, 0.7, 0.1],
     [0.5, 0.2, 0.3]]
])

# Policy π[s,a] — probability of choosing each action in each state
pi = np.array([
    [0.7, 0.3],  # in state 0, 70% choose a0, 30% choose a1
    [0.5, 0.5],
    [0.1, 0.9]
])

# Compute P^π = sum_a π[s,a] * P[a][s,:]
P_pi = np.einsum('sa,asj->sj', pi, P)  # vectorized expectation over actions

print("P^π =\n", P_pi)
print("Each row sums to:", P_pi.sum(axis=1))

P^π =
 [[0.83 0.17 0.  ]
 [0.15 0.65 0.2 ]
 [0.45 0.21 0.34]]
Each row sums to: [1. 1. 1.]


### Value Iteration

1. Set $v_0 = [0, \ldots, 0]$
2. For $i = 0, 1, 2, \ldots$, until $\max_s |v_{i+1}(s) - v_i(s)|$ falls below a chosen tolerance:

   - For all states $s$, update
     $$
     v_{i+1}(s) = \max_a \Big( \mathcal{R}_s^a + \gamma \sum_{s'} P(s'|s,a) \, v_i(s') \Big)
     $$

---

**Interpretation**

- $v_i(s)$ → best expected discounted return achievable within *i* steps, starting from state *s*.
- $\mathcal{R}_s^a$ → immediate reward for taking action *a* in state *s*.
- $P(s'|s,a)$ → probability of reaching *s′* given *(s,a)*.
- $\gamma$ → discount factor (how much we value future rewards).
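
To illustrate the first point, the sketch below applies the update twice starting from $v_0 = 0$. After one sweep, $v_1(s) = \max_a \mathcal{R}_s^a$, the best one-step reward; the second sweep adds the discounted value of the best follow-up step. The function name `bellman_backup`, the array layout (`R[s, a]`, `P[a, s, s']`), and all numbers are assumptions made up for illustration.

```python
import numpy as np

def bellman_backup(v, R, P, gamma):
    """One sweep: v_new[s] = max_a (R[s, a] + gamma * sum_s' P[a, s, s'] * v[s'])."""
    expected_next = P @ v              # expected_next[a, s] = sum_s' P[a, s, s'] * v[s']
    q = R + gamma * expected_next.T    # q[s, a]
    return q.max(axis=1)

# Hypothetical deterministic 2-state, 2-action MDP.
P_small = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # action 0: stay in the current state
    [[0.0, 1.0], [1.0, 0.0]],   # action 1: switch to the other state
])
R_small = np.array([
    [0.0, 1.0],   # state 0: reward 0 for staying, 1 for switching
    [2.0, 0.0],   # state 1: reward 2 for staying, 0 for switching
])
gamma_small = 0.5

v0 = np.zeros(2)
v1 = bellman_backup(v0, R_small, P_small, gamma_small)   # one-step horizon
v2 = bellman_backup(v1, R_small, P_small, gamma_small)   # two-step horizon
print(v1)
print(v2)
```

Here `v1` is `[1, 2]` (the largest immediate reward in each state), and `v2` is `[2, 3]`: from state 0 the best plan is to switch (reward 1) and then collect the discounted one-step value of state 1 (0.5 × 2).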

---


In [None]:
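# Assumes `np` and `P` (shape [n_actions, n_states, n_states]) from the cell above.
# R, gamma, tol, and max_iters are not defined elsewhere in this notebook;
# the values below are illustrative assumptions, so substitute your own.
n_actions, n_states = P.shape[0], P.shape[1]
R = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])  # assumed rewards R[s, a]
gamma, tol, max_iters = 0.9, 1e-8, 1000

V = np.zeros(n_states)
for i in range(max_iters):
    V_new = np.zeros_like(V)
    for s in range(n_states):
        # Bellman optimality backup: value of the best action from state s
        V_new[s] = max(R[s, a] + gamma * sum(P[a][s][s2] * V[s2] for s2 in range(n_states))
                       for a in range(n_actions))
    delta = np.max(np.abs(V_new - V))  # largest change in this sweep
    V = V_new                          # keep the newest estimate, even on the final sweep
    if delta < tol:
        break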