### **Q3: 5×5 Gridworld — Value Iteration (γ = 0.99)**

**Goal**
Compute the optimal state-value function v*(s) and optimal policy π*(s).

Notation (same as lecture):
- states: s ∈ S, written as s_{i,j} (row i, column j, both 0-indexed from the top-left cell)
- actions: a ∈ A(s), with a1=Right, a2=Down, a3=Left, a4=Up
- transition: s' = δ(s,a) (deterministic; invalid move ⇒ s'=s)
- reward: R(s)
- optimal value: v*(s)
- optimal policy: π*(s)
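The deterministic transition δ can be sketched in a few lines. This is only an illustration of the rule "invalid move ⇒ s'=s"; the names `N`, `ACTIONS`, and `delta` are hypothetical and are not taken from the assignment's module.

```python
# Minimal sketch of the transition function delta(s, a).
# Assumption: states are (row, column) tuples, 0-indexed, on a 5x5 grid.
N = 5

# Action labels follow the notation above: a1=Right, a2=Down, a3=Left, a4=Up.
ACTIONS = {
    "a1": (0, 1),   # Right: column + 1
    "a2": (1, 0),   # Down:  row + 1
    "a3": (0, -1),  # Left:  column - 1
    "a4": (-1, 0),  # Up:    row - 1
}


def delta(state, action):
    """Return the next state; a move that leaves the grid keeps the agent in place."""
    i, j = state
    di, dj = ACTIONS[action]
    ni, nj = i + di, j + dj
    if 0 <= ni < N and 0 <= nj < N:
        return (ni, nj)
    return state


print(delta((0, 0), "a1"))  # valid move to the right
print(delta((0, 0), "a4"))  # moving up from the top row is invalid
```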

In this grid, the goal state is s_{4,4} (bottom-right cell).

Grey states are valid but unfavorable: { s_{2,2}, s_{3,0}, s_{0,4} }.

#### **Reward function R(s)**

R(s) =
- +10  if s = s_{4,4} (Goal)
- −5   if s ∈ S_grey = { s_{2,2}, s_{3,0}, s_{0,4} }
- −1   otherwise
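As a small illustration, the reward function above maps directly onto a lookup. The names `GOAL`, `GREY`, and `reward` are hypothetical, and states are assumed to be (row, column) tuples.

```python
# Minimal sketch of R(s) for the 5x5 grid described above.
GOAL = (4, 4)
GREY = {(2, 2), (3, 0), (0, 4)}


def reward(state):
    """Reward attached to a state: +10 for the goal, -5 for grey, -1 otherwise."""
    if state == GOAL:
        return 10.0
    if state in GREY:
        return -5.0
    return -1.0


print(reward((4, 4)), reward((2, 2)), reward((0, 0)))
```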

#### **Value Iteration update (Bellman optimality)**

For each non-terminal state s:

v_{k+1}(s) = max over actions a of  [ R(s') + γ v_k(s') ] , where  s' = δ(s,a)

Stop when the maximum change becomes very small: max over s of | v_{k+1}(s) − v_k(s) | < θ

Then extract the greedy policy: π*(s) = argmax over a of [ R(s') + γ v*(s') ]
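The update, stopping rule, and policy extraction above can be put together in a compact sketch. It is not the code in the project module: all names are hypothetical, and it assumes that the goal is terminal with value 0, that V starts at 0 everywhere, and that ties between equally good actions are broken in the order Right, Down, Left, Up.

```python
import itertools

N, GAMMA, THETA, MAX_SWEEPS = 5, 0.99, 1e-8, 10_000
GOAL = (4, 4)
GREY = {(2, 2), (3, 0), (0, 4)}
MOVES = {"→": (0, 1), "↓": (1, 0), "←": (0, -1), "↑": (-1, 0)}
STATES = list(itertools.product(range(N), range(N)))


def delta(s, a):
    """Deterministic transition; an invalid move leaves the state unchanged."""
    ni, nj = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    return (ni, nj) if 0 <= ni < N and 0 <= nj < N else s


def reward(s):
    return 10.0 if s == GOAL else (-5.0 if s in GREY else -1.0)


def q(s, a, V):
    """One-step lookahead R(s') + gamma * V(s') with s' = delta(s, a)."""
    s_next = delta(s, a)
    return reward(s_next) + GAMMA * V[s_next]


def value_iteration():
    V = {s: 0.0 for s in STATES}
    for k in range(1, MAX_SWEEPS + 1):
        # Synchronous update: every new value is computed from the old table V.
        V_new = {s: 0.0 if s == GOAL else max(q(s, a, V) for a in MOVES)
                 for s in STATES}
        change = max(abs(V_new[s] - V[s]) for s in STATES)
        V = V_new
        if change < THETA:
            break
    # Greedy policy with respect to the converged values.
    pi = {s: max(MOVES, key=lambda a: q(s, a, V)) for s in STATES if s != GOAL}
    return V, pi, k


V, pi, sweeps = value_iteration()
print("sweeps:", sweeps)
for i in range(N):
    print("  ".join(f"{V[(i, j)]:6.2f}" for j in range(N)))
```

Under these assumptions the sketch should agree with the value matrix printed later in this notebook, for example a value of about 2.53 in the top-left cell and 10.00 in the two cells adjacent to the goal.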

#### **Running the implementation**

The Value Iteration algorithm for the 5×5 Gridworld is implemented in a separate Python module,
`src/q3/gridworld_vi_oop.py`.

We import the function `run_q3_variations`, which:

- Builds the Gridworld MDP (states, actions, transition δ, reward R)
- Runs both standard (synchronous) and in-place Value Iteration using γ = 0.99
- Computes the optimal state-value function v*(s)
- Computes the greedy optimal policy π*(s)
- Logs how the value matrix V_k (and policy π_k) change over iterations k
- Saves readable matrix snapshots and numeric values to log files

This notebook only **calls** the function and displays results. The algorithm itself is **not duplicated here**, following good software design practice.

In [1]:
import sys
from pathlib import Path

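# Walk up from the current directory until a folder containing "src" is found
p = Path().resolve()
while p != p.parent and not (p / "src").exists():
    p = p.parent

# Put the project root first on the import path so that `from src...` resolves
sys.path.insert(0, str(p))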
print("Project root set to:", p)

Project root set to: C:\Users\user\1557_VSC\AI_Sem2\Reinforcement Learning Programming\CSCN8020_Assignment1


#### **Executing Value Iteration and viewing results**

We now run Value Iteration on the 5×5 Gridworld using:

- discount factor γ = 0.99
- a very small convergence threshold θ = 1e-8
- a maximum of 200,000 iterations, which acts only as a safety cap (convergence itself follows from γ < 1)

During execution:

- V_k (the state-value matrix at iteration k) is updated repeatedly
- At selected iterations k, the entire 5×5 value matrix is logged
- A snapshot of the matrix values is also saved to a CSV file for analysis or plotting

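The function returns a dictionary whose entries include:

- `mdp`: the Gridworld MDP object
- `V_std`, `pi_std`: the optimal state-value function V* and optimal policy π* from the standard variant
- `iters_std`, `iters_ip`: the number of iterations each variant needed to converge
- `time_std`, `time_ip`: the runtime of each variant in seconds
- `same_V`, `same_pi`: whether the two variants produced the same V* and π*
- the paths to the human-readable log file and to the CSV file containing numeric snapshots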

After convergence, we print:

- whether the standard and in-place variants agree on V* and π*
- the number of iterations and the runtime of each variant
- the final optimal value function V*
- the final greedy optimal policy π*

In [2]:
from src.q3.gridworld_vi_oop import run_q3_variations

out = run_q3_variations(gamma=0.99, theta=1e-8, max_iters=200000, snapshot_every=50)
mdp = out["mdp"]

print("Same V*?", out["same_V"])
print("Same π*?", out["same_pi"])
print("Standard:", out["iters_std"], "iters,", round(out["time_std"], 6), "sec")
print("In-place:", out["iters_ip"], "iters,", round(out["time_ip"], 6), "sec")

print("\nV* (standard):\n")
print(mdp.format_V(out["V_std"], decimals=2))

print("\nπ* (standard):\n")
print(mdp.format_pi(out["pi_std"]))

Same V*? True
Same π*? True
Standard: 9 iters, 0.030674 sec
In-place: 9 iters, 0.034413 sec

V* (standard):

  2.53    3.56    4.61    5.67    6.73*
  3.56    4.61    5.67    6.73    7.81 
  4.61    5.67    6.73*   7.81    8.90 
  5.67*   6.73    7.81    8.90   10.00 
  6.73    7.81    8.90   10.00     G   

π* (standard):

 →   →   →   ↓   ↓ 
 →   →   →   →   ↓ 
 →   ↓   →   →   ↓ 
 →   →   →   →   ↓ 
 →   →   →   →   G 


#### **TASK 2**

| Aspect                   | Standard Value Iteration (Synchronous)             | In-Place Value Iteration                                                 |
| ------------------------ | -------------------------------------------------- | ------------------------------------------------------------------------ |
| Update style             | Uses a separate V_new table each iteration         | Updates V(s) directly and reuses updated values within the same sweep    |
| Converged V*             | Yes (matches exactly)                              | Yes (matches exactly)                                                    |
| Converged π*             | Yes (matches exactly)                              | Yes (matches exactly)                                                    |
| Same V*?                 | True                                               | True                                                                     |
| Same π*?                 | True                                               | True                                                                     |
| Iterations (k)           | 9                                                  | 9                                                                        |
| Optimization time        | ~0.031 seconds (this run)                          | ~0.034 seconds (this run)                                                |
| Interpretation           | Slightly faster in this run                        | Slightly slower in this run; with only 25 states the gap is negligible   |
| Computational complexity | O(K · number_of_states · number_of_actions)        | O(K · number_of_states · number_of_actions)                              |
| Practical note           | Stable and easy to reason about                    | Can converge in fewer iterations on larger problems                      |
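The difference in update style can be seen on a toy problem. The sketch below is not the gridworld and not the project code: it uses a hypothetical four-state chain 0 → 1 → 2 → 3 with a single action, a reward of −1 per step, no discounting, and state 3 as terminal. All function names are made up for the illustration.

```python
def sweep(V, states, backup, in_place):
    """Back up every state once; return the largest absolute change."""
    # Synchronous: read from a frozen copy of the old values.
    # In-place: read from V itself, so earlier updates in this sweep are visible.
    source = V if in_place else dict(V)
    change = 0.0
    for s in states:
        new_value = backup(s, source)
        change = max(change, abs(new_value - V[s]))
        V[s] = new_value
    return change


def count_sweeps(states, backup, in_place, theta=1e-8):
    """Number of sweeps until the largest change drops below theta."""
    V = {s: 0.0 for s in states}
    sweeps = 1
    while sweep(V, states, backup, in_place) >= theta:
        sweeps += 1
    return sweeps


def chain_backup(s, V):
    """Toy chain: state 3 is terminal, every other state steps to s + 1 for -1."""
    return 0.0 if s == 3 else -1.0 + V[s + 1]


order = [3, 2, 1, 0]  # sweep from the terminal state backwards
sync_sweeps = count_sweeps(order, chain_backup, in_place=False)
inplace_sweeps = count_sweeps(order, chain_backup, in_place=True)
print("synchronous:", sync_sweeps, "in-place:", inplace_sweeps)
```

With this sweep order the in-place variant propagates the terminal value through the whole chain in one pass, while the synchronous variant moves it one state per sweep. The benefit depends on the order in which states are visited, which is one possible reason why both variants needed the same number of iterations on the gridworld.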


#### **Analysis/Conclusion**

In this task, we solved the 5×5 Gridworld using Value Iteration to find the best long-term strategy for reaching the goal while avoiding penalty (grey) states. Each grid cell represents a state, and Value Iteration repeatedly updates a numerical value for each state that reflects how good it is to be there if the agent behaves optimally. We followed the assignment’s reward-on-arrival rule: entering the goal gives +10, entering a grey cell gives −5, and entering any other cell gives −1. With a high discount factor of γ = 0.99, the agent strongly considers future rewards rather than only immediate outcomes.

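Using the standard (synchronous) Value Iteration approach, the algorithm converged in 9 iterations: the largest change in any state value between consecutive sweeps fell below θ = 1e-8, so the values satisfy the Bellman optimality equation to within that tolerance. The final optimal value function V* forms a clear gradient that increases toward the goal. Because rewards are received on arrival, the −5 penalty lowers the value of moving into a grey cell rather than the value of the grey cell itself, so each grey state has the same value as the other cells at the same distance from the goal (for example, 6.73 for both s_{2,2} and s_{1,3}). The corresponding optimal policy π* uses only Right and Down moves and never steps into a grey state, which matches intuitive expectations.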
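The printed values can be checked by hand along the right-hand column, assuming the goal is terminal with v*(s_{4,4}) = 0 (it is shown as G in the output). The last two lines compare the two relevant actions from s_{0,3} and show why the policy moves Down there instead of entering the grey cell s_{0,4}.

```latex
\begin{aligned}
v^*(s_{3,4}) &= R(s_{4,4}) + \gamma\, v^*(s_{4,4}) = 10 + 0.99 \cdot 0 = 10.00 \\
v^*(s_{2,4}) &= R(s_{3,4}) + \gamma\, v^*(s_{3,4}) = -1 + 0.99 \cdot 10.00 = 8.90 \\
v^*(s_{1,4}) &= R(s_{2,4}) + \gamma\, v^*(s_{2,4}) = -1 + 0.99 \cdot 8.90 \approx 7.81 \\
v^*(s_{0,4}) &= R(s_{1,4}) + \gamma\, v^*(s_{1,4}) \approx -1 + 0.99 \cdot 7.81 \approx 6.73 \\[4pt]
\text{Right from } s_{0,3}: \quad & R(s_{0,4}) + \gamma\, v^*(s_{0,4}) \approx -5 + 0.99 \cdot 6.73 \approx 1.67 \\
\text{Down from } s_{0,3}: \quad & R(s_{1,3}) + \gamma\, v^*(s_{1,3}) \approx -1 + 0.99 \cdot 6.73 \approx 5.67
\end{aligned}
```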

To complete Task 2, we also implemented In-Place Value Iteration, where state values are updated directly and immediately reused during the same sweep. We verified that both methods reach the same optimal solution, with Same V* = True and Same π* = True. In our experiment, both approaches converged in 9 iterations, and their runtimes were very similar, with the standard version being slightly faster in this particular run. This small difference is expected given the very small state space.

Both methods have the same computational complexity, O(K · |S| · |A|), where K is the number of iterations, but in-place updates can sometimes reduce K in larger or more complex environments. Overall, these results confirm that Value Iteration reliably finds an optimal policy for this finite MDP, and that the in-place variant is a valid alternative with comparable performance.