### **Q3: 5×5 Gridworld: Value Iteration (γ = 0.99)**

**Goal**
Compute the optimal state-value function v*(s) and the optimal policy π*(s).

Notation (same as lecture):
- states: s ∈ S, written as s_{i,j} (row i, column j)
- actions: a ∈ A(s), with a1=Right, a2=Down, a3=Left, a4=Up
- transition: s' = δ(s,a) (deterministic; invalid move ⇒ s' = s)
- discount factor: γ = 0.99
- reward: R(s)
- optimal value: v*(s)
- optimal policy: π*(s)
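
The deterministic transition in the notation above can be sketched as follows. This is a minimal illustration and not the code in gridworld.py; the names `N`, `ACTIONS`, and `delta` are assumptions introduced here, and states are assumed to be 0-based (row, column) tuples.

```python
# Hypothetical sketch of the deterministic transition s' = delta(s, a).
# Names (N, ACTIONS, delta) are illustrative, not taken from gridworld.py.
N = 5  # the grid is N x N; states are (row, col) with 0-based indices

# Action order follows the notation: a1=Right, a2=Down, a3=Left, a4=Up
ACTIONS = {
    "Right": (0, 1),
    "Down": (1, 0),
    "Left": (0, -1),
    "Up": (-1, 0),
}

def delta(state, action):
    """Return the next state; an invalid move leaves the state unchanged."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < N and 0 <= new_col < N:
        return (new_row, new_col)
    return state  # moving off the grid: s' = s

print(delta((0, 0), "Right"))  # (0, 1)
print(delta((0, 0), "Up"))     # (0, 0), blocked by the top edge
```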

In this grid, the goal state is s_{4,4} (bottom-right cell).

Grey states are valid but unfavorable: {s_{2,2}, s_{3,0}, s_{0,4}}.

#### **Reward function R(s)**

Reward is given on arrival: r = R(s'), with R(s) defined as:
- +10 if s = s_{4,4} (Goal)
- −5 if s ∈ S_grey = { s_{2,2}, s_{3,0}, s_{0,4} }
- −1 otherwise
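
As an illustration of the reward definition above, the following sketch stores R(s) as a flat, row-major list of 25 entries. The helper names are assumptions made for this example; the assignment's own `GridWorld.get_reward_list()` may be organized differently.

```python
# Hypothetical sketch: R(s) stored as a row-major list (index = row * N + col).
N = 5
GOAL = (4, 4)
GREY = {(2, 2), (3, 0), (0, 4)}

def reward(state):
    """Reward received on arrival in `state`."""
    if state == GOAL:
        return 10.0
    if state in GREY:
        return -5.0
    return -1.0

reward_list = [reward((r, c)) for r in range(N) for c in range(N)]

print(len(reward_list))  # 25
print(reward_list[:5])   # [-1.0, -1.0, -1.0, -1.0, -5.0]
```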

#### **Value Iteration update (Bellman optimality)**

For each non-terminal state s: $v_{k+1}(s) = \max_{a \in A(s)} \left[ R(s') + \gamma\, v_k(s') \right]$, where $s' = \delta(s,a)$.

Stop when the maximum change becomes very small: $\max_{s} \left| v_{k+1}(s) - v_k(s) \right| < \theta$

Then extract the greedy policy: $\pi^*(s) = \arg\max_{a \in A(s)} \left[ R(s') + \gamma\, v^*(s') \right]$
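
The update, stopping rule, and greedy extraction above can be sketched in a few lines of Python. This is a minimal stand-alone illustration and not the code in value_iteration_solved.py. All names are hypothetical, and the goal is assumed to be terminal with its value fixed at 0.

```python
# Minimal sketch of synchronous value iteration on the 5x5 grid.
# All names are illustrative; this is not the assignment's implementation.
N, GAMMA, THETA = 5, 0.99, 1e-8
GOAL = (4, 4)
GREY = {(2, 2), (3, 0), (0, 4)}
ACTIONS = {"Right": (0, 1), "Down": (1, 0), "Left": (0, -1), "Up": (-1, 0)}
STATES = [(r, c) for r in range(N) for c in range(N)]

def delta(s, a):
    """Deterministic transition; an invalid move keeps the agent in place."""
    r, c = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    return (r, c) if 0 <= r < N and 0 <= c < N else s

def reward(s):
    """Reward on arrival in s."""
    return 10.0 if s == GOAL else -5.0 if s in GREY else -1.0

def backup(s, a, v):
    """One-step lookahead R(s') + gamma * v(s')."""
    s_next = delta(s, a)
    return reward(s_next) + GAMMA * v[s_next]

def value_iteration():
    v = {s: 0.0 for s in STATES}  # assumption: goal is terminal, v(goal) = 0
    sweeps = 0
    while True:
        sweeps += 1
        v_new = dict(v)  # synchronous: every backup reads the old table v
        for s in STATES:
            if s != GOAL:
                v_new[s] = max(backup(s, a, v) for a in ACTIONS)
        change = max(abs(v_new[s] - v[s]) for s in STATES)
        v = v_new
        if change < THETA:
            return v, sweeps

def greedy_policy(v):
    """pi*(s) = argmax_a [R(s') + gamma * v(s')] for non-terminal states."""
    return {s: max(ACTIONS, key=lambda a: backup(s, a, v))
            for s in STATES if s != GOAL}

v_star, n_sweeps = value_iteration()
pi_star = greedy_policy(v_star)
print("sweeps:", n_sweeps)
print("v*(4,3) =", v_star[(4, 3)], "| pi*(4,3) =", pi_star[(4, 3)])
```

Ties in the argmax are broken by action order (Right first), which is a choice of this sketch rather than something the assignment specifies.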

#### **Running the implementation**

The Value Iteration algorithm for the 5√ó5 Gridworld is implemented in separate Python modules:
* gridworld.py (environment definition: states, actions, transition δ, reward R)
* value_iteration_solved.py (standard synchronous Value Iteration)
* value_iteration_inplace.py (in-place Value Iteration)
* vi_logger.py (logging intermediate matrices and convergence data)

The notebook imports and runs the appropriate functions, which:

- build the Gridworld MDP (states, actions, transition δ, reward R)
- run Value Iteration using γ = 0.99
- compute the optimal state-value function v*(s)
- compute the greedy optimal policy π*(s)
- log how the value matrix V_k (and policy π_k) change over iterations k
- save readable matrix snapshots and numeric values to log files

This notebook only calls the functions and displays the results. The algorithm itself is not duplicated here, in keeping with separation of concerns.

In [6]:
# === Q3 Setup (Lecture 3 DP files) ===
import sys
from pathlib import Path

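# Resolve project root (first ancestor folder that contains "src")
HERE = Path.cwd().resolve()
ROOT = HERE
while not (ROOT / "src").exists() and ROOT != ROOT.parent:
    ROOT = ROOT.parent
if not (ROOT / "src").exists():
    raise FileNotFoundError(f'No "src" folder found at or above {HERE}')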

Q3_DIR = ROOT / "src" / "q3"
if str(Q3_DIR) not in sys.path:
    sys.path.insert(0, str(Q3_DIR))

# Imports from lec3-style Q3 folder
from gridworld import GridWorld
from value_iteration_agent import Agent
from vi_logger import VILogger

ENV_SIZE = 5
GAMMA = 0.99
THETA = 1e-8
MAX_ITERS = 200000

# Where logs should go (project-level)
LOG_DIR = ROOT / "logs" / "q3"
LOG_DIR.mkdir(parents=True, exist_ok=True)

print("ROOT =", ROOT)
print("Q3_DIR =", Q3_DIR)
print("LOG_DIR =", LOG_DIR)

ROOT = C:\Users\user\1557_VSC\AI_Sem2\Reinforcement Learning Programming\CSCN8020_Assignment1
Q3_DIR = C:\Users\user\1557_VSC\AI_Sem2\Reinforcement Learning Programming\CSCN8020_Assignment1\src\q3
LOG_DIR = C:\Users\user\1557_VSC\AI_Sem2\Reinforcement Learning Programming\CSCN8020_Assignment1\logs\q3


#### **Executing Value Iteration and viewing results**

We now run Value Iteration on the 5×5 Gridworld using:

- discount factor γ = 0.99
- a very small convergence threshold θ (`THETA = 1e-8` in the setup cell)
- a large maximum iteration limit (`MAX_ITERS = 200000` in the setup cell), so the sweep limit does not cut the run short

*We evaluate two implementations:*
* Standard (synchronous) value iteration: uses a temporary copy V_new each sweep.
* In-place value iteration: updates V directly and immediately reuses updated values.

*During execution:*
- The state-value matrix V_k is updated repeatedly using the Bellman optimality backup.
- At selected sweeps k, the full 5×5 value table is written to a human-readable .log file.
- Numeric snapshots are also written to a .csv file for analysis/plotting.

*After convergence, we report:*
- the number of sweeps (iterations) taken to converge
- the final optimal state-value function V*
- the final greedy optimal policy π*
- the file paths for the generated .log and .csv outputs

*Note:* Value Iteration is a Dynamic Programming algorithm and is not episode-based. Here, convergence is measured in sweeps (iterations) until the maximum update difference falls below θ.
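
The snapshot logging described above could look roughly like the following sketch. It is not the vi_logger.py implementation, whose interface is not shown in this notebook. The function `log_snapshot` and the file names are hypothetical, and a temporary directory stands in for logs/q3/.

```python
# Hypothetical sketch of per-sweep snapshot logging (not the real VILogger API).
import csv
import tempfile
from pathlib import Path

N = 5

def log_snapshot(log_path, csv_path, sweep, values):
    """Append one N x N value table to a text log and one flat row to a CSV."""
    # Human-readable matrix for the .log file
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(f"Sweep {sweep}\n")
        for row in values:
            log_file.write(" ".join(f"{x:8.3f}" for x in row) + "\n")
        log_file.write("\n")
    # One CSV row per sweep: sweep index followed by the 25 values (row-major)
    with open(csv_path, "a", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow([sweep] + [x for row in values for x in row])

log_dir = Path(tempfile.mkdtemp())  # stand-in for the project's logs/q3 folder
log_path = log_dir / "vi_demo.log"
csv_path = log_dir / "vi_demo.csv"

zeros = [[0.0] * N for _ in range(N)]
log_snapshot(log_path, csv_path, sweep=0, values=zeros)
print(log_path.read_text())
```

Appending one CSV row per sweep keeps the numeric file easy to load for plotting, while the text log stays readable as a sequence of 5×5 tables.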

In [5]:
# Task 1: Reward function as a LIST (reward_list)

env = GridWorld(ENV_SIZE)
R_list = env.get_reward_list()

print("Reward list length:", len(R_list))
print("Reward list (first 10):", R_list[:10])

goal = (4, 4)
grey_states = {(2, 2), (3, 0), (0, 4)}

def to_idx(s):
    return s[0] * ENV_SIZE + s[1]  # row-major index

print("\nGoal state:", goal, "Reward:", R_list[to_idx(goal)])
print("Grey states:", grey_states)
print("Grey rewards:", [R_list[to_idx(s)] for s in sorted(grey_states)])

Reward list length: 25
Reward list (first 10): [-1.0, -1.0, -1.0, -1.0, -5.0, -1.0, -1.0, -1.0, -1.0, -1.0]

Goal state: (4, 4) Reward: 10.0
Grey states: {(0, 4), (3, 0), (2, 2)}
Grey rewards: [-5.0, -5.0, -5.0]


#### **Executing Value Iteration and Viewing Results**

We run Value Iteration on the 5×5 Gridworld using:
* discount factor γ = 0.99
* convergence threshold θ
* a sufficiently large maximum sweep limit, so the run stops on the threshold rather than the limit

*Two implementations are executed:*
* Standard (synchronous) Value Iteration using a temporary copy V_new each sweep
* In-place Value Iteration updating V directly (Gauss–Seidel style)

During execution, the scripts output:
* The final optimal value function V*
* The optimal greedy policy π*
* The number of sweeps (iterations) required for convergence
* Runtime measurements
* Detailed intermediate snapshots saved as .log and .csv files under logs/q3/

Note: Value Iteration is a Dynamic Programming method and is not episode-based. Convergence is measured in sweeps (iterations) until the maximum value update falls below the threshold θ.

In [8]:
# === Run Task 1 (Standard VI) + Task 2 (In-place VI) using lec3 scripts ===
import subprocess
import sys

std_script = ROOT / "src" / "q3" / "value_iteration_solved.py"
ip_script  = ROOT / "src" / "q3" / "value_iteration_inplace.py"

print("Running:", std_script)
subprocess.run([sys.executable, str(std_script)], check=True, cwd=str(Q3_DIR))

print("\nRunning:", ip_script)
subprocess.run([sys.executable, str(ip_script)], check=True, cwd=str(Q3_DIR))

Running: C:\Users\user\1557_VSC\AI_Sem2\Reinforcement Learning Programming\CSCN8020_Assignment1\src\q3\value_iteration_solved.py

Running: C:\Users\user\1557_VSC\AI_Sem2\Reinforcement Learning Programming\CSCN8020_Assignment1\src\q3\value_iteration_inplace.py


CompletedProcess(args=['c:\\Users\\user\\1557_VSC\\AI_Sem2\\Reinforcement Learning Programming\\CSCN8020_Assignment1\\.venv\\Scripts\\python.exe', 'C:\\Users\\user\\1557_VSC\\AI_Sem2\\Reinforcement Learning Programming\\CSCN8020_Assignment1\\src\\q3\\value_iteration_inplace.py'], returncode=0)
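
The cell output above shows only the `Running:` lines and the final `CompletedProcess` object, not what the scripts themselves print. One way to surface that text in the notebook is to capture the child process's standard output. The sketch below uses a one-line stand-in command instead of the actual scripts, and the helper name `run_and_show` is an assumption.

```python
# Sketch: capture a child process's stdout so it can be printed in the notebook.
import subprocess
import sys

def run_and_show(cmd, cwd=None):
    """Run cmd, print its captured stdout, and return the CompletedProcess."""
    result = subprocess.run(
        cmd,
        check=True,           # raise CalledProcessError on a non-zero exit code
        cwd=cwd,
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,            # decode bytes to str
    )
    print(result.stdout, end="")
    return result

# Stand-in command; in the notebook this could be [sys.executable, str(std_script)]
demo = run_and_show([sys.executable, "-c", "print('child output')"])
```

Assigning the result to a variable on the last line also keeps the `CompletedProcess` repr out of the cell output.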

#### **TASK 2: Value Iteration Variations (Standard vs In-Place)**

| Aspect | Standard Value Iteration (Synchronous) | In-Place Value Iteration |
|---|---|---|
| Update style | Uses a separate V_new array each sweep | Updates V(s) directly and immediately reuses updated values |
| Converged to V* | Yes | Yes |
| Converged to π* | Yes | Yes |
| Same V*? | True (within tolerance) | True |
| Same π*? | True (exact match) | True |
| Iterations (sweeps) to converge | 9 | 9 |
| Runtime (seconds) | 0.069475 | 0.142434 |
| "Episodes" note | DP Value Iteration is not episode-based. Here, we report iterations/sweeps until convergence. | Same interpretation |
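
The "Update style" row can be illustrated with a small stand-alone sketch in which a single flag switches between reading from a frozen copy (synchronous) and reading from the table being updated (in-place). The names are hypothetical, the goal is assumed terminal with value 0, and states are swept in row-major order. Sweep counts printed by this sketch come from the sketch itself, not from the assignment scripts.

```python
# Sketch comparing synchronous and in-place sweeps; names are illustrative only.
N, GAMMA, THETA = 5, 0.99, 1e-8
GOAL, GREY = (4, 4), {(2, 2), (3, 0), (0, 4)}
MOVES = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # Right, Down, Left, Up
STATES = [(r, c) for r in range(N) for c in range(N)]

def step(s, move):
    """Deterministic transition; an invalid move keeps the agent in place."""
    r, c = s[0] + move[0], s[1] + move[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else s

def reward(s):
    """Reward on arrival in s."""
    return 10.0 if s == GOAL else -5.0 if s in GREY else -1.0

def run_vi(in_place):
    v = {s: 0.0 for s in STATES}
    sweeps = 0
    while True:
        sweeps += 1
        # Synchronous: read from a frozen copy. In-place: read from v itself,
        # so values updated earlier in this sweep are reused immediately.
        source = v if in_place else dict(v)
        change = 0.0
        for s in STATES:
            if s == GOAL:
                continue  # terminal state keeps value 0
            best = max(reward(step(s, m)) + GAMMA * source[step(s, m)]
                       for m in MOVES)
            change = max(change, abs(best - v[s]))
            v[s] = best
        if change < THETA:
            return v, sweeps

v_sync, k_sync = run_vi(in_place=False)
v_inpl, k_inpl = run_vi(in_place=True)
same_v = all(abs(v_sync[s] - v_inpl[s]) < 1e-6 for s in STATES)
print("same V*:", same_v, "| sweeps (sync, in-place):", k_sync, k_inpl)
```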

#### **Computational Complexity**

Each sweep evaluates all states and all actions, so the total time complexity is $O(K \cdot |S| \cdot |A|)$,

where $K$ is the number of sweeps to convergence. Space complexity is $O(|S|)$.

Standard Value Iteration uses an additional temporary copy V_new each sweep, but memory remains linear in $|S|$.
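
As a rough worked instance of this bound, assume every one of the 25 cells is backed up over all 4 actions in each sweep, and take K = 9 from the comparison table:

```latex
K \cdot |S| \cdot |A| = 9 \cdot 25 \cdot 4 = 900 \ \text{one-step backups}
```

If the terminal goal state is skipped in each sweep, the count would instead be 9 · 24 · 4 = 864.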

#### **Implementation Source**

This solution is based on the Lecture 3 Dynamic Programming implementation (gridworld.py, value_iteration_agent.py, value_iteration_solved.py).

The environment and value iteration logic follow the Bellman optimality equation. For Q3, the reward structure was modified to match the assignment requirements (goal +10, grey −5, others −1), and an in-place value iteration variant was implemented for comparison.

Performance comparison was done by running both implementations with identical parameters and comparing the resulting V*, π*, number of sweeps, and runtime.

#### **Analysis/Conclusion**

This problem models the 5×5 Gridworld as a Markov Decision Process (MDP) with:

- **States**: grid cells s_{i,j} (row, column), including a goal at s_{4,4}.
- **Actions**: right, down, left, up, with deterministic transitions (invalid moves keep the agent in the same state).
- **Rewards** (implemented as a reward list R_list):  
  - +10 at the goal state s_{4,4}  
  - -5 at grey states {s_{2,2}, s_{3,0}, s_{0,4}} 
  - -1 for all other non-terminal states

Using the Bellman optimality update

$V_{k+1}(s) = \max_{a} \left[ R(s') + \gamma\, V_k(s') \right]$

(with deterministic s' = δ(s,a)), both implementations converged to the same optimal value function V* and the same greedy optimal policy π*, as recorded in the Same V* and Same π* rows of the comparison table above. The gradient of values across the grid reflects the discounted propagation of future reward from the terminal state, while penalty states locally reduce surrounding values.

**Convergence / Performance / Complexity**

Both standard (synchronous) and in-place value iteration converged to the same optimal solution (V* and π*) in 9 sweeps. With snapshot logging enabled every sweep, measured runtimes were 0.069475 s (standard) and 0.142434 s (in-place). Since both variants performed the same number of sweeps, this difference is most likely due to per-iteration file I/O overhead rather than algorithmic differences. Because Value Iteration is DP (not episodic), the reported "episodes" correspond to sweeps until convergence. Time complexity is O(K·|S|·|A|) and space is O(|S|) (standard VI uses an extra temporary V array but remains linear).