# Reinforcement Learning Programming - CSCN 8020
### Assignment 1
* Student Name: Reham Abuarqoub 
* Student ID: 9062922


# Problem 1 

### Problem 1: Pick-and-Place Robot as an MDP

**What we want:** a robot arm that picks an object and drops it in a bin **quickly** and **smoothly**, without collisions or drops.

---

#### MDP definition

**States (S) — “what the robot knows right now”**
- **Robot:** joint angles \( \theta_{1:k} \), joint velocities \( \dot\theta_{1:k} \)  
- **Gripper:** open/closed, and a flag “object grasped?”  
- **Task context:** end-effector pose, object pose (estimated), distance to pick/goal points  
- **Safety/status (optional but useful):** contact/collision bit, time step / steps remaining

> Why these? To choose smooth torques, the agent needs positions **and** velocities; to finish the task it must know where the object and goal are; safety bits stop it from learning risky moves.

---

**Actions (A) — “what the robot can do”**
- **Preferred (low-level control):** joint **torques** \( \tau_{1:k} \) or joint **velocity commands**  
- **Plus:** a discrete **gripper** command (open/close)

> Direct motor commands let the policy shape motion smoothness instead of just hopping between waypoints.

---

**Transitions (P) — “what happens next”**
- Deterministic robot dynamics with small noise.  
- Invalid moves (beyond joint limits) are clipped.  
- Collisions or hard constraint violations send the episode to a failure terminal state.

---

**Rewards (R) — “what we encourage”**
- **+100** at successful place (object inside target tolerance) → episode ends.  
- **−1 per time-step** → finish faster.  
- **Smoothness penalty:** \( -\lambda \sum_i |\ddot\theta_i| \) *or* \( -\lambda \, \|a_t - a_{t-1}\| \) → discourages jerky motion.  
- **−50** on collision, dropping the object, or leaving the workspace.  
- **+5** one-time bonus when a stable grasp is first achieved (optional shaping).

> Big terminal reward sets the goal. The small step cost makes it hurry.
> The smoothness term is the knob that trades speed vs. elegance. Safety penalties block bad shortcuts.
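
The reward terms above can be combined into one function. The following is a minimal sketch, not a definitive implementation: the function name `pick_place_reward`, its arguments, and the default `smooth_weight` (standing in for \( \lambda \), with an assumed value of 0.1) are illustrative choices that do not come from the assignment. It also assumes the step cost is still charged on the terminal step.

```python
import numpy as np

def pick_place_reward(action, prev_action, placed, failed, first_grasp,
                      smooth_weight=0.1):
    """Return (reward, done) for one time step of the pick-and-place MDP.

    All argument names and the default smooth_weight are assumptions
    made for illustration only.
    """
    # Per-step cost: encourages finishing quickly.
    reward = -1.0
    # Smoothness penalty: -lambda * ||a_t - a_{t-1}||.
    diff = np.asarray(action, dtype=float) - np.asarray(prev_action, dtype=float)
    reward -= smooth_weight * float(np.linalg.norm(diff))
    # One-time shaping bonus when a stable grasp is first achieved.
    if first_grasp:
        reward += 5.0
    # Terminal outcomes end the episode.
    if failed:   # collision, dropped object, or leaving the workspace
        return reward - 50.0, True
    if placed:   # object inside the target tolerance
        return reward + 100.0, True
    return reward, False

# A large change in the command is penalized more than a small one.
print(pick_place_reward([3.0, 4.0], [0.0, 0.0], False, False, False))
print(pick_place_reward([0.1, 0.0], [0.0, 0.0], False, False, False))
```

Keeping every term in one place makes it easy to tune `smooth_weight` and see how the speed versus smoothness trade-off shifts.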

---

**Episode setup**
- **Start:** object on table, arm near a home pose.  
- **End:** success, failure (collision/drop), or time limit \(T_{\max}\).

**Discount factor**
- \( \gamma = 0.9 \) (weights near-term rewards more heavily, which reinforces the step cost and favors shorter episodes; smoothness is driven by the penalty term rather than by discounting).

---

#### Why this design fits “fast and smooth”
- **Direct control** (torques/velocities) gives the agent the ability to make gentle trajectories.
- **Step cost** pushes it to be quick; **smoothness penalty** keeps it from “sprinting and slamming.”
- **Terminal rewards/penalties** make learning stable and task-focused.
- The state covers what is needed to plan motion and avoid trouble: joint positions and velocities, gripper status, task geometry, and the optional safety flags.

*If implementing a quick tabular prototype, discretize angles/velocities and use small discrete action increments; for real robots, use function approximation (e.g., actor–critic) with normalized, clipped observations.*
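
For the tabular prototype mentioned above, one possible way to discretize continuous joint angles is uniform binning. This is only a sketch under stated assumptions: the joint range of \([-\pi, \pi]\), the 12 bins per joint, and the helper name `discretize_angles` are all illustrative and not part of the assignment.

```python
import numpy as np

N_BINS = 12                 # assumed number of bins per joint
LOW, HIGH = -np.pi, np.pi   # assumed joint limits

def discretize_angles(theta, n_bins=N_BINS, low=LOW, high=HIGH):
    """Map continuous joint angles to integer bin indices in [0, n_bins - 1]."""
    # Clip first so out-of-range readings land in the outermost bins.
    theta = np.clip(np.asarray(theta, dtype=float), low, high)
    # Interior bin edges only; np.digitize then returns indices 0..n_bins-1.
    edges = np.linspace(low, high, n_bins + 1)[1:-1]
    return tuple(int(i) for i in np.digitize(theta, edges))

# Three joints: at the lower limit, slightly above zero, and beyond the upper limit.
print(discretize_angles([-np.pi, 0.1, 10.0]))
```

The resulting tuple can be used directly as a key into a Q-table stored in a dictionary.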


# Problem 2

## Problem 2 — Two Iterations of Value Iteration

**Setup**  
- Grid layout:  
  `s1  s2`  
  `s3  s4`
- Rewards: `R(s1)=5, R(s2)=10, R(s3)=1, R(s4)=2`
- Discount: γ = 0.9  
- Transition: deterministic; if action hits wall → stay in same state.  
- Bellman backup:  
  \[
  V_{k+1}(s) = \max_a \big[ R(s) + \gamma V_k(s') \big]
  \]

---

### Iteration 1

1) **Initial value function \(V_0\):**

| State | V₀ |
|-------|----|
| s1    | 0  |
| s2    | 0  |
| s3    | 0  |
| s4    | 0  |

2) **Update step (future term is zero since V₀=0):**

| State | Calculation          | V₁ |
|-------|----------------------|----|
| s1    | 5 + 0 = 5            | 5  |
| s2    | 10 + 0 = 10          | 10 |
| s3    | 1 + 0 = 1            | 1  |
| s4    | 2 + 0 = 2            | 2  |

3) **Updated value function \(V_1\):**

| State | V₁ | Greedy action(s) |
|-------|----|------------------|
| s1    | 5  | any (all tie)    |
| s2    | 10 | any (all tie)    |
| s3    | 1  | any (all tie)    |
| s4    | 2  | any (all tie)    |

---

### Iteration 2

**Start with V₁ = {s1=5, s2=10, s3=1, s4=2}.**

| State | Actions considered | Calculation (max)                  | V₂  | Greedy action |
|-------|--------------------|-------------------------------------|-----|----------------|
| s1    | right→s2, down→s3, up/left→s1 (wall) | max(5+0.9·10, 5+0.9·1, 5+0.9·5) = max(14, 5.9, 9.5) | 14  | right |
| s2    | left→s1, down→s4, up/right→s2 (wall) | max(10+0.9·5, 10+0.9·2, 10+0.9·10) = max(14.5, 11.8, 19) | 19  | up or right (wall, stays in s2) |
| s3    | up→s1, right→s4, down/left→s3 (wall) | max(1+0.9·5, 1+0.9·2, 1+0.9·1) = max(5.5, 2.8, 1.9) | 5.5 | up |
| s4    | up→s2, left→s3, down/right→s4 (wall) | max(2+0.9·10, 2+0.9·1, 2+0.9·2) = max(11, 2.9, 3.8) | 11  | up |

**Updated value function \(V_2\):**

| State | V₂  | Greedy action |
|-------|-----|---------------|
| s1    | 14  | right         |
| s2    | 19  | up or right (wall, stays in s2) |
| s3    | 5.5 | up            |
| s4    | 11  | up            |

---

### Takeaway
After two iterations, the value function already pushes the agent toward **s2** (the highest reward). The greedy policy shows clear direction:  
- From s1 → move **right** to s2  
- From s3 → move **up** to s1  
- From s4 → move **up** to s2  
- From s2 → best to **stay**, by moving up or right into the wall (s2 already has the highest reward).  
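
The two hand-computed sweeps can be checked with a few lines of Python. This is a minimal sketch: the dictionary names `REWARD` and `NEXT` and the helper `backup` are illustrative, and the four actions (up, down, left, right) with wall moves keeping the state are taken from the setup above.

```python
GAMMA = 0.9
REWARD = {"s1": 5.0, "s2": 10.0, "s3": 1.0, "s4": 2.0}

# Deterministic successors on the 2x2 grid; a move into a wall keeps the state.
NEXT = {
    "s1": {"up": "s1", "down": "s3", "left": "s1", "right": "s2"},
    "s2": {"up": "s2", "down": "s4", "left": "s1", "right": "s2"},
    "s3": {"up": "s1", "down": "s3", "left": "s3", "right": "s4"},
    "s4": {"up": "s2", "down": "s4", "left": "s3", "right": "s4"},
}

def backup(V):
    """One synchronous sweep: V_{k+1}(s) = max_a [R(s) + gamma * V_k(s')]."""
    return {s: max(REWARD[s] + GAMMA * V[nxt] for nxt in NEXT[s].values())
            for s in NEXT}

V0 = {s: 0.0 for s in NEXT}
V1 = backup(V0)
V2 = backup(V1)
for s in NEXT:
    print(s, V1[s], round(V2[s], 2))
```

The printed columns should match the \(V_1\) and \(V_2\) tables above.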


# Problem 3

## Task 1: 5×5 Gridworld (Value Iteration)

**Environment**
- Grid size: 5×5. State = (row, col) with 0-based indexing.
- **Goal:** (4,4) — terminal/absorbing, reward **+10** on arrival.
- **Grey states:** {(0,4), (1,2), (3,0)} — reward **−5** on arrival.
- **All other states:** reward **−1** on arrival.
- **Actions:** Right, Left, Down, Up. If the move leaves the grid → stay in place.
- **Discount:** γ = 0.9.

**Algorithm (copy-based value iteration)**
- Keep two arrays: `V` (current values) and `V_new` (updates).
- For each state, compute the best one-step lookahead return using **only** `V`, write to `V_new`.
- After the sweep, set `V = V_new`.
- Stop when the largest change across states is below a small threshold `THETA`.

The script prints the optimal value function \(V^\*\) and the greedy policy \(\pi^\*\) as an arrow grid:
- ► ◄ ▼ ▲ for actions,
- **X** for grey cells,
- **G** for the goal.


In [7]:
#!/usr/bin/env python3
"""
Problem 3 — 5x5 Gridworld (stand-alone)

Environment:
- Deterministic transitions; invalid moves keep you in place.
- Reward paid on ARRIVAL to the next state s':
    +10 at the goal (4,4)  [terminal/absorbing]
    -5  at grey states {(0,4), (1,2), (3,0)}
    -1  otherwise
- Discount gamma = 0.9

Algorithm:
- Copy-based value iteration (write updates into V_new, then replace V ← V_new).
- Prints optimal V* and the greedy policy (arrows), with X for grey and G for goal.
"""

import numpy as np

# -----------------------------
# Configuration
# -----------------------------
N = 5
GAMMA = 0.9
THETA = 1e-6
MAX_ITERS = 10_000

GOAL  = (4, 4)
GREYS = {(0, 4), (1, 2), (3, 0)}  # match the figure in the prompt

# action index -> (Δrow, Δcol)
ACTIONS = {
    0: (0, 1),    # Right
    1: (0, -1),   # Left
    2: (1, 0),    # Down
    3: (-1, 0),   # Up
}
ARROW = {0: "►", 1: "◄", 2: "▼", 3: "▲"}

# -----------------------------
# Environment helpers
# -----------------------------
def in_bounds(r: int, c: int) -> bool:
    """True if (r, c) is inside the 5x5 grid."""
    return 0 <= r < N and 0 <= c < N

def step(state: tuple[int, int], a_idx: int):
    """
    Deterministic transition; wall -> stay.
    Returns: (next_state, reward_on_arrival, done)
    """
    if state == GOAL:
        return state, 0.0, True  # absorbing goal

    dr, dc = ACTIONS[a_idx]
    r, c   = state
    nr, nc = r + dr, c + dc
    if not in_bounds(nr, nc):
        nr, nc = r, c  # hit wall: stay

    s_next = (nr, nc)
    if s_next == GOAL:
        return s_next, 10.0, True
    if s_next in GREYS:
        return s_next, -5.0, False
    return s_next, -1.0, False

# -----------------------------
# Value Iteration (copy-based)
# -----------------------------
def value_iteration():
    """
    Copy-based value iteration:
      - Build V_new from the current V (no reuse within the sweep),
      - Replace V ← V_new after each sweep,
      - Stop when the max change is below THETA.
    Returns: (V*, number_of_sweeps)
    """
    V = np.zeros((N, N), dtype=float)  # V(GOAL)=0; the +10 is paid on arrival
    for it in range(MAX_ITERS):
        delta = 0.0
        V_new = V.copy()
        for r in range(N):
            for c in range(N):
                s = (r, c)
                if s == GOAL:
                    V_new[r, c] = 0.0
                    continue

                # One-step lookahead: max over actions of r + gamma * V(next)
                best_q = -1e18
                for a_idx in ACTIONS:
                    s_next, rwd, done = step(s, a_idx)
                    nr, nc = s_next
                    q = rwd + (0.0 if done else GAMMA * V[nr, nc])
                    if q > best_q:
                        best_q = q

                delta = max(delta, abs(best_q - V[r, c]))
                V_new[r, c] = best_q

        V = V_new
        if delta < THETA:
            return V, it + 1

    return V, MAX_ITERS

def greedy_policy_from_V(V: np.ndarray):
    """
    Build a printable grid of the greedy policy:
      ► ◄ ▼ ▲ for regular cells, X for greys, G for goal.
    """
    Pi = np.empty((N, N), dtype=object)
    for r in range(N):
        for c in range(N):
            s = (r, c)
            if s == GOAL:
                Pi[r, c] = "G"
                continue
            if s in GREYS:
                Pi[r, c] = "X"
                continue

            best_a, best_q = None, -1e18
            for a_idx in ACTIONS:
                s_next, rwd, done = step(s, a_idx)
                nr, nc = s_next
                q = rwd + (0.0 if done else GAMMA * V[nr, nc])
                if q > best_q:
                    best_q, best_a = q, a_idx

            Pi[r, c] = ARROW[best_a]
    return Pi

# -----------------------------
# Main
# -----------------------------
if __name__ == "__main__":
    V_star, iters = value_iteration()
    np.set_printoptions(precision=6, suppress=True)
    print(f"Optimal Value Function found in {iters} sweeps:")
    print(V_star)

    pi_star = greedy_policy_from_V(V_star)
    for row in pi_star:
        print(list(row))


Optimal Value Function found in 9 sweeps:
[[-0.434062  0.62882   1.8098    3.122     4.58    ]
 [ 0.62882   1.8098    3.122     4.58      6.2     ]
 [ 1.8098    3.122     4.58      6.2       8.      ]
 [ 3.122     4.58      6.2       8.       10.      ]
 [ 4.58      6.2       8.       10.        0.      ]]
['►', '►', '►', '▼', 'X']
['►', '▼', 'X', '►', '▼']
['►', '►', '►', '►', '▼']
['X', '►', '►', '►', '▼']
['►', '►', '►', '►', 'G']
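
The printed values depend only on the Manhattan distance \(d\) to the goal, which suggests a closed form: an optimal path takes \(d-1\) steps costing \(-1\) each and then arrives at the goal for \(+10\), so \(V^*(d) = -\frac{1-\gamma^{d-1}}{1-\gamma} + 10\,\gamma^{d-1}\) for \(d \ge 1\). The sketch below checks this formula against the printed table. It assumes that every state has a shortest path that avoids the grey cells, which the output is consistent with; the helper name `closed_form_value` is illustrative.

```python
import numpy as np

GAMMA = 0.9
N = 5

def closed_form_value(d, gamma=GAMMA):
    """Return of d-1 steps at -1 each, followed by a +10 arrival at the goal."""
    if d == 0:
        return 0.0  # the goal itself is terminal
    return -(1.0 - gamma ** (d - 1)) / (1.0 - gamma) + 10.0 * gamma ** (d - 1)

# d = Manhattan distance from (r, c) to the goal at (4, 4).
V_closed = np.array([[closed_form_value((N - 1 - r) + (N - 1 - c))
                      for c in range(N)] for r in range(N)])

# Values copied from the printed value-iteration output.
V_printed = np.array([
    [-0.434062, 0.62882, 1.8098, 3.122, 4.58],
    [0.62882, 1.8098, 3.122, 4.58, 6.2],
    [1.8098, 3.122, 4.58, 6.2, 8.0],
    [3.122, 4.58, 6.2, 8.0, 10.0],
    [4.58, 6.2, 8.0, 10.0, 0.0],
])

print(np.allclose(V_closed, V_printed, atol=1e-6))
```

A check like this is a cheap way to catch indexing or reward-placement mistakes in the environment code.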


## Task 2: In-Place Value Iteration (with comparison)

**What changes from Task 1?**  
- We keep **one** value table `V`.  
- When we back up a state `(r,c)`, we **write the new value directly into `V[r,c]`** and then **reuse** that fresh value for subsequent states in the same sweep.  
- This often reduces the number of sweeps to converge because each backup can benefit from the most recent neighbors.

**Goal.** Show that in-place value iteration converges to the **same** optimal value function \(V^\*\) and greedy policy \(\pi^\*\) as the copy-based method from Task 1, and briefly compare performance and complexity.

**Environment (same as Task 1).**  
- Grid 5×5, goal `(4,4)` with **+10** on arrival (terminal).  
- Grey cells `{(0,4), (1,2), (3,0)}` with **−5** on arrival.  
- All other arrivals **−1**.  
- Actions: Right, Left, Down, Up (deterministic; walls ⇒ stay).  
- Discount `γ = 0.9`.

**What to look at in the output.**  
- The printed `V* (in-place)` values and the arrow policy grid (► ◄ ▼ ▲; **X** grey; **G** goal).  
- A small summary that:  
  1) checks the **numerical equality** of `V*` from in-place vs copy-based,  
  2) reports **number of sweeps** and **wall-clock time** for both.  
*(There is no “number of episodes” here—value iteration is a planning algorithm, not Monte Carlo.)*


In [8]:
# Task 2: In-Place Value Iteration (stand-alone, includes a comparison vs copy-based)

import time
import numpy as np

# -----------------------------
# Config
# -----------------------------
N         = 5
GAMMA     = 0.9
THETA     = 1e-6
MAX_ITERS = 10_000

GOAL  = (4, 4)
GREYS = {(0, 4), (1, 2), (3, 0)}  # as in the figure

ACTIONS = {
    0: (0, 1),   # Right
    1: (0, -1),  # Left
    2: (1, 0),   # Down
    3: (-1, 0),  # Up
}
ARROW = {0: "►", 1: "◄", 2: "▼", 3: "▲"}

# -----------------------------
# Environment helpers
# -----------------------------
def in_bounds(r, c):
    return 0 <= r < N and 0 <= c < N

def step(state, a_idx):
    """Deterministic transition; walls keep you in place. Reward paid on ARRIVAL."""
    if state == GOAL:
        return state, 0.0, True
    dr, dc = ACTIONS[a_idx]
    r, c   = state
    nr, nc = r + dr, c + dc
    if not in_bounds(nr, nc):   # wall -> stay put
        nr, nc = r, c
    s_next = (nr, nc)
    if s_next == GOAL:  return s_next, 10.0, True
    if s_next in GREYS: return s_next, -5.0, False
    return s_next, -1.0, False

def policy_grid_from_V(V):
    """Pretty grid: ► ◄ ▼ ▲; X for grey; G for goal (greedy one-step lookahead)."""
    grid = []
    for r in range(N):
        row = []
        for c in range(N):
            s = (r, c)
            if s == GOAL:  row.append("G"); continue
            if s in GREYS: row.append("X"); continue
            best_a, best_q = None, -1e18
            for a_idx in ACTIONS:
                (nr, nc), rwd, done = step(s, a_idx)
                q = rwd + (0.0 if done else GAMMA * V[nr, nc])
                if q > best_q:
                    best_q, best_a = q, a_idx
            row.append(ARROW[best_a])
        grid.append(row)
    return grid

# -----------------------------
# In-Place Value Iteration (this task)
# -----------------------------
def value_iteration_inplace():
    """
    In-place update:
      For each state in the sweep, compute the backup using the *current* V and
      immediately write V[r,c] = backup_value (then reuse it in the same sweep).
    """
    V = np.zeros((N, N), dtype=float)
    t0 = time.perf_counter()
    for it in range(MAX_ITERS):
        delta = 0.0
        for r in range(N):
            for c in range(N):
                s = (r, c)
                if s == GOAL:
                    V[r, c] = 0.0
                    continue
                v_old = V[r, c]
                best  = -1e18
                for a_idx in ACTIONS:
                    (nr, nc), rwd, done = step(s, a_idx)
                    # uses up-to-date V inside this sweep
                    q = rwd + (0.0 if done else GAMMA * V[nr, nc])
                    if q > best:
                        best = q
                V[r, c] = best
                delta   = max(delta, abs(best - v_old))
        if delta < THETA:
            return V, it + 1, time.perf_counter() - t0
    return V, MAX_ITERS, time.perf_counter() - t0

# -----------------------------
# Copy-based Value Iteration (for comparison)
# -----------------------------
def value_iteration_copy_based():
    """
    Copy-based update:
      Build V_new from old V (no reuse within the sweep), then V <- V_new.
    """
    V = np.zeros((N, N), dtype=float)
    t0 = time.perf_counter()
    for it in range(MAX_ITERS):
        delta = 0.0
        V_new = V.copy()
        for r in range(N):
            for c in range(N):
                s = (r, c)
                if s == GOAL:
                    V_new[r, c] = 0.0
                    continue
                best = -1e18
                for a_idx in ACTIONS:
                    (nr, nc), rwd, done = step(s, a_idx)
                    q = rwd + (0.0 if done else GAMMA * V[nr, nc])  # read old V
                    if q > best:
                        best = q
                delta = max(delta, abs(best - V[r, c]))
                V_new[r, c] = best
        V = V_new
        if delta < THETA:
            return V, it + 1, time.perf_counter() - t0
    return V, MAX_ITERS, time.perf_counter() - t0

# -----------------------------
# Run task + comparison
# -----------------------------
np.set_printoptions(precision=6, suppress=True)

V_inp, it_inp, t_inp = value_iteration_inplace()
print("=== In-Place Value Iteration (Task 2) ===")
print(f"Converged in {it_inp} sweeps, time={t_inp:.4f}s")
print("V* (in-place):\n", V_inp)
print("π* (► ◄ ▼ ▲; X=grey; G=goal):")
for row in policy_grid_from_V(V_inp):
    print(row)

# Optional: compare with copy-based to confirm same optimum
V_cpy, it_cpy, t_cpy = value_iteration_copy_based()
same = np.allclose(V_inp, V_cpy, atol=1e-10)
print("\n=== Comparison with copy-based VI (Task 1) ===")
print(f"Copy-based:  sweeps={it_cpy}, time={t_cpy:.4f}s")
print(f"In-place  :  sweeps={it_inp}, time={t_inp:.4f}s")
print("Do both methods reach the same V*? ", "YES" if same else "NO")

# Note: there is no concept of 'episodes' in value iteration (that applies to Monte Carlo).


=== In-Place Value Iteration (Task 2) ===
Converged in 9 sweeps, time=0.0027s
V* (in-place):
 [[-0.434062  0.62882   1.8098    3.122     4.58    ]
 [ 0.62882   1.8098    3.122     4.58      6.2     ]
 [ 1.8098    3.122     4.58      6.2       8.      ]
 [ 3.122     4.58      6.2       8.       10.      ]
 [ 4.58      6.2       8.       10.        0.      ]]
π* (► ◄ ▼ ▲; X=grey; G=goal):
['►', '►', '►', '▼', 'X']
['►', '▼', 'X', '►', '▼']
['►', '►', '►', '►', '▼']
['X', '►', '►', '►', '▼']
['►', '►', '►', '►', 'G']

=== Comparison with copy-based VI (Task 1) ===
Copy-based:  sweeps=9, time=0.0023s
In-place  :  sweeps=9, time=0.0027s
Do both methods reach the same V*?  YES


### Short performance & complexity notes

- **Optimization time / sweeps.**  
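  In-place updates often need **equal or fewer sweeps** than copy-based because each backup can use the freshest neighbor values. How much this helps depends on the sweep order: in the run above both methods took 9 sweeps, most likely because the row-major sweep visits each state before its right and down neighbors, which are the ones on the path to the goal, so the fresh values arrive too late to help. Wall-clock time per sweep is similar.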

- **Computational complexity.**  
  Both methods are \(O(|S||A|)\) per sweep (here, \(25\) states × \(4\) actions).  
  Copy-based uses an extra pass to copy `V` (or to allocate `V_new`), while in-place saves that memory and can converge in fewer sweeps.

- **Solution quality.**  
  Both solve the **same Bellman optimality equations** and must converge to the **same** \(V^\*\) and greedy policy \(\pi^\*\) (as verified by the equality check in the code).
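
To illustrate how much the sweep order matters for in-place updates, the sketch below runs in-place value iteration on the same gridworld twice: once in row-major order and once in the reverse order, which starts next to the goal. It is a compact re-implementation for illustration only; the names `backup`, `inplace_vi`, and `MOVES` are not taken from the cells above.

```python
import numpy as np

N, GAMMA, THETA = 5, 0.9, 1e-6
GOAL = (4, 4)
GREYS = {(0, 4), (1, 2), (3, 0)}
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # Right, Left, Down, Up

def backup(V, r, c):
    """Best one-step lookahead value for state (r, c) under the current V."""
    best = -np.inf
    for dr, dc in MOVES:
        nr, nc = r + dr, c + dc
        if not (0 <= nr < N and 0 <= nc < N):
            nr, nc = r, c  # wall: stay in place
        if (nr, nc) == GOAL:
            q = 10.0  # terminal arrival, nothing to bootstrap from
        else:
            q = (-5.0 if (nr, nc) in GREYS else -1.0) + GAMMA * V[nr, nc]
        best = max(best, q)
    return best

def inplace_vi(order, max_sweeps=10_000):
    """In-place value iteration that visits states in the given order."""
    V = np.zeros((N, N))
    for sweep in range(1, max_sweeps + 1):
        delta = 0.0
        for r, c in order:
            if (r, c) == GOAL:
                continue
            new = backup(V, r, c)
            delta = max(delta, abs(new - V[r, c]))
            V[r, c] = new  # written immediately, reused later in this sweep
        if delta < THETA:
            return V, sweep
    return V, max_sweeps

forward = [(r, c) for r in range(N) for c in range(N)]
V_fwd, sweeps_fwd = inplace_vi(forward)
V_rev, sweeps_rev = inplace_vi(forward[::-1])
print("row-major sweeps:", sweeps_fwd)
print("reverse sweeps  :", sweeps_rev)
print("same values     :", np.allclose(V_fwd, V_rev))
```

Both orders should reach the same values, with the reverse order needing noticeably fewer sweeps because each state is backed up after the neighbors that lead to the goal.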


# Problem 4

## Problem 4: Off-Policy Monte Carlo with Importance Sampling

**Setting (same gridworld as Problem 3)**  
- **Goal:** (4,4) → reward **+10** on arrival (terminal/absorbing)  
- **Grey cells:** {(0,4), (1,2), (3,0)} → reward **−5** on arrival  
- **Other cells:** reward **−1** on arrival  
- **Actions:** Right, Left, Down, Up (deterministic; walls ⇒ stay)  
- **Discount:** \(\gamma = 0.9\)

**Objective**  
Estimate the value function using **off-policy Monte Carlo (MC)** with **Weighted Importance Sampling (WIS)**:
- **Behavior policy** \(b\): uniform random over actions (used to generate episodes).
- **Target policy** \(\pi\): greedy w.r.t. current \(Q(s,a)\) (deterministic control).
- **Update (backward through each episode):**
  - Return \(G \leftarrow \gamma G + r\).
  - Cumulative weight \(W\) starts at 1; at each step multiply by \(\pi(a|s)/b(a|s)=4\) **only** if action = current greedy; otherwise stop (ratio 0).
  - Weighted-IS control update:
    \[
      C(s,a) \leftarrow C(s,a) + W,\qquad
      Q(s,a) \leftarrow Q(s,a) + \frac{W}{C(s,a)}\big(G - Q(s,a)\big)
    \]
- Report \(V_{\pi}(s)=\max_a Q(s,a)\) and the greedy policy grid.
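
The backward update described above can be traced on a tiny hand-made episode. The following is a minimal sketch assuming integer state indices, a uniform behavior policy over all actions, and illustrative names (`wis_backward_update`, the two-state toy episode); it is not the full gridworld implementation.

```python
import numpy as np

def wis_backward_update(Q, C, episode, gamma=0.9):
    """Apply one episode of off-policy MC control with weighted IS, in place.

    Q, C    : arrays of shape (n_states, n_actions)
    episode : list of (state, action, reward) from a uniform behavior policy
    """
    n_actions = Q.shape[1]
    G, W = 0.0, 1.0
    for s, a, r in reversed(episode):
        G = gamma * G + r                    # discounted return from this step
        C[s, a] += W                         # cumulative importance weight
        Q[s, a] += (W / C[s, a]) * (G - Q[s, a])
        if a != int(np.argmax(Q[s])):        # pi(a|s) = 0, so earlier steps get no weight
            break
        W *= n_actions                       # pi(a|s) / b(a|s) = 1 / (1 / n_actions)
    return Q, C

# Toy episode: state 0 takes action 1 (reward -1), then state 1 takes action 0 (reward +10).
Q = np.zeros((2, 4))
C = np.zeros((2, 4))
wis_backward_update(Q, C, [(0, 1, -1.0), (1, 0, 10.0)])
print(Q[1, 0], Q[0, 1], C[0, 1])
```

Here the last step yields \(G = 10\), and the first step yields \(G = 0.9 \cdot 10 - 1 = 8\) with weight \(W = 4\), because the later action matched the greedy one.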

**Comparison to Value Iteration (Problem 3)**  
Also compute \(V^*\) with copy-based Value Iteration for reference and report timing / episodes / a simple MAE between \(V_{\text{MC}}\) and \(V^*\).


In [11]:
# Problem 4 — Off-Policy Monte Carlo with Weighted Importance Sampling (stand-alone)

import time, random
import numpy as np

# -----------------------------
# Environment (matches Problem 3)
# -----------------------------
N = 5
GAMMA = 0.9
MAX_STEPS = 200  # safety cap per episode length

GOAL = (4, 4)
GREYS = {(0, 4), (1, 2), (3, 0)}  # non-favourable cells

# Actions: index -> (dr, dc)
ACTIONS = {0:(0,1), 1:(0,-1), 2:(1,0), 3:(-1,0)}  # Right, Left, Down, Up
ARROW   = {0:"►", 1:"◄", 2:"▼", 3:"▲"}

def in_bounds(r, c):
    return 0 <= r < N and 0 <= c < N

def step(state, a_idx):
    """
    Deterministic transition.
    Reward is paid on ARRIVAL to s_next.
    GOAL is absorbing.
    Returns: (s_next, reward, done)
    """
    if state == GOAL:
        return state, 0.0, True
    dr, dc = ACTIONS[a_idx]
    r, c = state
    nr, nc = r + dr, c + dc
    if not in_bounds(nr, nc):            # wall -> stay
        nr, nc = r, c
    s_next = (nr, nc)
    if s_next == GOAL:  return s_next, 10.0, True
    if s_next in GREYS: return s_next, -5.0, False
    return s_next, -1.0, False

# -----------------------------
# Policies
# -----------------------------
def behavior_action():
    """b(a|s): uniform random over 4 actions."""
    return random.randrange(4)

def greedy_action_from_Q(Q, s):
    """π(a|s): greedy w.r.t. Q(s,a)."""
    r, c = s
    return int(np.argmax(Q[r, c, :]))

# -----------------------------
# Episode generation under b
# -----------------------------
def generate_episode_b():
    """
    Start from a random non-terminal state; roll out with behavior policy b.
    Return trajectory as list of (s, a, r).
    """
    while True:
        r, c = random.randrange(N), random.randrange(N)
        if (r, c) != GOAL:
            break
    s = (r, c)
    traj = []
    for _ in range(MAX_STEPS):
        a = behavior_action()
        s_next, rwd, done = step(s, a)
        traj.append((s, a, rwd))
        s = s_next
        if done: break
    return traj

# -----------------------------
# Off-policy MC control (Weighted IS)
# -----------------------------
def offpolicy_mc_weighted_is(n_episodes=20_000, gamma=GAMMA, seed=42):
    """
    Off-policy MC control with Weighted Importance Sampling.
    Behavior b: uniform random.
    Target π: greedy w.r.t. Q.

    Returns:
      V_mc : (N,N) MC value estimate
      Pi   : (N,N) greedy action indices
      pi_grid : printable policy grid
      Q    : (N,N,4) action-values
      t_mc : wall time (seconds)
    """
    random.seed(seed)
    np.random.seed(seed)

    Q = np.zeros((N, N, 4), dtype=float)
    C = np.zeros((N, N, 4), dtype=float)  # cumulative IS weights

    t0 = time.perf_counter()
    for _ in range(n_episodes):
        episode = generate_episode_b()
        G = 0.0
        W = 1.0
        # backward pass
        for (s, a, r) in reversed(episode):
            G = gamma * G + r
            i, j = s
            C[i, j, a] += W
            Q[i, j, a] += (W / C[i, j, a]) * (G - Q[i, j, a])

            # stop if episode action deviates from current greedy action
            a_star = greedy_action_from_Q(Q, s)
            if a != a_star:
                break

            # importance ratio π(a|s)/b(a|s) = 1 / (1/4) = 4
            W *= 4.0
    t1 = time.perf_counter()

    V = np.max(Q, axis=2)
    Pi = np.argmax(Q, axis=2)

    # pretty policy grid
    grid = []
    for r in range(N):
        row = []
        for c in range(N):
            s = (r, c)
            if s == GOAL:      row.append("G")
            elif s in GREYS:   row.append("X")
            else:              row.append(ARROW[int(Pi[r, c])])
        grid.append(row)

    return V, Pi, grid, Q, (t1 - t0)

# -----------------------------
# Value Iteration (reference V*)
# -----------------------------
def value_iteration_copy_based(theta=1e-6, gamma=GAMMA, max_iters=10_000):
    """
    Copy-based Value Iteration (Problem 3 reference).
    Returns: (V_star, sweeps, wall_time)
    """
    V = np.zeros((N, N), dtype=float)
    t0 = time.perf_counter()
    for it in range(max_iters):
        delta = 0.0
        V_new = V.copy()
        for r in range(N):
            for c in range(N):
                s = (r, c)
                if s == GOAL:
                    V_new[r, c] = 0.0
                    continue
                best = -1e18
                for a in ACTIONS:
                    s2, rwd, done = step(s, a)
                    nr, nc = s2
                    q = rwd + (0.0 if done else gamma * V[nr, nc])
                    if q > best: best = q
                delta = max(delta, abs(best - V[r, c]))
                V_new[r, c] = best
        V = V_new
        if delta < theta:
            t1 = time.perf_counter()
            return V, it + 1, (t1 - t0)
    t1 = time.perf_counter()
    return V, max_iters, (t1 - t0)

# -----------------------------
# Run + print deliverables
# -----------------------------
if __name__ == "__main__":
    np.set_printoptions(precision=6, suppress=True)

    # MC with WIS
    V_mc, Pi_mc, pol_grid, Q_mc, t_mc = offpolicy_mc_weighted_is(
        n_episodes=20_000,  # increase for lower variance (e.g., 50_000)
        seed=42
    )
    print("=== Off-Policy MC (Weighted IS) ===")
    print("Estimated Value Function V_pi(s):")
    print(V_mc)
    print("\nPolicy (► ◄ ▼ ▲; X=grey; G=goal):")
    for row in pol_grid:
        print(row)
    print(f"\nMC runtime (n_episodes=20_000): {t_mc:.4f}s")

    # Value Iteration reference
    V_vi, it_vi, t_vi = value_iteration_copy_based()
    print("\n=== Value Iteration (reference V*) ===")
    print(f"Sweeps: {it_vi}, runtime: {t_vi:.4f}s")
    print(V_vi)

    # Simple quantitative comparison
    mae = np.mean(np.abs(V_vi - V_mc))
    print("\n=== Comparison ===")
    print(f"MAE(|V* - V_MC|): {mae:.6f}")
    print("- VI is model-based; converges in ~9–12 sweeps on this grid.")
    print("- MC is model-free; requires many episodes to reduce variance.")
    print("- Complexity: VI ~ O(|S||A|) per sweep; MC ~ O(episode_length) per episode.")


=== Off-Policy MC (Weighted IS) ===
Estimated Value Function V_pi(s):
[[-0.434877  0.617806  1.798074  3.107557  4.558167]
 [ 0.625742  1.790354  3.112962  4.56834   6.174611]
 [ 1.801372  3.109896  4.570306  6.191116  7.988622]
 [ 3.100723  4.564205  6.187547  7.987846 10.      ]
 [ 4.555934  6.177675  7.99162  10.        0.      ]]

Policy (► ◄ ▼ ▲; X=grey; G=goal):
['▼', '►', '►', '▼', 'X']
['▼', '▼', 'X', '▼', '▼']
['►', '▼', '▼', '►', '▼']
['X', '►', '►', '▼', '▼']
['►', '►', '►', '►', 'G']

MC runtime (n_episodes=20_000): 3.7931s

=== Value Iteration (reference V*) ===
Sweeps: 9, runtime: 0.0014s
[[-0.434062  0.62882   1.8098    3.122     4.58    ]
 [ 0.62882   1.8098    3.122     4.58      6.2     ]
 [ 1.8098    3.122     4.58      6.2       8.      ]
 [ 3.122     4.58      6.2       8.       10.      ]
 [ 4.58      6.2       8.       10.        0.      ]]

=== Comparison ===
MAE(|V* - V_MC|): 0.011815
- VI is model-based; converges in ~9–12 sweeps on this grid.
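- MC is model-free; requires many episodes to reduce variance.
- Complexity: VI ~ O(|S||A|) per sweep; MC ~ O(episode_length) per episode.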

## Comparison with Value Iteration

| Aspect | Value Iteration (VI) | Off-Policy Monte Carlo (MC) |
|--------|----------------------|------------------------------|
| **Approach** | Model-based planning (requires full knowledge of transitions and rewards) | Model-free learning (uses trajectories generated from a behavior policy) |
| **Policy** | Greedy optimal policy derived from Bellman updates | Greedy policy improved from random behavior via Weighted Importance Sampling |
| **Convergence** | Deterministic; converged to within `THETA = 1e-6` in 9 sweeps in this run | Converges slowly; requires thousands of episodes to reduce variance |
| **Optimization Time** | Very fast (about 1–3 ms in total on the 5×5 grid in these runs) | Slower (about 3.8 s for 20,000 episodes in this run); runtime grows with the number of episodes |
| **Sample Efficiency** | Does not use episodes; directly updates all states | Needs many episodes; higher variance especially in off-policy setting |
| **Complexity** | \(O(\lvert S\rvert\,\lvert A\rvert)\) per sweep | \(O(\text{episode length})\) per episode |
| **Accuracy** | \(V^*\) and \(\pi^*\) up to the stopping threshold | Approximates \(V^*\) (MAE ≈ 0.012 in this run); quality improves with more episodes |
| **Flexibility** | Only works if environment model is known | Works from logged or off-policy data, even without knowing the model |

## Conclusion
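- Both algorithms produce an **optimal greedy policy** for the 5×5 gridworld. The printed arrow grids differ in several cells (for example (0,0): ► for VI, ▼ for MC), but only where moving right and moving down are equally good, and the value estimates agree closely (MAE ≈ 0.012).  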
- **Value Iteration** is faster, more efficient, and exact when the model is available.  
- **Off-Policy MC with IS** is more flexible, allowing learning from random or logged trajectories, but it suffers from variance and requires significantly more episodes to stabilize.  
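
The arrow grids printed for VI and MC are not identical cell by cell, so it is worth checking whether the MC arrows are still optimal. The sketch below recomputes \(V^*\) with a compact value iteration and tests, for each regular cell, whether the MC arrow attains the best one-step lookahead value. The helper names `q_value` and `is_optimal` are illustrative, and the arrow grid is copied from the MC output above.

```python
import numpy as np

N, GAMMA = 5, 0.9
GOAL = (4, 4)
GREYS = {(0, 4), (1, 2), (3, 0)}
MOVES = {"►": (0, 1), "◄": (0, -1), "▼": (1, 0), "▲": (-1, 0)}

def q_value(V, r, c, move):
    """One-step lookahead value of taking `move` in state (r, c)."""
    nr, nc = r + move[0], c + move[1]
    if not (0 <= nr < N and 0 <= nc < N):
        nr, nc = r, c  # wall: stay in place
    if (nr, nc) == GOAL:
        return 10.0
    return (-5.0 if (nr, nc) in GREYS else -1.0) + GAMMA * V[nr, nc]

# Reference V* from synchronous value iteration (100 sweeps is more than enough here).
V = np.zeros((N, N))
for _ in range(100):
    V = np.array([[0.0 if (r, c) == GOAL else
                   max(q_value(V, r, c, m) for m in MOVES.values())
                   for c in range(N)] for r in range(N)])

# Arrow grid copied from the MC output (X = grey, G = goal).
MC_POLICY = ["▼►►▼X", "▼▼X▼▼", "►▼▼►▼", "X►►▼▼", "►►►►G"]

def is_optimal(policy, V, tol=1e-9):
    """True if every arrow attains the maximal lookahead value in its cell."""
    for r in range(N):
        for c in range(N):
            arrow = policy[r][c]
            if arrow in ("X", "G"):
                continue
            best = max(q_value(V, r, c, m) for m in MOVES.values())
            if q_value(V, r, c, MOVES[arrow]) < best - tol:
                return False
    return True

print(is_optimal(MC_POLICY, V))
```

If the check passes, the differences between the two grids come down to tie-breaking between equally good actions rather than to estimation error.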
