# CSCN8020 – Assignment 1 

**Course:** Reinforcement Learning Programming (CSCN8020)  
**Student:** Jahnavi Pakanati
**Student ID:** 9013742

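This notebook contains:
- **Problem 1**: MDP design for a pick-and-place robot
- **Problem 2**: two value-iteration sweeps on a 2×2 gridworld
- **Problem 3**: value iteration (standard and in-place) on a 5×5 gridworld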



## Problem 1 – Pick-and-Place Robot (MDP Design)

We model the **pick-and-place** task as a **Markov Decision Process (MDP)**.

##### 1) States (S)
A state should capture the **minimum information** needed to choose good actions:
- Robot arm pose (discretized or continuous), e.g., `(x, y, z)` or `(joint1, joint2, joint3, ...)`
- Gripper: `open` / `closed`
- Object status: `on_table` / `in_gripper` / `at_target`
- (Optional) Velocities for smoothness: `vx, vy, vz`

##### 2) Actions (A)
Primitive, low-level actions that the policy can choose:
- Arm motion: `move_up`, `move_down`, `move_left`, `move_right` (optionally `move_forward`, `move_backward`)
- Gripper: `open_gripper`, `close_gripper`

##### 3) Rewards (R)
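Shape the behavior to be **fast** and **safe** (none of the terms below penalizes jerky motion directly, so smoothness would need an extra term, e.g. on changes in velocity):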
- `+10` when the object is placed at the target location (`at_target`)
- `-1` per time step to discourage unnecessary movement
- `-5` penalty for dropping the object or collisions

> **Reasoning:** This reward design encourages successful completion, penalizes wasted motion, and discourages unsafe behavior. The state and action choices reflect the **control levers** available to the agent and the **task status** it must track.
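
The design above can be written down as plain Python types. This is only a minimal sketch: the class names, the discretized `pose` tuple, and the `dropped` / `collided` event flags are illustrative assumptions, and adding the three reward terms together in a single step is one possible reading of the reward list, not something the problem statement fixes.

```python
from dataclasses import dataclass
from enum import Enum


class Gripper(Enum):
    OPEN = "open"
    CLOSED = "closed"


class ObjectStatus(Enum):
    ON_TABLE = "on_table"
    IN_GRIPPER = "in_gripper"
    AT_TARGET = "at_target"


class Action(Enum):
    MOVE_UP = "move_up"
    MOVE_DOWN = "move_down"
    MOVE_LEFT = "move_left"
    MOVE_RIGHT = "move_right"
    OPEN_GRIPPER = "open_gripper"
    CLOSE_GRIPPER = "close_gripper"


@dataclass(frozen=True)
class State:
    pose: tuple            # assumed: discretized (x, y, z) of the end effector
    gripper: Gripper
    obj: ObjectStatus


def reward(next_state, dropped=False, collided=False):
    """Reward for one transition (assumption: the listed terms are summed)."""
    r = -1.0                                   # per-step cost
    if dropped or collided:
        r -= 5.0                               # unsafe event
    if next_state.obj is ObjectStatus.AT_TARGET:
        r += 10.0                              # task completed
    return r


done = State((2, 3, 0), Gripper.OPEN, ObjectStatus.AT_TARGET)
bumped = State((1, 1, 0), Gripper.CLOSED, ObjectStatus.IN_GRIPPER)
print(reward(done), reward(bumped, collided=True))
```

Keeping `State` frozen makes it hashable, so it can be used directly as a dictionary key in a tabular value function.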


# Problem 2 – 2×2 Gridworld (Two Value-Iteration Sweeps)

**Setup**
- **States:** \(s_1=(0,0),\ s_2=(0,1),\ s_3=(1,0),\ s_4=(1,1)\)  
- **Actions:** up, down, left, right  
- **Transitions:** deterministic if valid; **invalid moves keep you in the same state** (\(s' = s\))  
- **Rewards (per state):** \(R(s_1)=5,\ R(s_2)=10,\ R(s_3)=1,\ R(s_4)=2\)  
- **Discount:** \(\gamma=0.9\)  
- **Update rule (Bellman optimality for this state-reward setting):**  
  \[
  V_{k+1}(s)\ \leftarrow\ R(s)\ +\ \gamma\ \max_{a\in A}\ V_k\!\big(s'(s,a)\big)
  \]

---

## Iteration 0 (Initialization)
All zeros:
| State | \(V_0(s)\) |
|---|---|
| \(s_1\) | 0 |
| \(s_2\) | 0 |
| \(s_3\) | 0 |
| \(s_4\) | 0 |

---

## Iteration 1
Every successor value is 0 under \(V_0\), so:
\[
V_1(s) = R(s) + \gamma\cdot 0 = R(s)
\]

| State | Calculation | \(V_1(s)\) |
|---|---|---|
| \(s_1\) | \(5 + 0.9\cdot 0\) | **5** |
| \(s_2\) | \(10 + 0.9\cdot 0\) | **10** |
| \(s_3\) | \(1 + 0.9\cdot 0\) | **1** |
| \(s_4\) | \(2 + 0.9\cdot 0\) | **2** |

---

## Iteration 2
Look one step ahead using \(V_1\). An invalid move leaves the agent where it is, so staying put is one of the options inside each max.

- **\(s_1\)**: up→\(s_1\) (5), left→\(s_1\) (5), right→\(s_2\) (10), down→\(s_3\) (1)  
  \[
  V_2(s_1)=5 + 0.9\cdot \max(5,5,10,1)=5+0.9\cdot 10=\mathbf{14}
  \]
- **\(s_2\)**: up→\(s_2\) (10), right→\(s_2\) (10), left→\(s_1\) (5), down→\(s_4\) (2)  
  \[
  V_2(s_2)=10 + 0.9\cdot \max(10,10,5,2)=10+0.9\cdot 10=\mathbf{19}
  \]
- **\(s_3\)**: left→\(s_3\) (1), down→\(s_3\) (1), up→\(s_1\) (5), right→\(s_4\) (2)  
  \[
  V_2(s_3)=1 + 0.9\cdot \max(1,1,5,2)=1+0.9\cdot 5=\mathbf{5.5}
  \]
- **\(s_4\)**: right→\(s_4\) (2), down→\(s_4\) (2), up→\(s_2\) (10), left→\(s_3\) (1)  
  \[
  V_2(s_4)=2 + 0.9\cdot \max(2,2,10,1)=2+0.9\cdot 10=\mathbf{11}
  \]

**Values after Iteration 2**

| State | \(V_2(s)\) |
|---|---|
| \(s_1\) | **14.0** |
| \(s_2\) | **19.0** |
| \(s_3\) | **5.5** |
| \(s_4\) | **11.0** |
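
The two sweeps can be checked numerically with a few lines of Python. This is a sketch under the assumption that states are `(row, col)` pairs, so that "up" decreases the row index; the helper names `next_state` and `sweep` are illustrative.

```python
GAMMA = 0.9
# s1=(0,0), s2=(0,1), s3=(1,0), s4=(1,1), with the per-state rewards from the setup
REWARD = {(0, 0): 5.0, (0, 1): 10.0, (1, 0): 1.0, (1, 1): 2.0}
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def next_state(s, a):
    """Deterministic move; an invalid move keeps the agent in s."""
    r, c = s
    dr, dc = MOVES[a]
    nr, nc = r + dr, c + dc
    return (nr, nc) if 0 <= nr < 2 and 0 <= nc < 2 else s


def sweep(V):
    """One synchronous sweep: V_{k+1}(s) = R(s) + gamma * max_a V_k(s'(s, a))."""
    return {
        s: REWARD[s] + GAMMA * max(V[next_state(s, a)] for a in MOVES)
        for s in REWARD
    }


V0 = {s: 0.0 for s in REWARD}
V1 = sweep(V0)
V2 = sweep(V1)
for s in sorted(V2):
    print(s, round(V2[s], 3))
```

Because `sweep` builds a new dictionary from the old one, every state in a sweep sees the previous iterate, which matches the hand calculation above.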

---

## Greedy Policy after Iteration 2
Pick \(a=\arg\max_a \big(R(s)+\gamma V_2(s'(s,a))\big)\).

- \(s_1\): best next is \(s_2\) → **right**  
- \(s_2\): best is to **stay at \(s_2\)** (via an invalid move like up/right)  
- \(s_3\): best next is \(s_1\) → **up**  
- \(s_4\): best next is \(s_2\) → **up**

> **Note:** If exploiting invalid moves to “stay” is **not allowed**, the best *valid* action from \(s_2\) is **left to \(s_1\)** (since \(V_2(s_1)=14\) > \(V_2(s_4)=11\)).
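
The greedy choice, with and without the "stay" loophole, can be sketched as follows. The `allow_stay` flag and the `(row, col)` state encoding are assumptions made for illustration; the `V2` values are copied from the table above.

```python
V2 = {(0, 0): 14.0, (0, 1): 19.0, (1, 0): 5.5, (1, 1): 11.0}
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def next_state(s, a):
    """Deterministic move; an invalid move keeps the agent in s."""
    r, c = s
    dr, dc = MOVES[a]
    nr, nc = r + dr, c + dc
    return (nr, nc) if 0 <= nr < 2 and 0 <= nc < 2 else s


def greedy(V, allow_stay=True):
    """Pick the action whose successor has the largest value.

    R(s) and gamma do not depend on the action, so comparing V(s') is
    equivalent to comparing R(s) + gamma * V(s'). Ties go to the first
    action in MOVES order.
    """
    policy = {}
    for s in V:
        best_a, best_v = None, float("-inf")
        for a in MOVES:
            s2 = next_state(s, a)
            if not allow_stay and s2 == s:
                continue  # skip moves that bump into a wall
            if V[s2] > best_v:
                best_a, best_v = a, V[s2]
        policy[s] = best_a
    return policy


print(greedy(V2))
print(greedy(V2, allow_stay=False))
```

With `allow_stay=True`, the action reported for \(s_2\) is whichever wall-bumping move comes first in `MOVES`; both "up" and "right" are equally good there.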

---



# Problem 3: 5×5 Gridworld (Value Iteration + In-Place Variant)
# ================================
# Specs (from problem statement):
#   • Goal (terminal): s_goal = (4,4) with reward +10 (episode ends after entering goal)
#   • Grey states: {(2,2), (3,0), (0,4)} with reward −5
#   • All other states: −1
#   • Actions: up, down, left, right (deterministic). Invalid moves keep you in place.
#   • Deliverables: value iteration (standard + in-place), V*, π*, comparison (iterations/time)
#
# Notes:
#   • Reward is applied on ARRIVAL to the next state s′ (common “state reward” convention).
#   • Terminal is absorbing: after you reach GOAL, future rewards are 0 and the state does not change.

from time import perf_counter  # wall-clock timer for the comparison below

# ----- Environment definition -----
H, W = 5, 5
GOAL = (4, 4)
GREY = {(2, 2), (3, 0), (0, 4)}
ACTIONS = {
    "up":    (-1,  0),
    "down":  ( 1,  0),
    "left":  ( 0, -1),
    "right": ( 0,  1),
}
gamma = 0.9  # discount factor (can adjust if needed)

def in_bounds(r, c, H=H, W=W):
    return 0 <= r < H and 0 <= c < W

def is_terminal(s):
    return s == GOAL

def reward(s):
    """Reward for being in state s (applied on arrival)."""
    if s == GOAL:
        return +10
    if s in GREY:
        return -5
    return -1

def step(s, a):
    """Deterministic transition; invalid moves keep you in place.
       Returns (s_next, r), where r is reward on ARRIVAL to s_next.
       Terminal is absorbing: no further reward after reaching GOAL."""
    if is_terminal(s):
        return s, 0.0  # absorbing
    r, c = s
    dr, dc = ACTIONS[a]
    nr, nc = r + dr, c + dc
    if in_bounds(nr, nc):
        s2 = (nr, nc)
    else:
        s2 = s  # bump into wall -> stay
    return s2, reward(s2)

STATES = [(r, c) for r in range(H) for c in range(W)]

# ----- Value Iteration (Standard/Synchronous two-array) -----
def value_iteration_synchronous(theta=1e-8, max_iters=10000):
    """Standard VI with a copy. Stops when max change < theta."""
    V = {s: 0.0 for s in STATES}  # often V(goal)=0; reward is earned upon entering goal
    iters = 0
    while True:
        delta = 0.0
        V_new = V.copy()
        for s in STATES:
            if is_terminal(s):
                V_new[s] = 0.0  # terminal value (no future)
                continue
            best = float("-inf")
            for a in ACTIONS:
                s2, r = step(s, a)
                q = r + gamma * V[s2]
                if q > best:
                    best = q
            delta = max(delta, abs(best - V[s]))
            V_new[s] = best
        V = V_new
        iters += 1
        if delta < theta or iters >= max_iters:
            break
    return V, iters

# ----- Value Iteration (In-Place/Gauss–Seidel) -----
def value_iteration_inplace(theta=1e-8, max_iters=10000):
    """In-place VI: reuses updated values within the same sweep. Typically converges faster."""
    V = {s: 0.0 for s in STATES}
    iters = 0
    while True:
        delta = 0.0
        for s in STATES:
            if is_terminal(s):
                V[s] = 0.0
                continue
            best = float("-inf")
            for a in ACTIONS:
                s2, r = step(s, a)
                q = r + gamma * V[s2]  # reuse newly updated V[s2] when available
                if q > best:
                    best = q
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        iters += 1
        if delta < theta or iters >= max_iters:
            break
    return V, iters

# ----- Greedy Policy from V -----
def greedy_policy(V):
    """π*(s) = argmax_a [ r(s→s′) + γ V(s′) ]"""
    pi = {}
    for s in STATES:
        if is_terminal(s):
            pi[s] = "•"  # terminal marker
            continue
        best_a, best_q = None, float("-inf")
        for a in ACTIONS:
            s2, r = step(s, a)
            q = r + gamma * V[s2]
            if q > best_q:
                best_q, best_a = q, a
        pi[s] = best_a
    return pi

# ----- Pretty printing helpers -----
def print_value_grid(V, title="Value Function"):
    print(f"\n{title}")
    for r in range(H):
        row = [f"{V[(r,c)]:7.3f}" for c in range(W)]
        print(" ".join(row))

def print_policy_grid(pi, title="Policy (greedy)"):
    arrows = {"up":"↑", "down":"↓", "left":"←", "right":"→", "•":"•"}
    print(f"\n{title}")
    for r in range(H):
        row = []
        for c in range(W):
            s = (r, c)
            if s in GREY and s != GOAL:
                row.append("X")  # mark grey cells
            else:
                row.append(arrows.get(pi[s], "?"))
        print(" ".join(row))

def l1_diff(Va, Vb):
    return sum(abs(Va[s] - Vb[s]) for s in STATES)

# ----- Run both methods and compare -----
t0 = perf_counter()
V_sync, it_sync = value_iteration_synchronous(theta=1e-8, max_iters=10000)
t1 = perf_counter()

V_inp, it_inp = value_iteration_inplace(theta=1e-8, max_iters=10000)
t2 = perf_counter()

pi_sync = greedy_policy(V_sync)
pi_inp  = greedy_policy(V_inp)

print_value_grid(V_sync, title=f"Standard VI: V* (converged in {it_sync} sweeps, {t1 - t0:.4f}s)")
print_policy_grid(pi_sync, title="Standard VI: π*")

print_value_grid(V_inp, title=f"In-Place VI: V* (converged in {it_inp} sweeps, {t2 - t1:.4f}s)")
print_policy_grid(pi_inp, title="In-Place VI: π*")

print("\n=== Convergence / Equality Check ===")
print(f"Standard VI sweeps: {it_sync}, time: {t1 - t0:.4f}s")
print(f"In-Place  VI sweeps: {it_inp}, time: {t2 - t1:.4f}s")
print(f"L1 difference between V* (should be ~0): {l1_diff(V_sync, V_inp):.8f}")

# Optional: assert near-equality of value functions
# (use a tolerance since floating point can differ slightly)
tol = 1e-6
if l1_diff(V_sync, V_inp) < tol:
    print("OK: V* from both methods match within tolerance.")
else:
    print("Warning: V* mismatch beyond tolerance.")



Standard VI: V* (converged in 9 sweeps, 0.0018s)
 -0.434   0.629   1.810   3.122   4.580
  0.629   1.810   3.122   4.580   6.200
  1.810   3.122   4.580   6.200   8.000
  3.122   4.580   6.200   8.000  10.000
  4.580   6.200   8.000  10.000   0.000

Standard VI: π*
↓ ↓ ↓ ↓ X
↓ ↓ → ↓ ↓
→ ↓ X ↓ ↓
X ↓ ↓ ↓ ↓
→ → → → •

In-Place VI: V* (converged in 9 sweeps, 0.0013s)
 -0.434   0.629   1.810   3.122   4.580
  0.629   1.810   3.122   4.580   6.200
  1.810   3.122   4.580   6.200   8.000
  3.122   4.580   6.200   8.000  10.000
  4.580   6.200   8.000  10.000   0.000

In-Place VI: π*
↓ ↓ ↓ ↓ X
↓ ↓ → ↓ ↓
→ ↓ X ↓ ↓
X ↓ ↓ ↓ ↓
→ → → → •

=== Convergence / Equality Check ===
Standard VI sweeps: 9, time: 0.0018s
In-Place  VI sweeps: 9, time: 0.0013s
L1 difference between V* (should be ~0): 0.00000000
OK: V* from both methods match within tolerance.


- 5×5 grid: goal gives **+10**, grey cells **−5**, all other steps **−1**.  
- Value Iteration repeatedly sets \(V(s)=\max_a\{\text{reward(next)}+\gamma V(\text{next})\}\), making squares nearer the goal worth more and penalizing grey/long routes.  
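- The greedy policy then points along a **shortest path that avoids the grey cells** on the way to the goal.  
- **In-place** VI ends with the **same \(V^*\) and \(\pi^*\)** as standard VI. In this run it needed the same number of sweeps (9 vs 9) and was only marginally faster in wall-clock time (0.0013 s vs 0.0018 s): the row-major sweep starts at \((0,0)\), so the freshly updated values belong to the up and left neighbours, while the optimal actions (down, right) depend on neighbours that have not been updated yet.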
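
The printed sweep counts for the two methods are identical (9 and 9). A likely reason is the sweep order: `STATES` is visited row by row from \((0,0)\), while value information flows outward from the goal at \((4,4)\). The sketch below re-implements the in-place update in compact form (the function name `inplace_vi` and its `order` parameter are illustrative, not part of the cell above) and compares the original order with its reverse.

```python
H, W = 5, 5
GOAL = (4, 4)
GREY = {(2, 2), (3, 0), (0, 4)}
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GAMMA = 0.9


def arrive_reward(s):
    """Reward received on arrival in s (same convention as the cell above)."""
    if s == GOAL:
        return 10.0
    return -5.0 if s in GREY else -1.0


def step(s, a):
    """Deterministic move; bumping into a wall keeps the agent in place."""
    r, c = s
    dr, dc = MOVES[a]
    nr, nc = r + dr, c + dc
    s2 = (nr, nc) if 0 <= nr < H and 0 <= nc < W else s
    return s2, arrive_reward(s2)


def inplace_vi(order, theta=1e-8, max_sweeps=10_000):
    """In-place value iteration visiting states in the given order."""
    V = {s: 0.0 for s in order}
    sweeps = 0
    for sweeps in range(1, max_sweeps + 1):
        delta = 0.0
        for s in order:
            if s == GOAL:
                continue  # terminal state keeps value 0
            best = max(r + GAMMA * V[s2] for s2, r in (step(s, a) for a in MOVES))
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # later states in this sweep see the new value
        if delta < theta:
            break
    return V, sweeps


forward = [(r, c) for r in range(H) for c in range(W)]  # order used above
backward = list(reversed(forward))                      # start next to the goal

V_fwd, sweeps_fwd = inplace_vi(forward)
V_bwd, sweeps_bwd = inplace_vi(backward)
print("forward sweeps:", sweeps_fwd, "backward sweeps:", sweeps_bwd)
print("max |difference|:", max(abs(V_fwd[s] - V_bwd[s]) for s in forward))
```

Starting the sweep next to the goal lets each state reuse the value of its down or right neighbour from the same sweep, so the reversed order should need far fewer sweeps while reaching the same values.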
