# CSCN8020 – Assignment 1 

**Course:** Reinforcement Learning Programming (CSCN8020)  
**Student:** Jahnavi Pakanati  
**Student ID:** 9013742

This notebook contains:
- **Problem 1**: MDP design for a pick-and-place robot  
- **Problem 2**: Two value-iteration sweeps on a 2×2 gridworld

## Problem 1 – Pick-and-Place Robot (MDP Design)

We model the **pick-and-place** task as a **Markov Decision Process (MDP)**.

##### 1) States (S)
A state should capture the **minimum information** needed to choose good actions:
- Robot arm pose (discretized or continuous), e.g., `(x, y, z)` or `(joint1, joint2, joint3, ...)`
- Gripper: `open` / `closed`
- Object status: `on_table` / `in_gripper` / `at_target`
- (Optional) Velocities for smoothness: `vx, vy, vz`
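
The state components listed above could be bundled into a small immutable record. This is only a sketch under stated assumptions: the position is taken to be a discretized end-effector cell, the optional velocities are left out, and the names `RobotState`, `Gripper`, and `ObjectStatus` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Gripper(Enum):
    OPEN = "open"
    CLOSED = "closed"


class ObjectStatus(Enum):
    ON_TABLE = "on_table"
    IN_GRIPPER = "in_gripper"
    AT_TARGET = "at_target"


@dataclass(frozen=True)
class RobotState:
    # Assumption: discretized end-effector position (grid cell indices).
    # Joint angles could be stored here instead.
    x: int
    y: int
    z: int
    gripper: Gripper
    obj: ObjectStatus


# Example start state: arm above the table, gripper open, object not yet picked.
start = RobotState(0, 0, 1, Gripper.OPEN, ObjectStatus.ON_TABLE)
```

Freezing the dataclass makes states hashable, so they can serve as keys in a tabular value function.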

##### 2) Actions (A)
Primitive, low-level actions that the policy can choose:
- Arm motion: `move_up`, `move_down`, `move_left`, `move_right` (optionally `move_forward`, `move_backward`)
- Gripper: `open_gripper`, `close_gripper`
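
A discrete action set like this maps naturally onto an enumeration. The sketch below assumes each motion action shifts the end-effector by one grid cell; the axis convention in `MOTION_DELTAS` and the helper `apply_motion` are hypothetical choices made for illustration.

```python
from enum import Enum


class Action(Enum):
    MOVE_UP = "move_up"
    MOVE_DOWN = "move_down"
    MOVE_LEFT = "move_left"
    MOVE_RIGHT = "move_right"
    MOVE_FORWARD = "move_forward"
    MOVE_BACKWARD = "move_backward"
    OPEN_GRIPPER = "open_gripper"
    CLOSE_GRIPPER = "close_gripper"


# Assumed unit displacement (dx, dy, dz) for each motion action.
MOTION_DELTAS = {
    Action.MOVE_UP: (0, 0, 1),
    Action.MOVE_DOWN: (0, 0, -1),
    Action.MOVE_LEFT: (-1, 0, 0),
    Action.MOVE_RIGHT: (1, 0, 0),
    Action.MOVE_FORWARD: (0, 1, 0),
    Action.MOVE_BACKWARD: (0, -1, 0),
}


def apply_motion(position, action):
    """Return the new (x, y, z); gripper actions leave the position unchanged."""
    dx, dy, dz = MOTION_DELTAS.get(action, (0, 0, 0))
    x, y, z = position
    return (x + dx, y + dy, z + dz)
```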

##### 3) Rewards (R)
Shape the behavior to be **fast** and **smooth**:
- `+10` when the object is placed at the target location (`at_target`)
- `-1` per time step to discourage unnecessary movement
- `-5` penalty for dropping the object or collisions

> **Reasoning:** The `+10` rewards successful completion, the per-step `-1` penalizes wasted motion, and the `-5` discourages unsafe behavior. The actions are the **control levers** available to the agent, and the state holds the **task status** it must track to choose among them.
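
The three reward terms can be combined into a single function. The sketch below assumes the terms are additive, that the step cost also applies on the final step, and that a drop and a collision in the same step incur the `-5` only once; the function name and its boolean flags are hypothetical.

```python
def reward(object_status: str, dropped: bool = False, collided: bool = False) -> float:
    """Per-step reward for the pick-and-place task (illustrative sketch)."""
    r = -1.0  # step cost: shorter episodes score higher
    if dropped or collided:
        r -= 5.0  # safety penalty (assumed to apply once per step)
    if object_status == "at_target":
        r += 10.0  # task completed
    return r
```

Under these assumptions an ordinary step returns `-1.0`, and the step that places the object at the target returns `9.0`.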


# Problem 2 – 2×2 Gridworld (Two Value-Iteration Sweeps)

**Setup**
- **States:** \(s_1=(0,0),\ s_2=(0,1),\ s_3=(1,0),\ s_4=(1,1)\)  
- **Actions:** up, down, left, right  
- **Transitions:** deterministic if valid; **invalid moves keep you in the same state** (\(s' = s\))  
- **Rewards (per state):** \(R(s_1)=5,\ R(s_2)=10,\ R(s_3)=1,\ R(s_4)=2\)  
- **Discount:** \(\gamma=0.9\)  
- **Update rule (Bellman optimality for this state-reward setting):**  
  \[
  V_{k+1}(s)\ \leftarrow\ R(s)\ +\ \gamma\ \max_{a\in A}\ V_k\!\big(s'(s,a)\big)
  \]
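
The setup translates directly into a transition function and a one-state backup. This is a minimal sketch that assumes coordinates are `(row, col)` with row 0 at the top, so `up` decreases the row; the names `next_state` and `backup` are chosen here for illustration.

```python
STATES = {"s1": (0, 0), "s2": (0, 1), "s3": (1, 0), "s4": (1, 1)}
REWARDS = {"s1": 5, "s2": 10, "s3": 1, "s4": 2}
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GAMMA = 0.9

# Reverse lookup from grid coordinate to state name.
COORD_TO_STATE = {coord: name for name, coord in STATES.items()}


def next_state(s, a):
    """Deterministic transition; a move off the grid keeps the agent in s."""
    row, col = STATES[s]
    d_row, d_col = MOVES[a]
    return COORD_TO_STATE.get((row + d_row, col + d_col), s)


def backup(s, V):
    """One Bellman optimality backup: R(s) + gamma * max_a V(s'(s, a))."""
    return REWARDS[s] + GAMMA * max(V[next_state(s, a)] for a in MOVES)
```

For example, `next_state("s1", "up")` returns `"s1"` because the move leaves the grid, while `next_state("s1", "right")` returns `"s2"`.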

---

## Iteration 0 (Initialization)
All zeros:
| State | \(V_0(s)\) |
|---|---|
| \(s_1\) | 0 |
| \(s_2\) | 0 |
| \(s_3\) | 0 |
| \(s_4\) | 0 |

---

## Iteration 1
Every successor state has \(V_0 = 0\), so the update reduces to the immediate reward:
\[
V_1(s) = R(s) + \gamma\cdot 0 = R(s)
\]

| State | Calculation | \(V_1(s)\) |
|---|---|---|
| \(s_1\) | \(5 + 0.9\cdot 0\) | **5** |
| \(s_2\) | \(10 + 0.9\cdot 0\) | **10** |
| \(s_3\) | \(1 + 0.9\cdot 0\) | **1** |
| \(s_4\) | \(2 + 0.9\cdot 0\) | **2** |

---

## Iteration 2
Look one step ahead using \(V_1\). An invalid move leaves the agent where it is, so each state is also one of its own successors.

- **\(s_1\)**: up→\(s_1\) (5), left→\(s_1\) (5), right→\(s_2\) (10), down→\(s_3\) (1)  
  \[
  V_2(s_1)=5 + 0.9\cdot \max(5,5,10,1)=5+0.9\cdot 10=\mathbf{14}
  \]
- **\(s_2\)**: up→\(s_2\) (10), right→\(s_2\) (10), left→\(s_1\) (5), down→\(s_4\) (2)  
  \[
  V_2(s_2)=10 + 0.9\cdot \max(10,10,5,2)=10+0.9\cdot 10=\mathbf{19}
  \]
- **\(s_3\)**: left→\(s_3\) (1), down→\(s_3\) (1), up→\(s_1\) (5), right→\(s_4\) (2)  
  \[
  V_2(s_3)=1 + 0.9\cdot \max(1,1,5,2)=1+0.9\cdot 5=\mathbf{5.5}
  \]
- **\(s_4\)**: right→\(s_4\) (2), down→\(s_4\) (2), up→\(s_2\) (10), left→\(s_3\) (1)  
  \[
  V_2(s_4)=2 + 0.9\cdot \max(2,2,10,1)=2+0.9\cdot 10=\mathbf{11}
  \]

**Values after Iteration 2**

| State | \(V_2(s)\) |
|---|---|
| \(s_1\) | **14.0** |
| \(s_2\) | **19.0** |
| \(s_3\) | **5.5** |
| \(s_4\) | **11.0** |
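
The hand calculation can be cross-checked with a short script. This sketch hard-codes the successor of every state-action pair as listed in Iteration 2 and performs synchronous sweeps, meaning each new table is computed entirely from the previous one; `SUCCESSORS` and `sweep` are illustrative names.

```python
GAMMA = 0.9
REWARDS = {"s1": 5, "s2": 10, "s3": 1, "s4": 2}

# Successor of each (state, action); invalid moves map back to the same state.
SUCCESSORS = {
    "s1": {"up": "s1", "down": "s3", "left": "s1", "right": "s2"},
    "s2": {"up": "s2", "down": "s4", "left": "s1", "right": "s2"},
    "s3": {"up": "s1", "down": "s3", "left": "s3", "right": "s4"},
    "s4": {"up": "s2", "down": "s4", "left": "s3", "right": "s4"},
}


def sweep(V):
    """Return V_{k+1} computed from V_k for all states at once."""
    return {
        s: REWARDS[s] + GAMMA * max(V[nxt] for nxt in SUCCESSORS[s].values())
        for s in SUCCESSORS
    }


V0 = {s: 0.0 for s in SUCCESSORS}
V1 = sweep(V0)
V2 = sweep(V1)

for s in SUCCESSORS:
    print(s, round(V1[s], 4), round(V2[s], 4))
```

Building a fresh dictionary in `sweep` keeps the update synchronous; overwriting `V` in place would let later states see already-updated values and could change the numbers after two sweeps.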

---

## Greedy Policy after Iteration 2
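Pick \(a=\arg\max_a \big(R(s)+\gamma V_2(s'(s,a))\big)\). Because \(R(s)\) and \(\gamma\) do not depend on the action, this amounts to moving to the successor with the largest \(V_2\).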

- \(s_1\): best next is \(s_2\) → **right**  
- \(s_2\): best is to **stay at \(s_2\)** via an invalid move (**up** or **right**)  
- \(s_3\): best next is \(s_1\) → **up**  
- \(s_4\): best next is \(s_2\) → **up**

> **Note:** If exploiting invalid moves to “stay” is **not allowed**, the best *valid* action from \(s_2\) is **left to \(s_1\)** (since \(V_2(s_1)=14\) > \(V_2(s_4)=11\)).
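
Greedy action selection, including the variant that forbids staying in place, can be sketched as follows. The `allow_stay` flag and the function name `greedy_actions` are hypothetical; the function returns every action tied for the best one-step lookahead value rather than breaking ties arbitrarily.

```python
GAMMA = 0.9
REWARDS = {"s1": 5, "s2": 10, "s3": 1, "s4": 2}
V2 = {"s1": 14.0, "s2": 19.0, "s3": 5.5, "s4": 11.0}

# Successor of each (state, action); invalid moves map back to the same state.
SUCCESSORS = {
    "s1": {"up": "s1", "down": "s3", "left": "s1", "right": "s2"},
    "s2": {"up": "s2", "down": "s4", "left": "s1", "right": "s2"},
    "s3": {"up": "s1", "down": "s3", "left": "s3", "right": "s4"},
    "s4": {"up": "s2", "down": "s4", "left": "s3", "right": "s4"},
}


def greedy_actions(s, V, allow_stay=True):
    """Return all actions that maximize R(s) + gamma * V(s'), sorted by name."""
    q = {
        a: REWARDS[s] + GAMMA * V[nxt]
        for a, nxt in SUCCESSORS[s].items()
        if allow_stay or nxt != s  # optionally drop moves that stay in place
    }
    best = max(q.values())
    return sorted(a for a, value in q.items() if value == best)


policy = {s: greedy_actions(s, V2) for s in SUCCESSORS}
valid_only = {s: greedy_actions(s, V2, allow_stay=False) for s in SUCCESSORS}
```

With `allow_stay=False`, only the choice at `s2` changes: it becomes `left`, matching the note above.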

---
