### CSCN8020 Assignment 1: Question 1 (Pick-and-Place MDP Design + Code + Log)

**GOAL**

We model a robot-arm "pick and place" task as a reinforcement learning problem. The agent commands the motors directly and receives feedback on positions and velocities. It should learn movements that are **fast** and **smooth**.

*This notebook includes:*
1) Part A: State-Action diagram 
2) Part B: Formal MDP definition (S, A, 𝜕, P, R, γ) mapped to this task
3) Part C: Key analysis 
4) Code + Log: A small runnable script and the log file it generates

**Modeling Assumptions**

To keep the environment minimal and focused on smooth motion control, the gripper is fixed in the closed state. In a complete pick-and-place formulation, gripper open/close would be modeled as an additional action dimension.


#### **Part A: State–Action diagram view**

Here, the task is represented using **states (s)** and **actions (a)**.

**States (high-level phases)**
- **s1: idle / ready**  
  The robot is at rest and ready to start a task.
- **s2: move_to_object**  
  The arm moves toward the object's location.
- **s3: align_gripper**  
  Fine adjustments are made so the gripper is correctly positioned.
- **s4: grasp_object**  
  The gripper closes to grasp the object.
- **s5: lift_object**  
  The object is lifted off the surface.
- **s6: move_to_target**  
  The arm carries the object toward the target location.
- **s7: place_object**  
  The gripper opens to release the object at the target.
- **s8: return_home**  
  The arm returns to its neutral or home position.
  The return-home behavior is absorbed into the terminal success state.

*Exception / failure states*
- **s9: missed_grasp**  
  The object is not securely grasped.
- **s10: collision_risk**  
  A joint configuration or obstacle proximity is unsafe.
- **s11: slip_or_drop**  
  The object falls during lift or transport.
- **s12: overshoot_or_jitter**  
  The movement becomes unstable or jerky.

**Actions**
- **a1: move_arm**  
  Move the arm toward a desired direction or position.
- **a2: fine_adjust**  
  Perform small corrective movements for alignment.
- **a3: close_gripper**  
  Close the gripper to grasp the object.
- **a4: open_gripper**  
  Open the gripper to release the object.

**Example state–action transitions**

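- 𝜕(s1, a1) = s2 (idle → move_to_object)
- 𝜕(s2, a2) = s3 (move_to_object → align_gripper)
- 𝜕(s3, a3) = s4 (align_gripper → grasp_object)
- 𝜕(s4, a1) = s5 (grasp_object → lift_object)
- 𝜕(s5, a1) = s6 (lift_object → move_to_target)
- 𝜕(s6, a4) = s7 (move_to_target → place_object)
- 𝜕(s7, a1) = s8 (place_object → return_home)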
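
The nominal path can also be written as a lookup table. This is a minimal sketch, assuming the transitions are indexed against the state list s1 to s8 and the action list a1 to a4 above; the names `TRANSITIONS`, `delta`, and `rollout` are illustrative and are not taken from the notebook's code.

```python
# Hypothetical encoding of the nominal phase diagram.
# Keys are (state, action) pairs; values are the next state.
TRANSITIONS = {
    ("s1", "a1"): "s2",  # idle -> move_to_object
    ("s2", "a2"): "s3",  # move_to_object -> align_gripper
    ("s3", "a3"): "s4",  # align_gripper -> grasp_object
    ("s4", "a1"): "s5",  # grasp_object -> lift_object
    ("s5", "a1"): "s6",  # lift_object -> move_to_target
    ("s6", "a4"): "s7",  # move_to_target -> place_object
    ("s7", "a1"): "s8",  # place_object -> return_home
}


def delta(state, action):
    """Return the next phase, or None if the pair is not on the nominal path."""
    return TRANSITIONS.get((state, action))


def rollout(start, actions):
    """Apply actions in order and return the visited phases.

    Stops early if an action is not defined for the current phase.
    """
    path = [start]
    for action in actions:
        nxt = delta(path[-1], action)
        if nxt is None:
            break
        path.append(nxt)
    return path


print(rollout("s1", ["a1", "a2", "a3", "a1", "a1", "a4", "a1"]))
```

Failure states (s9 to s12) are left out of this table because the section does not specify which state-action pairs lead to them.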

**Why this abstraction is useful**

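Real robotic control requires **continuous states and actions** (positions, velocities, motor commands). The discrete phases above summarize that continuous behavior at the task level, which makes the intended sequence and its failure modes easy to read; Part B defines the continuous MDP itself.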

#### **Part B: Formal MDP definition for pick-and-place**

The pick-and-place task is modeled as a **Markov Decision Process (MDP)** defined as:

**(S, A, P, R, γ)**

where **P(s′ | s, a)** represents the transition probability from state *s* to state *s′*
after taking action *a*.

- **S (state):** what the robot observes at a time step (positions, velocities, object and goal information)
- **A (action):** motor-level commands applied by the agent (continuous control)
- **P (transition probability):** how the next state occurs after an action
- **R (reward):** numeric feedback encouraging fast and smooth placement
- **γ (discount factor):** importance of future rewards (e.g., γ = 0.9)
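
To make the role of γ concrete, the sketch below computes a discounted return for a short reward sequence. The reward values are made up for illustration (two small step penalties followed by a success bonus) and are not taken from the notebook's log; `discounted_return` is a hypothetical helper name.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G = r_0 + gamma * r_1 + gamma^2 * r_2 + ... by backward accumulation."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Illustrative rewards only: two step penalties, then a success bonus.
example_rewards = [-1.0, -1.0, 10.0]
print(round(discounted_return(example_rewards), 6))  # 6.2
```

With γ = 0.9 the bonus received two steps later counts for 0.81 of its face value, so finishing sooner yields a higher return.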

In the course slides, a **state transition relation** is also used:

- **𝜕(s, a) = s′**, which conceptually describes the next state resulting from applying action *a* in state *s*.

In deterministic or simulated environments, **𝜕 can be viewed as a special case of P**
where the next state occurs with probability 1.
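
The relationship between the two notations can be sketched in a few lines: any deterministic transition function induces a probability distribution that puts all its mass on one next state. The toy dynamics below (integer states, signed-step actions) are a hypothetical stand-in, not the notebook's environment.

```python
def make_deterministic_P(delta):
    """Wrap a deterministic transition function delta(s, a) as P(s_next | s, a)."""
    def P(s_next, s, a):
        # All probability mass sits on the single state delta(s, a).
        return 1.0 if delta(s, a) == s_next else 0.0
    return P


# Toy dynamics (assumption): states are integers, actions are signed steps.
def toy_delta(s, a):
    return s + a


P = make_deterministic_P(toy_delta)
print(P(3, 1, 2))  # 1.0, because toy_delta(1, 2) == 3
print(P(4, 1, 2))  # 0.0
print(sum(P(s_next, 1, 2) for s_next in range(10)))  # 1.0
```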

The table below summarizes each MDP component as applied to the pick-and-place task.

| MDP Component | Symbol | Definition for Pick-and-Place Task |
|-------------|--------|------------------------------------|
| **State** | **S** | Robot joint positions (q₁…qₙ) and velocities (q̇₁…q̇ₙ), object position (x,y,z), target position (xg,yg,zg), and gripper state (open/closed). Optional derived features include distance(gripper, object) and distance(object, target). |
| **Action** | **A** | Continuous motor-level commands. In this implementation, actions are continuous joint velocity commands (Δq₁…Δqₙ). Gripper actions (open/close) are handled conceptually. |
| **State Transition (conceptual)** | **𝜕** | 𝜕(s, a) = s′ describes how the system moves to the next state after applying an action, following the course state-machine notation. |
| **State Transition (formal)** | **P** | P(s′ \| s, a) represents the probability of transitioning to state s′ after taking action a in state s. In this simulated environment, transitions are deterministic. |
| **Reward** | **R** | Encourages fast and smooth behavior using progress toward target, smoothness penalty (action change), energy penalty (large actions), step penalty (time), and a success bonus. |
| **Discount Factor** | **γ** | γ = 0.9, balancing immediate performance with long-term task completion. |
| **Terminal Conditions** | n/a | Episode ends when the object reaches the target (success) or when the maximum number of steps is reached (timeout). |
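
The reward and terminal rows of the table can be sketched as follows. This is one possible shaping under stated assumptions, not the notebook's implementation: every weight, the success radius, and the step limit are hypothetical placeholders, since this section does not give the actual values.

```python
import math

# Hypothetical constants; the notebook's real values are not stated in this section.
W_PROGRESS = 1.0       # weight on reduction in distance to target
W_SMOOTH = 0.1         # weight on change between consecutive actions
W_ENERGY = 0.01        # weight on action magnitude
STEP_PENALTY = 0.01    # fixed cost per time step
SUCCESS_BONUS = 10.0   # added once when the target is reached
SUCCESS_RADIUS = 0.05  # distance below which the placement counts as success
MAX_STEPS = 200        # timeout


def norm(v):
    """Euclidean norm of a sequence of floats."""
    return math.sqrt(sum(x * x for x in v))


def step_reward(prev_dist, new_dist, action, prev_action):
    """Return (reward, success) for one step.

    Combines the five terms listed in the table: progress, smoothness
    penalty, energy penalty, step penalty, and success bonus.
    """
    progress = prev_dist - new_dist
    smoothness = norm([a - b for a, b in zip(action, prev_action)])
    energy = norm(action)
    reward = (W_PROGRESS * progress
              - W_SMOOTH * smoothness
              - W_ENERGY * energy
              - STEP_PENALTY)
    success = new_dist < SUCCESS_RADIUS
    if success:
        reward += SUCCESS_BONUS
    return reward, success


def is_done(success, step):
    """Episode ends on success or when the step limit is reached."""
    return success or step >= MAX_STEPS
```

A step that makes no progress still pays the step, smoothness, and energy costs, which is what pushes the agent toward short, steady trajectories.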

#### **Conclusion/Analysis**

We tested a simple pick-and-place task by modeling it as a Markov Decision Process (MDP) and running a scripted rollout. The goal was not to train an intelligent agent yet, but to check whether the environment, rewards, and step-by-step behavior work correctly.

The episode finished successfully at step 90, with the object placed very close to the goal (final distance 0.0467). When the task was completed, the system gave a large positive reward, which led to a final total reward of 8.61. This shows that the environment correctly detects success and ends the episode at the right time.

From the distance-to-goal graph, we can see that once the object was picked up, it steadily moved closer to the target without sudden jumps or instability. This tells us that the environment's movement and state updates are smooth and reliable.

The reward plots show small negative rewards during most steps, which encourages the agent to finish the task quickly. Only completing the task gives a strong positive reward. This confirms that the reward design properly discourages unnecessary actions and rewards meaningful completion.

The action-magnitude plot shows larger movements early in the episode and smaller, more careful actions near the goal. This indicates stable and sensible control behavior as the task progresses.

Overall, this rollout confirms that:
* The environment behaves consistently from step to step
* Rewards correctly reflect progress and success
* Episode termination works as expected