# Reinforcement Learning Programming - CSCN 8020

## Assignment 1

### Exercise 1

### Done by ***Eris Leksi***

Problem 1 [10]

Pick-and-Place Robot: Consider using reinforcement learning to control the motion of a robot arm
in a repetitive pick-and-place task. If we want to learn movements that are fast and smooth, the
learning agent will have to control the motors directly and obtain feedback about the current positions
and velocities of the mechanical linkages.
Design the reinforcement learning problem as an MDP, define states, actions, rewards with reasoning.

## 1. State Space (S)
The state should capture robot dynamics, task progress, and safety:

- Joint angles **qₜ** (pose)  
- Joint velocities **q̇ₜ** (smoothness)  
- Gripper state **gₜ ∈ {0,1}** (open/closed)  
- Object pose **p_obj** and goal pose **p_goal**  
- Holding flag **hₜ ∈ {0,1}** (object grasped or not)  
- Collision flag **cₜ ∈ {0,1}** (safety)  
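
The state components listed above can be collected into a single observation vector. The following is a minimal sketch, assuming a two-joint arm and position-only poses; the `ArmState` class and its field names are hypothetical and only illustrate the layout.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ArmState:
    """Hypothetical container for the state variables listed above."""
    joint_angles: List[float]                 # q_t, radians
    joint_velocities: List[float]             # q̇_t, rad/s
    gripper_closed: bool                      # g_t
    object_pose: Tuple[float, float, float]   # p_obj (position only, an assumption)
    goal_pose: Tuple[float, float, float]     # p_goal (position only, an assumption)
    holding: bool                             # h_t
    collided: bool                            # c_t

    def to_vector(self) -> List[float]:
        """Flatten the state into the observation vector fed to the policy."""
        return [
            *self.joint_angles,
            *self.joint_velocities,
            float(self.gripper_closed),
            *self.object_pose,
            *self.goal_pose,
            float(self.holding),
            float(self.collided),
        ]


# Example: a two-joint arm that has not yet grasped the object.
state = ArmState([0.0, 0.5], [0.1, -0.1], False,
                 (0.3, 0.0, 0.1), (0.6, 0.2, 0.1), False, False)
obs = state.to_vector()
```

Including the velocities and both flags keeps the state Markov: the next state depends only on the current observation and action, not on earlier history.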


## 2. Action Space (A)
The agent directly controls the robot through:

- **Joint torques** or **velocity commands** (continuous)  
- **Gripper command** (open/close)  

This allows the agent to trade off speed and smoothness.
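
As an illustration, a raw policy output can be mapped to a bounded action as follows. This is a sketch under stated assumptions: the torque limit `MAX_TORQUE`, the `ArmAction` class, and the rule that a positive gripper signal means "close" are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import List

MAX_TORQUE = 2.0  # assumed per-joint torque limit (N·m)


@dataclass
class ArmAction:
    """Hypothetical action: continuous torques plus a binary gripper command."""
    joint_torques: List[float]
    close_gripper: bool


def clip_action(raw_torques, gripper_signal, max_torque=MAX_TORQUE):
    """Clip torques to the actuator limits and threshold the gripper signal."""
    torques = [max(-max_torque, min(max_torque, t)) for t in raw_torques]
    return ArmAction(torques, gripper_signal > 0.0)


# Example: the first torque exceeds the limit and is clipped.
action = clip_action([3.5, -0.4], gripper_signal=0.8)
```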


## 3. Transition Dynamics (P)
- Joint states update according to robot physics with noise.  
- Gripper closing near object → holding flag set.  
- Gripper opening at goal → successful placement.  
- Collisions trigger safety flag and may end the episode.  
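
The discrete part of these transitions (the holding flag and the placement event) can be sketched as below. The physics update with noise is left out, and the thresholds `GRASP_RADIUS` and `GOAL_RADIUS` and the function `update_flags` are assumptions made for illustration.

```python
import math

GRASP_RADIUS = 0.02  # assumed: gripper must be within 2 cm of the object to grasp
GOAL_RADIUS = 0.03   # assumed: object must be within 3 cm of the goal to count as placed


def update_flags(gripper_pos, object_pos, goal_pos, holding, close_gripper):
    """Return (holding, placed) after applying the gripper command."""
    placed = False
    if close_gripper and not holding:
        # Closing near the object sets the holding flag.
        holding = math.dist(gripper_pos, object_pos) <= GRASP_RADIUS
    elif not close_gripper and holding:
        # Opening releases the object; it counts as placed only at the goal.
        holding = False
        placed = math.dist(object_pos, goal_pos) <= GOAL_RADIUS
    return holding, placed


# Example: closing the gripper 1 cm from the object results in a grasp.
grasp_result = update_flags((0.0, 0.0, 0.0), (0.01, 0.0, 0.0),
                            (1.0, 0.0, 0.0), holding=False, close_gripper=True)
```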


## 4. Reward Function (R)
Encourages **fast, smooth, safe** task completion:

- **Task rewards**:  
  - +1 for successful pick  
  - +2 for successful place at goal  

- **Shaping terms**:  
  - Small step penalty → faster completion  
  - Distance penalty → move the gripper toward the object before the grasp, and the object toward the goal after it  
  - Energy/torque penalty → discourage inefficiency  
  - Jerk penalty → encourage smooth actions  
  - Large penalty for collisions 
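
One possible way to combine these terms into a single per-step reward is sketched below. The weights and the collision penalty are illustrative assumptions, not tuned values, and the change in commanded torque between consecutive steps is used as a simple proxy for jerk.

```python
def reward(picked, placed, collided, distance, torques, prev_torques,
           w_step=0.01, w_dist=0.1, w_energy=0.001, w_jerk=0.001,
           collision_penalty=5.0):
    """Per-step reward; all weights are assumed values for illustration."""
    r = 0.0
    if picked:
        r += 1.0   # successful pick
    if placed:
        r += 2.0   # successful place at goal
    r -= w_step                                    # step penalty: finish faster
    r -= w_dist * distance                         # distance to object or goal
    r -= w_energy * sum(t * t for t in torques)    # energy/torque penalty
    r -= w_jerk * sum((t - p) ** 2                 # jerk proxy: torque change
                      for t, p in zip(torques, prev_torques))
    if collided:
        r -= collision_penalty                     # large safety penalty
    return r


# Example: a placement step with zero torque and zero remaining distance.
r_place = reward(picked=False, placed=True, collided=False, distance=0.0,
                 torques=[0.0, 0.0], prev_torques=[0.0, 0.0])
```

Keeping the shaping weights small relative to the +1 and +2 task rewards helps ensure the agent is not rewarded more for standing still than for completing the task.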

## 5. Discount Factor (γ)
- **γ ∈ [0.95, 0.995]**  
- High enough that the delayed place reward still matters, yet below 1 so that quicker completion is worth more.  
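
To make the range concrete, the effective horizon 1/(1 − γ) is about 20 steps for γ = 0.95 and about 200 steps for γ = 0.995. The sketch below computes this, along with how much a delayed reward is worth at the start of an episode; the 100-step delay is an assumed example, not a property of the task.

```python
def effective_horizon(gamma):
    """Approximate number of steps over which rewards still matter."""
    return 1.0 / (1.0 - gamma)


def discounted_value(reward, steps, gamma):
    """Present value of a reward received `steps` steps in the future."""
    return reward * gamma ** steps


# Compare both ends of the suggested range for the +2 place reward,
# assuming it arrives 100 steps into the episode.
for gamma in (0.95, 0.995):
    horizon = effective_horizon(gamma)
    value = discounted_value(2.0, 100, gamma)
    print(f"gamma={gamma}: horizon ~{horizon:.0f} steps, place reward worth {value:.3f}")
```

With the lower value the place reward is almost invisible from the start of a long episode, which is why longer tasks call for γ closer to the upper end of the range.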

## 6. Terminal Conditions
- **Success**: object placed at goal and released  
- **Failure**: collision or time limit exceeded  
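
These conditions can be checked once per step. The sketch below assumes the flags are already available from the environment; the function name `episode_status` and its string labels are hypothetical.

```python
def episode_status(placed, collided, step, max_steps):
    """Classify the current step as success, failure, or still running."""
    if placed:
        return "success"              # object released at the goal
    if collided:
        return "failure: collision"
    if step >= max_steps:
        return "failure: time limit"
    return "running"


# Example: no placement or collision yet, but the step budget is used up.
status = episode_status(placed=False, collided=False, step=200, max_steps=200)
```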


## 7. Pseudocode (SAC Training Loop)

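```python
def train_sac(env, agent, buffer, num_episodes, max_steps,
              warmup, updates_per_step, batch_size, eval_every):
    """SAC training loop sketch.

    `env`, `agent`, and `buffer` are assumed interfaces: the environment
    exposes reset()/step(), and the agent and replay buffer provide the
    methods called below.
    """
    for episode in range(num_episodes):
        state = env.reset()
        for t in range(max_steps):
            # SAC explores by sampling from its stochastic policy,
            # so no separate exploration noise is added.
            action = agent.sample_action(state)
            next_state, reward, done = env.step(action)
            buffer.add(state, action, reward, next_state, done)

            # Start learning only once the buffer holds enough transitions.
            if len(buffer) > warmup:
                for _ in range(updates_per_step):
                    batch = buffer.sample(batch_size)
                    agent.update_critics(batch)      # Q-function updates
                    agent.update_policy(batch)       # actor update
                    agent.update_target_networks()   # soft target update

            state = next_state
            if done:
                break

        # Periodically evaluate the policy without sampling noise.
        if (episode + 1) % eval_every == 0:
            agent.evaluate(env)
```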