# **Homework 3: Function-based RL**
#### **Created by 65340500058 Anuwit Intet**

## **Learning Objectives:**

- Understand how function approximation works and how to implement it.

- Understand how policy-based RL works and how to implement it.

- Understand how advanced RL algorithms balance exploration and exploitation.

- Be able to differentiate RL algorithms based on stochastic or deterministic policies, as well as value-based, policy-based, or Actor-Critic approaches.

- Gain insight into different reinforcement learning algorithms, including Linear Q-Learning, Deep Q-Network (DQN), the REINFORCE algorithm, and the Actor-Critic algorithm. Analyze their strengths and weaknesses.


In [None]:
import torch

## <font color="pink">**Part 1: Understanding the Algorithm**</font>

In this homework, you have to implement 4 different function approximation-based RL algorithms:

- Linear Q-Learning
 
- Deep Q-Network (DQN)

- REINFORCE algorithm

- One algorithm chosen from the following Actor-Critic methods:

    - Deep Deterministic Policy Gradient (DDPG)

    - Advantage Actor-Critic (A2C)

    - Proximal Policy Optimization (PPO)
    
    - Soft Actor-Critic (SAC)

For each algorithm, describe:

- whether it follows a value-based, policy-based, or Actor-Critic approach

- the type of policy it learns (stochastic or deterministic)

- the type of observation space and action space (discrete or continuous)

- how it balances exploration and exploitation

### <font color="yellow">**Linear Q-Learning**</font>

- About Linear Q-Learning
  - Linear Q-Learning is a value-based approach. It learns an action-value function $Q(s, a)$ and derives its policy from it by selecting the action with the maximum Q-value.

  - The learned policy is deterministic: the greedy action is $\arg\max_a Q(s,a)$ rather than a sample from a learned probability distribution. The ε-greedy rule only adds randomness during training, for exploration.

  - Linear Q-Learning can be applied to a continuous observation space (because its input is a feature vector), but the action space must be discrete because computing $\max_a Q(s,a)$ requires evaluating every action.

  - To balance exploration and exploitation, Linear Q-Learning uses an ε-greedy policy: a random action with probability ε and the greedy action with probability 1−ε.

- In Linear Q-Learning, the Q-function is estimated by

  $$
  Q(s,a) = \phi(s,a)^T w
  $$

  where:

  - $\phi(s,a)$ is the feature vector of the state-action pair  
  - $w$ is the weight vector

- The weights are updated by

  $$
  w \leftarrow w + \alpha \cdot \delta \cdot \phi(s, a)
  $$

  where: 
  - $\delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a)$ is the TD error
  - $\alpha$ is the learning rate


**🎮 Example: Linear Q-Learning on CartPole**

- **State**: a vector of size 4:  
  $$
  s = [x, \dot{x}, \theta, \dot{\theta}]
  $$
- **Action space**: 2 values (discrete):
  - `0` = push the cart to the left
  - `1` = push the cart to the right

---

We approximate $Q(s, a)$ with a linear function of the state:

$$
Q(s, a) = w_a^T s
$$

where:
- $w_0$, $w_1$ are the weight vectors for actions 0 and 1, respectively
- or, stacked together, a matrix $W \in \mathbb{R}^{2 \times 4}$

---

Given:
- $s = [0.0, 0.5, 0.05, -0.2]$
- $W = \begin{bmatrix} 0.1 & 0.0 & 0.0 & 0.0 \\ 0.0 & 0.1 & 0.0 & 0.0 \end{bmatrix}$
- the chosen action is $a = 1$
- the reward received is $r = 1$
- next state: $s' = [0.01, 0.55, 0.045, -0.18]$
- $\alpha = 0.1$, $\gamma = 0.99$

---

**1. Compute $Q(s, a)$**

$$
Q(s, a=1) = w_1^T s = 0.0*0.0 + 0.1*0.5 + 0.0*0.05 + 0.0*(-0.2) = 0.05
$$

---

**2. Compute $\max_{a'} Q(s', a')$**

$$
Q(s', 0) = w_0^T s' = 0.1*0.01 = 0.001 \\
Q(s', 1) = w_1^T s' = 0.1*0.55 = 0.055 \\
\Rightarrow \max_{a'} Q(s', a') = 0.055
$$

---

**3. Compute the TD error**

$$
\delta = r + \gamma \cdot \max_{a'} Q(s', a') - Q(s, a) \\
= 1 + 0.99 \cdot 0.055 - 0.05 = 1.00445
$$

---

**4. Update the weights**

Only $w_1$ is updated, because the chosen action was $a = 1$:

$$
w_1 \leftarrow w_1 + \alpha \cdot \delta \cdot s \\
= [0.0, 0.1, 0.0, 0.0] + 0.1 \cdot 1.00445 \cdot [0.0, 0.5, 0.05, -0.2] \\
= [0.0, 0.1502, 0.005, -0.0201]
$$
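The four steps above can be checked numerically. The following is a minimal sketch in plain Python (lists instead of tensors, to keep it dependency-free); the helper names `q_value` and `linear_q_update` are hypothetical and not part of the homework code.

```python
def q_value(W, s, a):
    """Q(s, a) = w_a^T s, where W[a] is the weight vector for action a."""
    return sum(w_i * s_i for w_i, s_i in zip(W[a], s))


def linear_q_update(W, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Apply one TD update to W[a] in place and return the TD error."""
    # Greedy bootstrap value over all discrete actions in the next state
    max_next = max(q_value(W, s_next, b) for b in range(len(W)))
    delta = r + gamma * max_next - q_value(W, s, a)
    # Only the weight vector of the action that was taken changes
    W[a] = [w_i + alpha * delta * s_i for w_i, s_i in zip(W[a], s)]
    return delta


# Values taken from the worked example
W = [[0.1, 0.0, 0.0, 0.0], [0.0, 0.1, 0.0, 0.0]]
s = [0.0, 0.5, 0.05, -0.2]
s_next = [0.01, 0.55, 0.045, -0.18]

delta = linear_q_update(W, s, a=1, r=1.0, s_next=s_next)
print(round(delta, 5))              # 1.00445
print([round(w, 4) for w in W[1]])  # [0.0, 0.1502, 0.005, -0.0201]
```

The row `W[0]` is left unchanged, which matches the hand calculation where only $w_1$ is updated.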

---

### <font color="yellow">**Deep Q-Network (DQN)**</font>

🔸 What is DQN?

DQN uses a deep neural network in place of the Q-table, approximating the function $Q(s,a;\theta)$

where:

- $\theta$ are the parameters of the neural network

- input = the state $s$ (e.g., frames from an Atari game)

- output = a Q-value for every action

🔧 Training a DQN relies on 3 main techniques:

- Experience Replay
    - Store each experience $(s, a, r, s', done)$ in a buffer
    - Then sample random mini-batches for training, to decorrelate the data

- Target Network
    - Use a separate network, the target network $Q_{\text{target}}$, to compute the training target: $y = r + \gamma \max_{a'} Q_{\text{target}}(s', a')$
    - Train only the main network $Q$ by gradient descent, and copy its weights into the target network periodically

- Fixed Action Space
    - The action space must be discrete so that $\max_a Q(s,a)$ is easy to compute

🔁 Training loop summary:

- Observe the state $s$

- Select an action from the Q-network with $\epsilon$-greedy

- Take the action → receive reward $r$ and next state $s'$

- Store the transition in the replay buffer

- Sample a batch from the buffer → compute the TD error

- Update $\theta$ by gradient descent
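The "compute the TD error" step of the loop can be sketched as follows. This is a plain-Python illustration, not the homework implementation: the function names are hypothetical, the Q-values are passed in as lists rather than produced by a network, and dropping the bootstrap term on terminal transitions is a common convention assumed here rather than something stated in the notes.

```python
def td_targets(rewards, next_q_target, dones, gamma=0.99):
    """Compute y = r + gamma * max_a' Q_target(s', a') for each transition.

    next_q_target[i] holds the target-network Q-values for next state i.
    Assumption: for terminal transitions (done=True) the bootstrap term is 0.
    """
    targets = []
    for r, q_next, done in zip(rewards, next_q_target, dones):
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(r + bootstrap)
    return targets


def td_errors(q_taken, targets):
    """TD error per transition: target minus the main network's Q(s, a)."""
    return [y - q for q, y in zip(q_taken, targets)]


# A batch of two transitions; the second one ends the episode
targets = td_targets(
    rewards=[1.0, 1.0],
    next_q_target=[[0.001, 0.055], [0.2, 0.1]],
    dones=[False, True],
)
errors = td_errors(q_taken=[0.05, 0.3], targets=targets)
```

In a real DQN the squared (or Huber) TD error would be averaged over the batch and minimized with respect to the main network's parameters only.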

✅ Sub-questions
1. ❓ Does it follow a value-based, policy-based, or Actor-Critic approach?

Answer:
🔹 DQN is a value-based approach because it learns the value function $Q(s,a)$ directly, then derives its policy by selecting the action with the highest Q-value.

2. ❓ What type of policy does it learn (stochastic or deterministic)?

Answer:
🔸 The learned policy is deterministic,
because it selects actions with:
$a^* = \arg\max_a Q(s,a)$

During training, however, it follows an $\epsilon$-greedy policy, which is stochastic, in order to explore.

Therefore:

| Phase | Type of policy |
| --- | --- |
| Training | Stochastic (for exploration) |
| Deployment | Deterministic (greedy action) |

3. ❓ What types of observation space and action space does it support?

Answer:
- **Observation space**: can be continuous, e.g., images, vectors, etc. (processed by the neural network)
- **Action space**: discrete only (because DQN must compute $\max_a Q(s, a)$)

For continuous actions, switch to DDPG, SAC, or TD3 instead.

4. ❓ How does this method balance exploration and exploitation?

Answer:
- DQN balances the two with an ε-greedy strategy:

    - probability of a random action (explore) = $\epsilon$

    - probability of the best-known action (exploit) = $1 - \epsilon$

- $\epsilon$ is adjusted on a schedule, for example:

    - start at $\epsilon = 1.0$

    - decrease gradually to, e.g., $\epsilon = 0.1$ or $\epsilon = 0.01$
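The ε-greedy rule and its schedule can be sketched in a few lines. This is an illustrative version under stated assumptions: the function names and the decay rate of 0.995 are hypothetical, and the multiplicative decay with a floor is the same rule used by `decay_epsilon()` in Part 2.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Return a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


def decayed_epsilon(epsilon, decay_rate=0.995, final_epsilon=0.01):
    """Multiplicative decay that never drops below final_epsilon."""
    return max(final_epsilon, epsilon * decay_rate)


# With epsilon = 0 the choice is always greedy
greedy = select_action([0.001, 0.055], epsilon=0.0)

# Start fully exploratory and decay once per episode
epsilon = 1.0
for _ in range(1000):
    epsilon = decayed_epsilon(epsilon)
```

After enough episodes `epsilon` settles at `final_epsilon`, so a small amount of exploration always remains.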

### <font color="yellow">**REINFORCE algorithm**</font>

### <font color="yellow">**Deep Deterministic Policy Gradient (DDPG)**</font>

### <font color="yellow">**Advantage Actor-Critic (A2C)**</font>

### <font color="yellow">**Proximal Policy Optimization (PPO)**</font>


## ✅ Core Idea of PPO

PPO is one of the most widely used policy gradient algorithms. It was developed by OpenAI with these main goals:

- Keep policy updates stable
- Learn from data collected by the current policy (on-policy)
- Avoid overly aggressive policy updates, which can degrade performance

---

## 🏗 PPO Structure

PPO uses an **Actor-Critic** approach, which consists of:

- **Actor**: learns the policy $\pi_\theta(a|s)$ → used to select actions
- **Critic**: estimates the value of a state or action, e.g., $V(s)$ or $Q(s,a)$ → used to compute the advantage

---

## 🔁 How PPO Works (Step-by-Step)

### 1. **Collect Trajectories**
- Run the agent in the environment under the current policy
- Store: $(s_t, a_t, r_t, \log \pi(a_t|s_t), done)$
- Continue until the rollout is complete (e.g., 2048 steps or 1 episode)

---

### 2. **Compute Returns & Advantages**
- Compute the **Monte Carlo return** or use **GAE (Generalized Advantage Estimation)**:
  
  $$
  A_t = \delta_t + (\gamma \lambda) \delta_{t+1} + \dots \approx R_t - V(s_t)
  $$
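The advantage sum can be computed backwards over a rollout. The sketch below assumes the standard one-step TD residual $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, which the notes do not define explicitly; the function name, argument names, and default values are hypothetical.

```python
def compute_gae(rewards, values, dones, last_value=0.0, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one rollout using GAE(lambda).

    values[t] is the critic's V(s_t); last_value is V of the state after the
    final step (ignored when the final step is terminal).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD residual delta_t
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # A_t = delta_t + (gamma * lambda) * A_{t+1}, reset at episode ends
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# With gamma = lambda = 1 and V = 0, the advantage equals the Monte Carlo return
adv, ret = compute_gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.0, 0.0, 0.0],
    dones=[False, False, True],
    gamma=1.0,
    lam=1.0,
)
```

Setting `lam` below 1 trades some of that Monte Carlo accuracy for lower variance by leaning more on the critic's estimates.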

### 3. Surrogate Objective with Clipping

- PPO uses a surrogate loss function to control how far each policy update can move:
- $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$
- $L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) A_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\, A_t\right)\right]$
- If $r_t$ deviates too far from 1, it is clipped, which prevents the policy from changing too quickly
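To make the clipping concrete, here is a small numeric sketch of the clipped objective. It works from log-probabilities, since those are what the rollout stores; the function name and the clip value of 0.2 are assumptions for illustration only.

```python
import math


def clipped_surrogate(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean of min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t) over a batch."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # r_t = pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Ratio 1.5 with a positive advantage: the gain is capped at 1.2 * A_t
capped = clipped_surrogate([math.log(1.5)], [0.0], [1.0])

# Ratio 1.5 with a negative advantage: the unclipped (worse) term is kept
penalized = clipped_surrogate([math.log(1.5)], [0.0], [-1.0])
```

The objective is maximized, so in practice its negative is used as the actor loss.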

### 4. Update Policy and Value Function

- Update the Actor with the loss from the surrogate objective

- Update the Critic with the MSE loss between $V(s_t)$ and the return

### 5. Repeat Training for Multiple Epochs

- Reuse the same rollout data for several training passes (e.g., 4-10 epochs)

- This improves sample efficiency without a replay buffer
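Reusing one rollout for several epochs usually means reshuffling it into mini-batches on every pass. The generator below is a hypothetical helper sketching that pattern; the name `minibatch_indices` and its parameters are not from the homework code.

```python
import random


def minibatch_indices(rollout_size, batch_size, num_epochs, seed=None):
    """Yield index lists that cover the whole rollout once per epoch."""
    rng = random.Random(seed)
    for _ in range(num_epochs):
        order = list(range(rollout_size))
        rng.shuffle(order)  # new random order on every epoch
        for start in range(0, rollout_size, batch_size):
            yield order[start:start + batch_size]


# 8 stored steps, mini-batches of 4, reused for 2 epochs -> 4 batches in total
batches = list(minibatch_indices(rollout_size=8, batch_size=4, num_epochs=2, seed=0))
```

Each yielded index list would select the stored states, actions, old log-probabilities, advantages, and returns for one actor and critic update.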

### <font color="yellow">**Soft Actor-Critic (SAC)**</font>

## <font color="pink">**Part 2: Setting up Cart-Pole Agent**</font>

Similar to the previous homework, you will implement the common components shared by most of the function approximation-based RL algorithms in RL_base_function.py. The core components should include, but are not limited to:

### <font color="orange">**1. RL Base class**</font>

- This class should include:

    - Constructor (__init__) to initialize the following parameters:

        - Number of actions: The total number of discrete actions available to the agent.

        - Action range: The minimum and maximum values defining the range of possible actions.

        - Discretize state weight: Weighting factor applied when discretizing the state space for learning.

        - Learning rate: Determines how quickly the model updates based on new information.

        - Initial epsilon: The starting probability of taking a random action in an ε-greedy policy.

        - Epsilon decay rate: The rate at which epsilon decreases over time to favor exploitation over exploration.

        - Final epsilon: The lowest value epsilon can reach, ensuring some level of exploration remains.

        - Discount factor: A coefficient (γ) that determines the importance of future rewards in decision-making.

        - Buffer size: Maximum number of experiences the buffer can hold.

        - Batch size: Number of experiences to sample per batch.

    - Core Functions

        - scale_action(): Scales the action (if it is computed from a sigmoid or softmax function) to the proper range.

        - decay_epsilon(): Decreases epsilon over time and returns the updated value.

- Additional details about these functions are provided in the class file. You may also implement additional functions for further analysis.

#### <font color="yellow">**scale_action()**</font>

In [None]:
def scale_action(self, action):
    """
    Maps a discrete action index in [0, num_of_action - 1] to a continuous value in [action_min, action_max].

    Args:
        action (int or torch.Tensor): Discrete action index in [0, num_of_action - 1].
    
    Returns:
        torch.Tensor: Scaled action tensor.
    """
    # ========= put your code here ========= #

    # Unpack the minimum and maximum values of the action range
    action_min, action_max = self.action_range

    # Scale the discrete action index (0 to num_of_action-1) to a continuous value within [action_min, action_max]
    scaled = action_min + (action / (self.num_of_action - 1)) * (action_max - action_min)

    # Check if the scaled value is already a torch.Tensor
    if isinstance(scaled, torch.Tensor):
        # If yes, detach it from any computation graph and convert to float32
        return scaled.clone().detach().to(dtype=torch.float32)
    else:
        # Otherwise, convert it into a torch.Tensor of type float32
        return torch.tensor(scaled, dtype=torch.float32)

    # ====================================== #

#### <font color="yellow">**decay_epsilon()**</font>

In [None]:
def decay_epsilon(self):
    """
    Decay epsilon to reduce exploration over time.

    Returns:
        float: The updated epsilon value.
    """
    # ========= put your code here ========= #
    # Decay the exploration rate (epsilon) by multiplying with epsilon_decay,
    # but ensure it doesn't go below the minimum value (final_epsilon)
    self.epsilon = max(self.final_epsilon, self.epsilon * self.epsilon_decay)
    return self.epsilon
    # ====================================== #

### <font color="orange">**2. Replay Buffer Class**</font>



- A class used to store the state, action, reward, next state, and termination status from each timestep of an episode, for use as a dataset to train neural networks. This class should include:

    - Constructor (__init__) to initialize the following parameters:

        - memory: FIFO buffer to store the trajectory within a certain time window.

        - batch_size: Number of data samples drawn from memory to train the neural network.

    - Core Functions

        - add(): Add the state, action, reward, next state, and termination status to the FIFO buffer, discarding the oldest data when the buffer is full.

        - sample(): Sample data from memory to use in the neural network training.

    - <font color="orange">**Note that some algorithms may not use all of the data mentioned above to train the neural network.**</font>
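One possible shape for such a buffer is sketched below. It is a minimal illustration rather than the required implementation: the `Transition` tuple and the `__len__` method are additions for convenience, and a `deque` with `maxlen` is assumed as the FIFO container.

```python
import random
from collections import deque, namedtuple

# Hypothetical record type for one timestep
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    def __init__(self, buffer_size, batch_size):
        # deque(maxlen=...) drops the oldest item automatically when full
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Transition(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling without replacement; raises ValueError if the
        # buffer holds fewer than batch_size transitions
        return random.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(buffer_size=3, batch_size=2)
for i in range(5):
    buffer.add(state=i, action=0, reward=1.0, next_state=i + 1, done=False)
batch = buffer.sample()
```

After five additions only the three most recent transitions remain, so a training loop should check `len(buffer) >= batch_size` before calling `sample()`.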


#### <font color="yellow">**add()**</font>

#### <font color="yellow">**sample()**</font>

### <font color="orange">**3. Algorithm folder**</font>

- This folder should include:

    - Linear Q Learning class

    - Deep Q-Network class

    - REINFORCE Class

    - One class chosen from Part 1.

- Each class should inherit from the RL Base class in RL_base_function.py and include:

    - A constructor which initializes the same variables as the class it inherits from.

    - Superclass Initialization (super().__init__()).

    - An update() function that updates the agent's learnable parameters and advances the training step.

    - A select_action() function that selects the action according to the current policy.

    - A learn() function that trains the regression model or neural network.

#### <font color="yellow">**Linear Q-Learning class**</font>

#### <font color="yellow">**Deep Q-Network (DQN) class**</font>

#### <font color="yellow">**REINFORCE class**</font>

#### <font color="yellow">**DDPG class**</font>

#### <font color="yellow">**A2C class**</font>

#### <font color="yellow">**PPO class**</font>

#### <font color="yellow">**SAC class**</font>

## <font color="pink">**Part 3: Training & Playing to stabilize Cart-Pole Agent**</font>

You need to implement the training loop in the train script and main() in the play script (in the "Can be modified" area of both files). Additionally, you must collect data, analyze results, and save models for evaluating agent performance.

- Training the Agent

    - Stabilize Cart-Pole Task

        ```bash
        python scripts/Function_based/train.py --task Stabilize-Isaac-Cartpole-v0
        ```

    - Swing-up Cart-Pole Task (Optional)

        ```bash
        python scripts/Function_based/train.py --task SwingUp-Isaac-Cartpole-v0
        ```

- Playing

    - Stabilize Cart-Pole Task

        ```bash
        python scripts/Function_based/play.py --task Stabilize-Isaac-Cartpole-v0
        ```

    - Swing-up Cart-Pole Task (Optional)

        ```bash
        python scripts/Function_based/play.py --task SwingUp-Isaac-Cartpole-v0
        ```

### <font color="yellow">**train.py**</font>

### <font color="yellow">**play.py**</font>

## <font color="pink">**Part 4: Evaluate Cart-Pole Agent performance**</font>

You must evaluate the agent's performance in terms of learning efficiency (i.e., how well the agent learns to receive higher rewards) and deployment performance (i.e., how well the agent performs in the Cart-Pole problem). Analyze and visualize the results to determine:

- Which algorithm performs best?

- Why does it perform better than the others?
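Per-episode rewards are usually noisy, so learning curves are easier to compare after smoothing. The helper below is one simple option, a trailing moving average; the function name and window size are hypothetical, and other smoothing methods would serve equally well.

```python
def moving_average(values, window):
    """Trailing mean over the last `window` values (shorter at the start)."""
    if window <= 0:
        raise ValueError("window must be positive")
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


episode_rewards = [1, 2, 3, 4]
curve = moving_average(episode_rewards, window=2)
print(curve)  # [1.0, 1.5, 2.5, 3.5]
```

Plotting the smoothed curve of each algorithm on the same axes gives a direct view of how quickly each one improves.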


### <font color="yellow">**In terms of learning efficiency**</font>

### <font color="yellow">**In terms of deployment performance**</font>