# **Homework 3: Function-based RL**
#### **Created by 65340500058 Anuwit Intet**

## **Learning Objectives:**

- Understand how function approximation works and how to implement it.

- Understand how policy-based RL works and how to implement it.

- Understand how advanced RL algorithms balance exploration and exploitation.

- Be able to differentiate RL algorithms based on stochastic or deterministic policies, as well as value-based, policy-based, or Actor-Critic approaches.

- Gain insight into different reinforcement learning algorithms, including Linear Q-Learning, Deep Q-Network (DQN), the REINFORCE algorithm, and the Actor-Critic algorithm. Analyze their strengths and weaknesses.


In [None]:
import torch

## <font color="pink">**Part 1: Understanding the Algorithm**</font>

In this homework, you have to implement 4 different function approximation-based RL algorithms:

- Linear Q-Learning
 
- Deep Q-Network (DQN)

- REINFORCE algorithm

- One algorithm chosen from the following Actor-Critic methods:

    - Deep Deterministic Policy Gradient (DDPG)

    - Advantage Actor-Critic (A2C)

    - Proximal Policy Optimization (PPO)
    
    - Soft Actor-Critic (SAC)

For each algorithm, describe whether it follows a value-based, policy-based, or Actor-Critic approach, specify the type of policy it learns (stochastic or deterministic), identify the type of observation space and action space (discrete or continuous), and explain how each advanced RL method balances exploration and exploitation.

- whether it follows a value-based, policy-based, or Actor-Critic approach

- the type of policy it learns (stochastic or deterministic)

- the type of observation space and action space (discrete or continuous)

- how each advanced RL method balances exploration and exploitation

### <font color="yellow">**Linear Q-Learning**</font>

- **About Linear Q-Learning**

  - Linear Q-Learning is a value-based approach. It learns an action-value function $Q(s, a)$ and derives its policy by selecting the action with the maximum Q-value.

  - The learned policy is deterministic: the greedy policy takes $\arg\max_a Q(s, a)$ instead of sampling from a learned probability distribution. The ε-greedy rule only injects random actions during training for exploration.

  - Linear Q-Learning can handle a continuous observation space (because its input is a feature vector). The action space must be discrete because the algorithm has to compute $\max_a Q(s, a)$, which requires evaluating every action.

  - To balance exploration and exploitation, Linear Q-Learning uses an ε-greedy policy, i.e. a random action with probability ε and the greedy action with probability 1−ε.

- **In Linear Q-Learning, the Q-function is estimated by**

  $$
  Q(s,a) = \phi(s,a)^T w
  $$

  where:

  - $\phi(s,a)$ is the feature vector of the state-action pair  
  - $w$ is the weight vector

- **The weights are updated by**

  $$
  w \leftarrow w + \alpha \cdot \delta \cdot \phi(s, a)
  $$

  where: 
  - $\delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a)$ is the TD error
  - $\alpha$ is the learning rate
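
The ε-greedy rule described above can be sketched in plain Python. This is a minimal illustration, not the homework's implementation: it assumes one weight row per action and uses the raw state as the feature vector, and the names `linear_q` and `epsilon_greedy` are hypothetical.

```python
import random

def linear_q(weights, state):
    """Return Q(s, a) = w_a^T s for every action a (one weight row per action)."""
    return [sum(w_i * s_i for w_i, s_i in zip(w_a, state)) for w_a in weights]

def epsilon_greedy(weights, state, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(weights))  # explore
    q_values = linear_q(weights, state)
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

# Hypothetical weights for a 2-action, 4-feature problem
W = [[0.1, 0.0, 0.0, 0.0],
     [0.0, 0.1, 0.0, 0.0]]
s = [0.0, 0.5, 0.05, -0.2]
greedy_action = epsilon_greedy(W, s, epsilon=0.0)  # epsilon = 0 always exploits
print(greedy_action)  # 1
```

With `epsilon=1.0` the same function always returns a uniformly random action, which is the pure-exploration extreme.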


##### **Example of Linear Q-Learning in CartPole**

- **State**: Vector of size 4:  
  $$
  s = [x, \dot{x}, \theta, \dot{\theta}]
  $$
- **Action space**: 2 Actions (discrete):

  - `0` = push cart to left

  - `1` = push cart to right

---

We will approximate $Q(s, a)$ with a linear function like this:

$$
Q(s, a) = w_a^T s
$$

where:
- $w_0$, $w_1$ are weight vectors for action 0 and 1 respectively
- or combined into a matrix $W \in \mathbb{R}^{2 \times 4}$

---

- $s = [0.0, 0.5, 0.05, -0.2]$
- $W = \begin{bmatrix} 0.1 & 0.0 & 0.0 & 0.0 \\ 0.0 & 0.1 & 0.0 & 0.0 \end{bmatrix}$
- Chosen action: $a = 1$
- Received reward: $r = 1$
- Next state: $s' = [0.01, 0.55, 0.045, -0.18]$
- $\alpha = 0.1$, $\gamma = 0.99$


**1. Calculate $Q(s, a)$**

$$
Q(s, a=1) = w_1^T s = 0.0*0.0 + 0.1*0.5 + 0.0*0.05 + 0.0*(-0.2) = 0.05
$$

**2. Calculate $\max_{a'} Q(s', a')$**

$$
Q(s', 0) = w_0^T s' = 0.1*0.01 = 0.001 \\
Q(s', 1) = w_1^T s' = 0.1*0.55 = 0.055 \\
\Rightarrow \max_{a'} Q(s', a') = 0.055
$$

**3. Calculate TD Error**

$$
\delta = r + \gamma \cdot \max_{a'} Q(s', a') - Q(s, a) \\
= 1 + 0.99 \cdot 0.055 - 0.05 = 1.00445
$$

**4. Update weight**

Only $w_1$ (the weights of the chosen action) is updated:

$$
w_1 \leftarrow w_1 + \alpha \cdot \delta \cdot s \\
= [0.0, 0.1, 0.0, 0.0] + 0.1 \cdot 1.00445 \cdot [0.0, 0.5, 0.05, -0.2] \\
= [0.0, 0.1502, 0.005, -0.0201]
$$
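
The four steps above can be checked with a short script. This is a sketch under the same assumptions as the worked example (per-action weight rows, raw state as features); the function name `td_update` is hypothetical.

```python
def td_update(weights, s, a, r, s_next, alpha, gamma, done=False):
    """Apply one linear Q-learning step to weights[a] in place and return the TD error."""
    def q(w, x):
        return sum(wi * xi for wi, xi in zip(w, x))

    q_sa = q(weights[a], s)
    # Terminal transitions have no bootstrap term
    max_next = 0.0 if done else max(q(w, s_next) for w in weights)
    delta = r + gamma * max_next - q_sa
    # Only the row of the chosen action is updated
    weights[a] = [wi + alpha * delta * si for wi, si in zip(weights[a], s)]
    return delta

W = [[0.1, 0.0, 0.0, 0.0], [0.0, 0.1, 0.0, 0.0]]
delta = td_update(W, s=[0.0, 0.5, 0.05, -0.2], a=1, r=1.0,
                  s_next=[0.01, 0.55, 0.045, -0.18], alpha=0.1, gamma=0.99)
print(round(delta, 5))               # 1.00445
print([round(w, 4) for w in W[1]])   # [0.0, 0.1502, 0.005, -0.0201]
```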

### <font color="yellow">**Deep Q-Network (DQN)**</font>

- **About Deep Q-Network**

    - Deep Q-Network shares the properties of Linear Q-Learning. The difference lies in the complexity of the function approximator: a single linear layer versus a deep neural network (more than one layer).

    - Deep Q-Network is a value-based approach. It learns an action-value function $Q(s, a)$ and derives its policy by selecting the action with the maximum Q-value.

    - The learned policy is deterministic: the greedy policy takes $\arg\max_a Q(s, a)$ instead of sampling from a learned probability distribution. The ε-greedy rule only injects random actions during training for exploration.

    - Deep Q-Network can handle a continuous observation space (because its input is a feature vector). The action space must be discrete because the algorithm has to compute $\max_a Q(s, a)$, which requires evaluating every action.

    - To balance exploration and exploitation, Deep Q-Network uses an ε-greedy policy, i.e. a random action with probability ε and the greedy action with probability 1−ε.

- A Q-table cannot represent a large or continuous state space. DQN solves this problem by using a deep neural network to replace the Q-table with an approximate function: $Q(s, a; \theta)$.

    Where:

    - θ is the neural network parameter

    - input = state s

    - output = Q value for every action

- DQN training relies on 2 main techniques:

    - Experience Replay
        - Store each experience $(s, a, r, s', done)$ in a buffer
        - Then randomly sample a mini-batch from the buffer for training

    - Target Network
        - Use a separate network, called the target network, to calculate the target: $y = r + \gamma \max_{a'} Q_{\text{target}}(s', a')$
        - Train only the main Q-network by gradient descent, and copy its weights into the target network periodically

##### **Example of DQN in CartPole**

**1. Choose action with $\epsilon$-greedy**

- Suppose $\epsilon = 0.1$ → the chance of a random action is 10%

- Suppose this step falls in the greedy 90% case → use the Q-network to predict:
$$
Q(s, a=0) = 0.4,\quad Q(s, a=1) = 0.6
$$

- Choose **action = 1** (right) because its Q-value is the highest

**2. Send action to environment**

- Got:

  - reward = 1

  - next_state = [0.06, 0.025, -0.015, 0.035]

  - done = False (Not yet failed)

**3. Store transition**

- Store $(s, a=1, r=1, s', done=False)$ in the replay buffer.

**4. Assume that the buffer holds enough transitions → Run one training step**

- Suppose that we sample a mini-batch of 2 transitions at random from the replay buffer:

| Index | State (s)                  | Action (a) | Reward (r) | Next State (s′)             | Done  |
|-------|----------------------------|------------|------------|------------------------------|--------|
| 0     | [0.05, 0.02, -0.01, 0.03]  | 1          | 1.0        | [0.06, 0.025, -0.015, 0.035] | False  |
| 1     | [-0.01, -0.03, 0.02, -0.02] | 0         | 1.0        | [0.00, -0.02, 0.01, -0.01]   | True   |




**5. Calculate Q(s, a) from policy network**

Suppose the policy network gives:

| Index | Q(s, a=0) | Q(s, a=1) |
|-------|-----------|-----------|
| 0 | 0.5 | 0.65 |
| 1 | 0.6 | 0.55 |

Get Q of the selected action:

- Example 0: action = 1 → Q = 0.65

- Example 1: action = 0 → Q = 0.6

**Q(s,a) = [0.65, 0.6]**

**6. Calculate Target Q(s′, a′) from target network**

Suppose Target Network gives:

| Index | Q(s′, a=0) | Q(s′, a=1) | Done |
|-------|------------|------------|--------|
| 0 | 0.4 | 0.6 | False |
| 1 | -- | -- | True |

→ Take max Q(s′, a′) only for non-terminal transitions:

- Index 0: max = 0.6 

- Index 1: terminal → max = 0

**max_next_q_values = [0.6, 0.0]**

**7. Calculate Target Q-value (Bellman target)**

Use the formula:
$$
y = r + \gamma \cdot (1 - \text{done}) \cdot \max Q(s', a')
$$

Let $\gamma = 0.99$:

- Index 0:  $y = 1.0 + 0.99 \cdot 0.6 = 1.594$

- Index 1:  $y = 1.0 + 0 = 1.0$

**Target Q-values = [1.594, 1.0]**

**8. Calculate Loss**

Use the formula:
$$
\text{Loss} = \frac{1}{2} \sum_i (Q(s_i, a_i) - y_i)^2
$$

- $Q(s_i, a_i)$ is the Q-value from the policy network

- $y_i$ is the Bellman target computed with the target network

Substituting the values:

$$
\text{Loss} = \frac{1}{2} \left[ (0.65 - 1.594)^2 + (0.6 - 1.0)^2 \right] \\
= \frac{1}{2} \left[ 0.891 + 0.16 \right] = \frac{1.051}{2} = \mathbf{0.5255}
$$

**Summary Table**

| Index | Q(s, a) | Target y | Loss per item |
|-------|---------|----------|----------------|
| 0     | 0.65    | 1.594    | 0.891          |
| 1     | 0.6     | 1.0      | 0.160          |
|       |         |          | Total = 1.051 |
|       |         |          | Avg = 0.5255  |

**Final Loss = 0.5255**

**9. Do Gradient Descent**

- Call `loss.backward()` to compute gradient

- Call `optimizer.step()` to update policy network weights

**This is where Q-network learns from the latest experience that is randomly selected from the replay buffer**
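
The arithmetic of steps 5 to 8 can be reproduced with a few lines of plain Python. This is only a numeric check of the worked example; the dictionary layout and the name `bellman_target` are hypothetical, and a real implementation would compute the same quantities on tensors before calling `loss.backward()`.

```python
GAMMA = 0.99

# Mini-batch from the worked example: Q(s, a) of the taken action, reward,
# max over a' of the target network's Q(s', a'), and the done flag.
batch = [
    {"q_sa": 0.65, "reward": 1.0, "max_next_q": 0.6, "done": False},
    {"q_sa": 0.60, "reward": 1.0, "max_next_q": 0.0, "done": True},
]

def bellman_target(reward, max_next_q, done, gamma=GAMMA):
    """y = r + gamma * (1 - done) * max_a' Q_target(s', a')."""
    return reward + gamma * (0.0 if done else 1.0) * max_next_q

targets = [bellman_target(b["reward"], b["max_next_q"], b["done"]) for b in batch]
squared_errors = [(b["q_sa"] - y) ** 2 for b, y in zip(batch, targets)]
loss = 0.5 * sum(squared_errors)

print([round(y, 3) for y in targets])  # [1.594, 1.0]
print(round(loss, 3))                  # 0.526
```

The unrounded loss is about 0.5256; the 0.5255 in the table comes from rounding the squared error to 0.891 before summing.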


### <font color="yellow">**REINFORCE algorithm**</font>

- **About MC-REINFORCE Algorithm**

    - MC-REINFORCE (Monte Carlo REINFORCE) is a basic Policy Gradient algorithm. It learns the policy directly instead of estimating a Q-function as Q-learning does.

    - MC-REINFORCE is policy-based because it learns neither a Q-value nor a V-value; it learns the policy $\pi_\theta$ directly (a pure policy gradient method).

    - The policy is stochastic because it uses $\pi_\theta(a|s)$ (e.g. softmax, categorical) rather than argmax, which would be a deterministic policy. For example, with $\pi_\theta(a=0|s) = 0.4$ and $\pi_\theta(a=1|s) = 0.6$, the action is sampled according to these probabilities instead of taking the argmax, which would always give action 1.

    - The observation space is continuous. The action space can be discrete or continuous, depending on the chosen policy form: for discrete actions, apply a softmax to a linear output; for continuous actions, use a Gaussian over a linear output instead.

    - Exploration comes directly from the stochastic policy, because the agent may sample any action at every step. Once the policy has learned that an action yields high reward, the probability of that action rises by itself, which automatically turns into exploitation.

- **MC-REINFORCE uses the Monte Carlo Policy Gradient formula**
    $$
    \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t \right]
    $$


    where:

    - $\pi_\theta(a_t | s_t)$ is the policy (e.g. softmax over a linear output, or a neural net)

    - $G_t$ is the return accumulated from timestep $t$ to the end of the episode

    - The whole trajectory is collected until the episode ends (Monte Carlo) before the update is applied

🧠 Overall procedure:

- Sample a trajectory $(s_0, a_0, r_1, s_1, a_1, r_2, ..., s_T)$ by drawing actions from the policy $\pi_\theta$

- Compute the return $G_t$ from timestep $t$ to the end of the episode

- Compute the gradient: $\nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t$

- Adjust the policy weights with `gradient ascent`
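
The return computation in the second step is usually done with one backward sweep over the episode. A minimal sketch follows, with a hypothetical function name and a made-up reward sequence chosen so the numbers are easy to verify by hand:

```python
def discounted_returns(rewards, gamma):
    """Compute G_t = r_{t+1} + gamma * G_{t+1} by sweeping the episode backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical 3-step episode with gamma = 0.5
print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

Sweeping backwards reuses $G_{t+1}$ at every step, so the whole episode costs one pass instead of one sum per timestep.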

##### **Example of MC-REINFORCE in CartPole**

Use a softmax policy network: $\pi_\theta(a \mid s) = \text{softmax}(Ws)$

Assume a simple initial policy network:

- Use weights $W \in \mathbb{R}^{2 \times 4}$ → 2 actions × 4 state dims

- Current state: $s_0 = [0.1, 0.0, 0.05, -0.02]$

✅ 1. Sample a trajectory from the policy

Suppose the agent samples this trajectory:

| t | State $s_t$ | Action $a_t$ | Reward $r_{t+1}$ | $\pi_\theta(a_t \mid s_t)$ |
|---|-------------|---------------|-------------------|-----------------------------|
| 0 | $s_0$       | 1             | 1                 | 0.6                         |
| 1 | $s_1$       | 0             | 1                 | 0.4                         |
| 2 | $s_2$       | 1             | 1                 | 0.7                         |
| 3 | $s_3$       | 1             | 1                 | 0.8                         |

✅ The episode ends at timestep 4 → reward = 1 at every timestep

✅ 2. Compute the return $G_t$

Let $\gamma = 0.99$

$$
\begin{aligned}
G_3 &= r_4 = 1 \\
G_2 &= r_3 + \gamma G_3 = 1 + 0.99 \cdot 1 = 1.99 \\
G_1 &= r_2 + \gamma G_2 = 1 + 0.99 \cdot 1.99 = 2.9701 \\
G_0 &= r_1 + \gamma G_1 = 1 + 0.99 \cdot 2.9701 = 3.9404
\end{aligned}
$$

| Time t | $G_t$  |
|--------|--------|
| 0      | 3.9404 |
| 1      | 2.9701 |
| 2      | 1.99   |
| 3      | 1.0    |

✅ 3. Compute the gradient at each step

We use:
$$
\nabla_\theta J(\theta) \approx \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t
$$

- Example, Step 0:

    - Suppose the policy network's softmax output is:

    $$
    \pi(a = 0 \mid s_0) = 0.4,\quad \pi(a = 1 \mid s_0) = 0.6
    $$

    - Choose $a_0 = 1$

    - This gives:

    $$
    \nabla_\theta \log \pi_\theta(1 \mid s_0) = \nabla_\theta \log(0.6)
    $$

    - Multiply by the return:

    $$
    \nabla_\theta J \leftarrow \nabla_\theta \log(0.6) \cdot 3.9404
    $$

    - Written as a loss term to minimize: $-\log(0.6) \cdot 3.9404 = -(-0.5108) \cdot 3.9404 = 2.013$

- Step 1:

    - $\log \pi(0 \mid s_1) = \log(0.4) \approx -0.9163$

    - $-(-0.9163) \cdot 2.9701 = 2.722$

- Step 2:

    - $\log \pi(1 \mid s_2) = \log(0.7) \approx -0.3567$

    - $-(-0.3567) \cdot 1.9900 = 0.709$

- Step 3:

    - $\log \pi(1 \mid s_3) = \log(0.8) \approx -0.2231$

    - $-(-0.2231) \cdot 1.0000 = 0.223$

- Sum of all loss terms: Total Loss $= 2.013 + 2.722 + 0.709 + 0.223 = 5.667$

📌 Summary

| t   | $\log \pi(a_t \mid s_t)$ | $G_t$   | $-\log \pi \cdot G_t$ |
|-----|---------------------------|---------|-------------------------|
| 0   | -0.5108                   | 3.9404  | 2.013                   |
| 1   | -0.9163                   | 2.9701  | 2.722                   |
| 2   | -0.3567                   | 1.9900  | 0.709                   |
| 3   | -0.2231                   | 1.0000  | 0.223                   |
|     |                           |         | **Total: 5.667**        |

**4. Do Gradient Ascent**

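- Call `loss.backward()` to compute the gradient of the loss $-\sum_t \log \pi_\theta(a_t \mid s_t) \cdot G_t$ (minimizing this loss is gradient ascent on $J$)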

- Call `optimizer.step()` to update policy network weights
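
The total loss from the summary table can be checked numerically. This sketch only evaluates the scalar loss from the probabilities and returns given in the example; in an actual implementation the log-probabilities would come from the policy network so that `loss.backward()` can propagate gradients.

```python
import math

# (probability of the sampled action, return G_t) for t = 0..3, from the tables above
steps = [(0.6, 3.9404), (0.4, 2.9701), (0.7, 1.99), (0.8, 1.0)]

# REINFORCE loss: minimizing -sum(log pi * G) performs gradient ascent on J
loss_terms = [-math.log(p) * g for p, g in steps]
total_loss = sum(loss_terms)

print(round(total_loss, 3))  # 5.667
```

Individual terms may differ from the table in the last digit because the table rounds $\log \pi$ to four decimals before multiplying.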


### <font color="yellow">**Deep Deterministic Policy Gradient (DDPG)**</font>

### <font color="yellow">**Advantage Actor-Critic (A2C)**</font>

### <font color="yellow">**Proximal Policy Optimization (PPO)**</font>

## ✅ Core idea of PPO

PPO is one of the most popular Policy Gradient algorithms. It was developed by OpenAI, and its main goals are to:

- keep policy updates stable
- learn from data generated by the current policy (on-policy)
- avoid overly large policy updates, which can degrade performance

---

## 🏗 PPO structure

PPO follows the **Actor-Critic** approach, which consists of:

- **Actor**: learns the policy $\pi_\theta(a|s)$ → used to select actions
- **Critic**: estimates the value of a state or action, e.g. $V(s)$ or $Q(s,a)$ → used to compute the advantage

---

## 🔁 How PPO works (Step-by-Step)

### 1. **Collect Trajectories**
- Run the agent in the environment under the current policy
- Store the data: $(s_t, a_t, r_t, \log \pi(a_t|s_t), done)$
- Wait until the rollout is complete (e.g. 2048 steps or 1 episode)

---

### 2. **Compute Returns & Advantages**
- Compute the **Monte Carlo return** or use **GAE (Generalized Advantage Estimation)**:

  $$
  A_t = \delta_t + (\gamma \lambda) \delta_{t+1} + \dots \approx R_t - V(s_t)
  $$

  where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD residual.
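
The GAE sum can be computed with one backward pass over a rollout. This is a minimal sketch for a single rollout without early termination; the function name `gae_advantages` and its argument layout are assumptions made for illustration.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout without early termination.

    rewards[t] is the reward after step t, values[t] is V(s_t), and
    last_value is V of the state that follows the final step (0 if terminal).
    """
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running              # discounted sum of residuals
        advantages[t] = running
        next_value = values[t]
    return advantages

# With gamma = lambda = 1 the estimate reduces to R_t - V(s_t)
print(gae_advantages([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], last_value=0.0,
                     gamma=1.0, lam=1.0))  # [2.5, 1.5, 0.5]
```

Lower values of `lam` weight near-term residuals more heavily, trading variance for bias.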

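### 3. **Surrogate Objective with Clipping**

- PPO uses a surrogate loss function to limit how far each update moves the policy:

  $$
  r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
  $$

  $$
  L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) A_t,\ \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) A_t \right) \right]
  $$

- If $r_t$ deviates too far from 1, it is clipped to prevent the policy from changing too quickly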
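
The clipping behaviour is easiest to see on single samples. The sketch below evaluates the clipped objective for one transition from log-probabilities; the function name and the default `clip_eps=0.2` are assumptions, and a real implementation would average this over a mini-batch of tensors and minimize its negative.

```python
import math

def clipped_surrogate(new_logp, old_logp, advantage, clip_eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(new_logp - old_logp)  # r_t(theta)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Ratio 1.5 with a positive advantage is capped at 1 + eps = 1.2
print(round(clipped_surrogate(math.log(0.6), math.log(0.4), advantage=2.0), 6))   # 2.4
# Ratio 0.5 with a negative advantage is bounded by 1 - eps = 0.8
print(round(clipped_surrogate(math.log(0.2), math.log(0.4), advantage=-1.0), 6))  # -0.8
```

In both cases the objective stops rewarding further movement of the ratio away from 1, which is what keeps a single update from changing the policy too much.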

### 4. **Update Policy and Value Function**

- Update the Actor with the loss from the surrogate objective

- Update the Critic with the MSE loss between $V(s_t)$ and the return

### 5. **Repeat Training for Multiple Epochs**

- The same rollout data can be reused for several training passes (e.g. 4-10 epochs)

- This improves sample efficiency without requiring a replay buffer

### <font color="yellow">**Soft Actor-Critic (SAC)**</font>

## <font color="pink">**Part 2: Setting up Cart-Pole Agent**</font>

Similar to the previous homework, you will implement the common components shared by most function approximation-based RL algorithms in `RL_base_function.py`. The core components should include, but are not limited to:

### <font color="orange">**1. RL Base class**</font>

- This class should include:

    - Constructor (__init__) to initialize the following parameters:

        - Number of actions: The total number of discrete actions available to the agent.

        - Action range: The minimum and maximum values defining the range of possible actions.

        - Discretize state weight: Weighting factor applied when discretizing the state space for learning.

        - Learning rate: Determines how quickly the model updates based on new information.

        - Initial epsilon: The starting probability of taking a random action in an ε-greedy policy.

        - Epsilon decay rate: The rate at which epsilon decreases over time to favor exploitation over exploration.

        - Final epsilon: The lowest value epsilon can reach, ensuring some level of exploration remains.

        - Discount factor: A coefficient (γ) that determines the importance of future rewards in decision-making.

        - Buffer size: Maximum number of experiences the buffer can hold.

        - Batch size: Number of experiences to sample per batch.

    - Core Functions

        - scale_action(): scale the action (if it is computed from the sigmoid or softmax function) to the proper range.

        - decay_epsilon(): Decreases epsilon over time and returns the updated value.

- Additional details about these functions are provided in the class file. You may also implement additional functions for further analysis.

#### <font color="yellow">**scale_action()**</font>

In [None]:
def scale_action(self, action):
    """
    Maps a discrete action index in [0, num_of_action - 1] to a continuous value
    in [action_min, action_max].

    Args:
        action (int or torch.Tensor): Discrete action index in [0, num_of_action - 1].

    Returns:
        torch.Tensor: Scaled action as a float32 tensor.
    """
    # ========= put your code here ========= #

    # Unpack the minimum and maximum values of the action range
    action_min, action_max = self.action_range

    # Number of intervals between the first and last index (at least 1 to avoid division by zero)
    intervals = max(self.num_of_action - 1, 1)

    # Scale the discrete action index linearly into [action_min, action_max]
    scaled = action_min + (action / intervals) * (action_max - action_min)

    # Check if the scaled value is already a torch.Tensor
    if isinstance(scaled, torch.Tensor):
        # If yes, detach it from any computation graph and convert to float32
        return scaled.clone().detach().to(dtype=torch.float32)
    else:
        # Otherwise, convert it into a torch.Tensor of type float32
        return torch.tensor(scaled, dtype=torch.float32)

    # ====================================== #
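
The linear mapping used by `scale_action()` can be tried in isolation. The helper below is a plain-Python stand-in without `torch` or the class; its name and the 5-action setup over [-1.0, 1.0] are hypothetical.

```python
def index_to_continuous(action, num_of_action, action_min, action_max):
    """Linearly map a discrete index in [0, num_of_action - 1] onto [action_min, action_max]."""
    return action_min + (action / (num_of_action - 1)) * (action_max - action_min)

# Hypothetical setup: 5 discrete actions spread over [-1.0, 1.0]
print([index_to_continuous(a, 5, -1.0, 1.0) for a in range(5)])
# [-1.0, -0.5, 0.0, 0.5, 1.0]
```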

#### <font color="yellow">**decay_epsilon()**</font>

In [None]:
def decay_epsilon(self):
    """
    Decay epsilon value to reduce exploration over time.

    Returns:
        float: The updated epsilon.
    """
    # ========= put your code here ========= #
    # Decay the exploration rate (epsilon) by multiplying with epsilon_decay,
    # but ensure it doesn't go below the minimum value (final_epsilon)
    self.epsilon = max(self.final_epsilon, self.epsilon * self.epsilon_decay)
    return self.epsilon
    # ====================================== #
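
To see how the multiplicative decay behaves over time, the same rule can be run as a standalone schedule. The function name and the hyperparameters below are hypothetical and chosen only so the floor is reached quickly.

```python
def epsilon_schedule(initial_epsilon, epsilon_decay, final_epsilon, steps):
    """Return epsilon after each of `steps` multiplicative decays, floored at final_epsilon."""
    epsilon, history = initial_epsilon, []
    for _ in range(steps):
        epsilon = max(final_epsilon, epsilon * epsilon_decay)
        history.append(epsilon)
    return history

# Start at 1.0, halve each step, never go below 0.1
print(epsilon_schedule(1.0, 0.5, 0.1, steps=5))
# [0.5, 0.25, 0.125, 0.1, 0.1]
```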

### <font color="orange">**2. Replay Buffer Class**</font>



- A class used to store the state, action, reward, next state, and termination status from each timestep of an episode, to serve as a dataset for training neural networks. This class should include:

    - Constructor (__init__) to initialize the following parameters:

        - memory: FIFO buffer to store the trajectory within a certain time window.

        - batch_size: Number of data samples drawn from memory to train the neural network.

    - Core Functions

        - add(): Add state, action, reward, next state, and termination status to the FIFO buffer. When the buffer is full, discard the oldest data.

        - sample(): Sample data from memory to use in the neural network training.

    - <font color="orange">**Note that some algorithms may not use all of the data mentioned above to train the neural network.**</font>
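
A minimal sketch of such a buffer, using only the standard library, is shown here. It is an illustration of the FIFO and sampling behaviour, not the required implementation: the `Transition` tuple is a hypothetical container, and the homework version would typically store or return tensors.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-size FIFO memory of transitions with uniform random sampling."""

    def __init__(self, buffer_size, batch_size):
        self.memory = deque(maxlen=buffer_size)  # oldest items drop out automatically
        self.batch_size = batch_size

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Transition(state, action, reward, next_state, done))

    def sample(self):
        """Draw batch_size transitions uniformly without replacement."""
        return random.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(buffer_size=3, batch_size=2)
for step in range(5):
    buffer.add(state=step, action=0, reward=1.0, next_state=step + 1, done=False)
print([t.state for t in buffer.memory])  # [2, 3, 4]
```

Using `deque(maxlen=...)` gives the discard-oldest behaviour for free, so `add()` needs no explicit size check.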


#### <font color="yellow">**add()**</font>

#### <font color="yellow">**sample()**</font>

### <font color="orange">**3. Algorithm folder**</font>

- This folder should include:

    - Linear Q Learning class

    - Deep Q-Network class

    - REINFORCE Class

    - One class chosen from the Part 1.

- Each class should inherit from the RL Base class in RL_base_function.py and include:

    - A constructor which initializes the same variables as the class it inherits from.

    - Superclass Initialization (super().__init__()).

    - An update() function that updates the agent’s learnable parameters and advances the training step.

    - A select_action() function that selects the action according to the current policy.

    - A learn() function that trains the regression model or neural network.

#### <font color="yellow">**Linear Q-Learning class**</font>

#### <font color="yellow">**Deep Q-Network (DQN) class**</font>

#### <font color="yellow">**REINFORCE class**</font>

#### <font color="yellow">**DDPG class**</font>

#### <font color="yellow">**A2C class**</font>

#### <font color="yellow">**PPO class**</font>

#### <font color="yellow">**SAC class**</font>

## <font color="pink">**Part 3: Training & Playing to stabilize Cart-Pole Agent**</font>

You need to implement the training loop in the train script and main() in the play script (in the "Can be modified" area of both files). Additionally, you must collect data, analyze results, and save models for evaluating agent performance.

- Training the Agent

    - Stabilize Cart-Pole Task

        ```bash
        python scripts/Function_based/train.py --task Stabilize-Isaac-Cartpole-v0
        ```

    - Swing-up Cart-Pole Task (Optional)

        ```bash
        python scripts/Function_based/train.py --task SwingUp-Isaac-Cartpole-v0
        ```

- Playing

    - Stabilize Cart-Pole Task

        ```bash
        python scripts/Function_based/play.py --task Stabilize-Isaac-Cartpole-v0
        ```

    - Swing-up Cart-Pole Task (Optional)

        ```bash
        python scripts/Function_based/play.py --task SwingUp-Isaac-Cartpole-v0
        ```

### <font color="yellow">**train.py**</font>

### <font color="yellow">**play.py**</font>

## <font color="pink">**Part 4: Evaluate Cart-Pole Agent performance**</font>

You must evaluate the agent's performance in terms of learning efficiency (i.e., how well the agent learns to receive higher rewards) and deployment performance (i.e., how well the agent performs in the Cart-Pole problem). Analyze and visualize the results to determine:

- Which algorithm performs best?

- Why does it perform better than the others?
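
When comparing learning curves, raw episode rewards are usually too noisy to read directly, so a moving average is a common first step. The helper below is one possible sketch; the function name and the reward values are made up for illustration.

```python
def moving_average(values, window):
    """Average each value with up to window - 1 preceding values."""
    averaged, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        averaged.append(running / min(i + 1, window))
    return averaged

# Hypothetical episode rewards
print(moving_average([1.0, 3.0, 5.0, 7.0], window=2))  # [1.0, 2.0, 4.0, 6.0]
```

Applying the same window to every algorithm's reward log keeps the curves comparable.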


### <font color="yellow">**In terms of learning efficiency**</font>

### <font color="yellow">**In terms of deployment performance**</font>