### **Proof of concept: Claude can explain the reward functions it generates**

**H0: Claude can explain how its generated reward functions work and also explain the changes it makes to reward functions based on information in its context.**


Answer: Yes

---


Composite Reward Function with Focused Components:

```python
def stabilityReward(observation, action):
    x, xDot, angle, angleDot = observation
    # Focus purely on angle stability
    angle_stability = 1.0 - abs(angle) / 0.209  # 0.209 radians is about 12 degrees
    angular_velocity_component = -abs(angleDot) / 10.0  # Penalize fast angle changes
    return float(angle_stability + angular_velocity_component)
```

```python
def energyEfficiencyReward(observation, action):
    x, xDot, angle, angleDot = observation
    # Focus purely on minimizing movement and energy use
    cart_movement_penalty = -abs(xDot) / 5.0  # Penalize cart velocity
    angular_movement_penalty = -abs(angleDot) / 5.0  # Penalize pole angular velocity
    return float(1.0 + cart_movement_penalty + angular_movement_penalty)
```

```python
def timeBasedReward(observation, action):
    x, xDot, angle, angleDot = observation
    # Simple time-based reward that encourages survival
    base_reward = 1.0
    # Add small penalties for extreme positions/angles to prevent gaming
    if abs(angle) > 0.209 or abs(x) > 2.4:  # If about to fail
        base_reward = 0.0
    return float(base_reward)
```

---

```python
def dynamicRewardFunction(observation, action):
    x, xDot, angle, angleDot = observation
    
    # Get individual rewards
    stability = stabilityReward(observation, action)
    efficiency = energyEfficiencyReward(observation, action)
    timeReward = timeBasedReward(observation, action)
    
    # Combine rewards with equal weights
    return (stability + efficiency + timeReward) / 3.0
```
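
As a quick sanity check of the composite, the sketch below re-declares the three components from above in compact form and evaluates the equal-weight average at two hand-picked states. The sample observations are illustrative assumptions, not values taken from a training run.

```python
def stabilityReward(observation, action):
    x, xDot, angle, angleDot = observation
    return float(1.0 - abs(angle) / 0.209 - abs(angleDot) / 10.0)


def energyEfficiencyReward(observation, action):
    x, xDot, angle, angleDot = observation
    return float(1.0 - abs(xDot) / 5.0 - abs(angleDot) / 5.0)


def timeBasedReward(observation, action):
    x, xDot, angle, angleDot = observation
    return 0.0 if abs(angle) > 0.209 or abs(x) > 2.4 else 1.0


def dynamicRewardFunction(observation, action):
    # Equal-weight average of the three components
    parts = (stabilityReward, energyEfficiencyReward, timeBasedReward)
    return sum(f(observation, action) for f in parts) / 3.0


upright = (0.0, 0.0, 0.0, 0.0)  # centered cart, vertical pole, at rest
failed = (0.0, 0.0, 0.3, 0.0)   # pole tilted past the 0.209 rad limit

print(dynamicRewardFunction(upright, 0))  # 1.0
print(dynamicRewardFunction(failed, 0))
```

In the second state `stabilityReward` is negative, because the angle term drops below zero once the angle exceeds 0.209 rad, so the composite is not confined to [0, 1].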

==================================================

Stability Reward Function Explanation:
Certainly! This stability reward function is designed to encourage the cart-pole system to maintain a stable, upright position. Let's break down the components and explain how they contribute to the goal of stability:

1. Input:
   - The function takes two parameters: `observation` and `action`.
   - `observation` is unpacked into four variables: `x`, `xDot`, `angle`, and `angleDot`.
   - `x`: Position of the cart
   - `xDot`: Velocity of the cart
   - `angle`: Angle of the pole from vertical
   - `angleDot`: Angular velocity of the pole

2. Angle Stability Component:
   ```python
   angle_stability = 1.0 - abs(angle) / 0.209
   ```
   - This component focuses on keeping the pole as close to vertical as possible.
   - `abs(angle)` gives the absolute value of the angle, ensuring symmetrical treatment for both positive and negative angles.
   - 0.209 radians is approximately 12 degrees, which is often used as a termination condition in cart-pole problems.
   - The subtraction from 1.0 inverts the scale, so smaller angles result in higher rewards.
   - This component will be 1.0 when the pole is perfectly vertical (angle = 0) and approach 0 as the angle nears ±0.209 radians.

3. Angular Velocity Component:
   ```python
   angular_velocity_component = -abs(angleDot) / 10.0
   ```
   - This component discourages rapid changes in the pole's angle.
   - `abs(angleDot)` gives the absolute magnitude of the angular velocity.
   - The negative sign ensures that faster angular velocities result in larger penalties (smaller rewards).
   - Division by 10.0 scales this component to be in a similar range as the angle stability component.

4. Combination:
   ```python
   return float(angle_stability + angular_velocity_component)
   ```
   - The function returns the sum of these two components.
   - This combination encourages both a vertical pole position and minimal angular velocity.

5. Focus on Stability:
   - The function ignores the cart's position (`x`) and velocity (`xDot`), focusing solely on the pole's angle and angular velocity.
   - This design prioritizes keeping the pole upright over the cart's position on the track.

6. Why it works for stability:
   - Highest rewards are given when the pole is vertical (angle near 0) and not moving quickly (low angular velocity).
   - As the pole deviates from vertical or starts moving faster, the reward decreases.
   - This incentivizes actions that keep the pole upright and still, which is the essence of stability in the cart-pole system.

In summary, this reward function is tailored specifically for stability by rewarding a vertical pole position and penalizing both angle deviations and rapid angular movements. It disregards the cart's horizontal motion, emphasizing the pole's stability above all else.
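
To make the ranges discussed above concrete, the following sketch evaluates the two terms separately over a few sample values. The sample angles and angular velocities are arbitrary choices for illustration.

```python
def angleStability(angle):
    # 1.0 when vertical, 0.0 at the 0.209 rad (about 12 degree) limit
    return 1.0 - abs(angle) / 0.209


def angularVelocityComponent(angleDot):
    # Always <= 0; grows more negative with faster rotation
    return -abs(angleDot) / 10.0


for angle in (0.0, 0.05, 0.1, 0.209):
    print(f"angle={angle:.3f} -> angle_stability={angleStability(angle):.3f}")

for angleDot in (0.0, 0.5, 1.0, 2.0):
    print(f"angleDot={angleDot:.1f} -> penalty={angularVelocityComponent(angleDot):.3f}")
```

Within the non-terminated range the angle term spans [0, 1], while the velocity term has no lower bound. An angular velocity of 2.0 costs 0.2, the same as tilting the pole by about 0.042 rad (0.2 × 0.209).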

==================================================

In [1]:
import anthropic
import gymnasium as gym
import sys
import os
from pathlib import Path

current_dir = os.getcwd()  
project_root = str(Path(current_dir).parent.parent)
sys.path.append(project_root)

# from ExtraNotebooksCodeExamples.cartPoleShared import *

# anthropicAPI
from AdaptiveRewardFunctionLearning.Prompts.APIQuery import queryAnthropicApi


# API configuration
from AdaptiveRewardFunctionLearning.Prompts.prompts import apiKey, modelName, device

# rewardCompositeGenerator
from AdaptiveRewardFunctionLearning.RewardGeneration.rewardCodeGeneration import createCompositeCode

def main():
    # Get the composite code
    compositeCode = createCompositeCode()
    
    print("Composite Reward Function with Focused Components:")
    print(compositeCode)
    print("\n" + "="*50 + "\n")
    

    # criticPrompts
    # Get explanation for stability reward
    from AdaptiveRewardFunctionLearning.Prompts.criticPrompts import stabilityExplanationMessage

    
    stabilityExplanation = queryAnthropicApi(apiKey, modelName, stabilityExplanationMessage)
    print("Stability Reward Function Explanation:")
    print(stabilityExplanation)
    print("\n" + "="*50 + "\n")

    # logClaudeCall
    from AdaptiveRewardFunctionLearning.Prompts.APIQuery import logClaudeCall
    logClaudeCall(
        rewardPrompt="Composite Reward Function with Focused Components",
        rewardResponse=compositeCode,
        explanationPrompt="Stability Reward Function Explanation",
        explanationResponse=stabilityExplanation,
        logFile="LLMExplainabilityPOC.jsonl"
    )

    
    # Create namespace and execute all functions
    namespace = {}
    exec(compositeCode, globals(), namespace)
    

    # env
    from RLEnvironment.env import CustomCartPoleEnv
    # Set up environment
    env = gym.make("CartPole-v1", render_mode="human")
    env = CustomCartPoleEnv(env)
    
    # Update reward function with the namespace
    for name, func in namespace.items():
        globals()[name] = func
    
    env.setRewardFunction(namespace['dynamicRewardFunction'])
    
    # Initialize and train agent
    stateSize = env.observation_space.shape[0]
    actionSize = env.action_space.n
    
    # DQLearningAgent
    from RLEnvironment.training import DQLearningAgent
    agent = DQLearningAgent(env, stateSize, actionSize, device)
    

    # training
    from RLEnvironment.training import trainDQLearning

    # Training loop
    numEpisodes = 100
    rewards = trainDQLearning(agent, env, episodes=numEpisodes)

    for episode in range(numEpisodes):
        print(f"Episode {episode + 1}/{numEpisodes}, Total Reward: {rewards[episode]}")

    env.close()

if __name__ == "__main__":
    main()

Composite Reward Function with Focused Components:

def stabilityReward(observation, action):
    x, xDot, angle, angleDot = observation
    # Focus purely on angle stability
    angle_stability = 1.0 - abs(angle) / 0.209  # 0.209 radians is about 12 degrees
    angular_velocity_component = -abs(angleDot) / 10.0  # Penalize fast angle changes
    return float(angle_stability + angular_velocity_component)

def energyEfficiencyReward(observation, action):
    x, xDot, angle, angleDot = observation
    # Focus purely on minimizing movement and energy use
    cart_movement_penalty = -abs(xDot) / 5.0  # Penalize cart velocity
    angular_movement_penalty = -abs(angleDot) / 5.0  # Penalize pole angular velocity
    return float(1.0 + cart_movement_penalty + angular_movement_penalty)

def timeBasedReward(observation, action):
    x, xDot, angle, angleDot = observation
    # Simple time-based reward that encourages survival
    base_reward = 1.0
    # Add small penalties for extreme positions/

APITimeoutError: Request timed out.
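
The cell above runs the generated source with `exec(compositeCode, globals(), namespace)` and then copies every function into `globals()`. The copy is needed because functions created by `exec` resolve names through the globals dictionary they were given, not through the separate locals dictionary. A minimal sketch of an alternative, assuming the generated code is trusted: pass a single dictionary so the functions can find each other without touching the notebook's globals. The `generatedSource` string here is a stand-in, not actual `createCompositeCode()` output.

```python
generatedSource = '''
def stabilityReward(observation, action):
    x, xDot, angle, angleDot = observation
    return 1.0 - abs(angle) / 0.209

def dynamicRewardFunction(observation, action):
    # Calls a sibling function defined in the same generated source
    return stabilityReward(observation, action)
'''

# A single dict acts as the global scope of the generated functions
namespace = {}
exec(generatedSource, namespace)

rewardFn = namespace["dynamicRewardFunction"]
print(rewardFn((0.0, 0.0, 0.0, 0.0), 0))  # 1.0
```

With this form, `namespace['dynamicRewardFunction']` can be handed to the environment directly, and the loop that writes into `globals()` would no longer be required.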

*Is reward a good proxy for the effectiveness of the reward function?*

### **Dynamically Updating the Reward Function**


Attempting reward function update at episode 100

Generating new reward function for component 1...

Proposed Function:
Here's a modified reward function with detailed inline comments explaining the changes:

```python
import numpy as np  # needed for np.exp below


def reward_function_1(state):
    """
    Custom reward function focusing on stability (pole angle and angular velocity).
    
    Args:
    state (list): The current state [x, x_dot, theta, theta_dot]
    
    Returns:
    float: The calculated reward
    """
    # Unpack state variables
    x, x_dot, theta, theta_dot = state
    
    # Define target values (center position and upright)
    x_target, theta_target = 0.0, 0.0
    
    # Calculate errors
    x_error = abs(x - x_target)
    theta_error = abs(theta - theta_target)
    
    # Define weight factors (increased emphasis on angle and angular velocity)
    w_x = 0.2  # Reduced from 1.0 to lower importance of cart position
    w_x_dot = 0.1  # Reduced from 0.5 to lower importance of cart velocity
    w_theta = 2.0  # Increased from 1.0 to emphasize pole angle
    w_theta_dot = 1.5  # Increased from 0.5 to emphasize angular velocity

    # Calculate component rewards (using gaussian-like functions for smoother transitions)
    r_x = np.exp(-0.5 * (x_error / 0.5)**2)  # Widened gaussian (0.5 instead of 0.1) for more lenient position reward
    r_x_dot = np.exp(-0.5 * (x_dot / 2.0)**2)  # Slightly widened gaussian for velocity
    r_theta = np.exp(-0.5 * (theta_error / 0.2)**2)  # Narrowed gaussian (0.2 instead of 0.5) for stricter angle reward
    r_theta_dot = np.exp(-0.5 * (theta_dot / 1.0)**2)  # Narrowed gaussian for stricter angular velocity reward

    # Combine component rewards with weights
    reward = w_x * r_x + w_x_dot * r_x_dot + w_theta * r_theta + w_theta_dot * r_theta_dot
    
    # Normalize reward to [0, 1] range
    reward = reward / (w_x + w_x_dot + w_theta + w_theta_dot)
    
    return reward
```

Key changes and explanations:

1. Increased emphasis on stability:
   - Increased weight for theta (w_theta) from 1.0 to 2.0
   - Increased weight for theta_dot (w_theta_dot) from 0.5 to 1.5
   - Decreased weights for x and x_dot to shift focus to pole stability

2. Modified gaussian functions for smoother rewards:
   - Widened the gaussian for x (0.5 instead of 0.1) to be more lenient on cart position
   - Narrowed the gaussian for theta (0.2 instead of 0.5) to be stricter on pole angle
   - Adjusted gaussians for x_dot and theta_dot for better balance

3. Normalized the final reward to ensure it stays in the [0, 1] range, making it easier to interpret and use in learning algorithms.

These changes should result in a reward function that places higher importance on keeping the pole upright and stable, while being more forgiving of the cart's exact position. This should lead to improved stability in the pole balancing task.
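
To illustrate how the Gaussian widths shape each term, the sketch below evaluates the shaping `exp(-0.5 * (error / width)**2)` with the weights and widths from the function above. It uses `math.exp` in place of `np.exp` to stay dependency-free, and the sample states are arbitrary assumptions.

```python
import math


def gaussianTerm(error, width):
    # 1.0 at zero error, exp(-0.5) (about 0.607) when error equals width
    return math.exp(-0.5 * (error / width) ** 2)


# (weight, width) pairs copied from reward_function_1
params = {
    "x": (0.2, 0.5),
    "x_dot": (0.1, 2.0),
    "theta": (2.0, 0.2),
    "theta_dot": (1.5, 1.0),
}


def reward(state):
    # Weighted average of the four Gaussian terms
    total = sum(w * gaussianTerm(abs(state[k]), s) for k, (w, s) in params.items())
    return total / sum(w for w, _ in params.values())


atRest = {"x": 0.0, "x_dot": 0.0, "theta": 0.0, "theta_dot": 0.0}
offCenter = dict(atRest, x=1.0)   # cart displaced, pole upright
tilted = dict(atRest, theta=0.2)  # cart centered, pole tilted

print(reward(atRest))  # 1.0
print(reward(offCenter))
print(reward(tilted))
```

Under these parameters a cart displacement of 1.0 lowers the reward by under 5%, while a tilt of 0.2 rad lowers it by about 21%, which matches the stated intent of favoring pole stability over cart position.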

Waiting 10 seconds before critic evaluation...

Getting critic's evaluation...

Critic Response:
1. The function is properly named 'reward_function_1', which is clear and distinct.

2. The reward calculations appear mathematically sound. The use of Gaussian-like functions for smoother transitions is a good approach, and the normalization of the final reward to the [0, 1] range is appropriate.

3. The function focuses appropriately on stability, particularly emphasizing the pole angle and angular velocity. This is achieved through increased weights for theta and theta_dot, and decreased weights for cart position and velocity.

4. Potential issues:
   - The function might be overly lenient on cart position, which could lead to the cart drifting too far from the center.
   - The narrowed Gaussian for theta might make the task too difficult initially, potentially slowing down learning.
   - The weights and Gaussian parameters are somewhat arbitrary and might need fine-tuning based on empirical results.
   - There's no explicit penalty for failure states (e.g., pole falling past a certain angle), which might be beneficial.

Overall, the function seems well-designed for its purpose, but may require some empirical testing and potential adjustments.

Decision: Yes

Critic Decision: Approved
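
The critic notes that there is no explicit penalty for failure states. One possible way to add that is a wrapper that overrides the reward once the state crosses the failure limits. This is a sketch only: `withFailurePenalty` and the penalty value of -1.0 are hypothetical, and the 0.209 rad and 2.4 limits are borrowed from `timeBasedReward` earlier in this notebook.

```python
def withFailurePenalty(rewardFn, penalty=-1.0, angleLimit=0.209, xLimit=2.4):
    """Wrap a state-only reward function with a fixed penalty on failure."""
    def wrapped(state):
        x, x_dot, theta, theta_dot = state
        if abs(theta) > angleLimit or abs(x) > xLimit:
            return penalty
        return rewardFn(state)
    return wrapped


# Placeholder base reward for demonstration; not the proposed function itself
def baseReward(state):
    return 1.0


shaped = withFailurePenalty(baseReward)
print(shaped([0.0, 0.0, 0.0, 0.0]))  # 1.0
print(shaped([0.0, 0.0, 0.3, 0.0]))  # -1.0
```

Whether such a penalty helps learning is an empirical question that this notebook does not test.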

In [None]:
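import gymnasium as gym

# API configuration (same import as in the first cell)
from AdaptiveRewardFunctionLearning.Prompts.prompts import apiKey, modelName, device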

def main():
    # Initialize with 3 reward components
    env = gym.make("CartPole-v1", render_mode="human")

    # CustomCartPoleEnv
    from RLEnvironment.env import CustomCartPoleEnv
    env = CustomCartPoleEnv(env, numComponents=3)
    
    # RewardUpdateSystem
    from AdaptiveRewardFunctionLearning.Prompts.criticPrompts import RewardUpdateSystem
    # Initialize the update system targeting first component
    updateSystem = RewardUpdateSystem(apiKey=apiKey, modelName=modelName, targetComponent=1)
    
    # Initialize agent
    stateSize = env.observation_space.shape[0]
    actionSize = env.action_space.n

    # DQLearningAgent
    from RLEnvironment.training import DQLearningAgent
    agent = DQLearningAgent(env, stateSize, actionSize, device)

    # updateRewardFunction
    from RLEnvironment.env import updateRewardFunction
    
    # trainDQLearning
    from RLEnvironment.training import trainDQLearning
    agent, env, rewards = trainDQLearning(agent, env, updateSystem, 500, onEpisodeEnd=updateRewardFunction)

    env.close()

if __name__ == "__main__":
    main()

NameError: name 'CustomCartPoleEnv' is not defined