### **Proof of concept: Claude can explain the reward functions it generates**

**H0: Claude can explain how its generated reward functions work, and can explain the changes it makes to reward functions based on information in its context.**


Answer: Yes

---


Composite Reward Function with Focused Components:

```python
def stabilityReward(observation, action):
    x, xDot, angle, angleDot = observation
    # Focus purely on angle stability
    angle_stability = 1.0 - abs(angle) / 0.209  # 0.209 radians is about 12 degrees
    angular_velocity_component = -abs(angleDot) / 10.0  # Penalize fast angle changes
    return float(angle_stability + angular_velocity_component)
```

```python
def energyEfficiencyReward(observation, action):
    x, xDot, angle, angleDot = observation
    # Focus purely on minimizing movement and energy use
    cart_movement_penalty = -abs(xDot) / 5.0  # Penalize cart velocity
    angular_movement_penalty = -abs(angleDot) / 5.0  # Penalize pole angular velocity
    return float(1.0 + cart_movement_penalty + angular_movement_penalty)
```

```python
def timeBasedReward(observation, action):
    x, xDot, angle, angleDot = observation
    # Simple time-based reward that encourages survival
    base_reward = 1.0
    # Add small penalties for extreme positions/angles to prevent gaming
    if abs(angle) > 0.209 or abs(x) > 2.4:  # If about to fail
        base_reward = 0.0
    return float(base_reward)
```

---

```python
def dynamicRewardFunction(observation, action):
    x, xDot, angle, angleDot = observation
    
    # Get individual rewards
    stability = stabilityReward(observation, action)
    efficiency = energyEfficiencyReward(observation, action)
    timeReward = timeBasedReward(observation, action)
    
    # Combine rewards with equal weights
    return (stability + efficiency + timeReward) / 3.0
```
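
As a quick sanity check of the composite above, the sketch below restates the four functions in condensed but equivalent form and evaluates them at two hand-picked observations. The observations are illustrative assumptions, not values taken from a training run.

```python
def stabilityReward(observation, action):
    x, xDot, angle, angleDot = observation
    return float(1.0 - abs(angle) / 0.209 - abs(angleDot) / 10.0)

def energyEfficiencyReward(observation, action):
    x, xDot, angle, angleDot = observation
    return float(1.0 - abs(xDot) / 5.0 - abs(angleDot) / 5.0)

def timeBasedReward(observation, action):
    x, xDot, angle, angleDot = observation
    return 0.0 if abs(angle) > 0.209 or abs(x) > 2.4 else 1.0

def dynamicRewardFunction(observation, action):
    # Equal-weight average of the three components
    return (stabilityReward(observation, action)
            + energyEfficiencyReward(observation, action)
            + timeBasedReward(observation, action)) / 3.0

# Hypothetical observations: (x, xDot, angle, angleDot)
upright = (0.0, 0.0, 0.0, 0.0)
tilted = (0.0, 0.0, 0.1045, 0.5)  # half the 0.209 rad limit, moderate spin

for name, obs in [("upright", upright), ("tilted", tilted)]:
    print(name, round(dynamicRewardFunction(obs, action=0), 4))
```

The upright, motionless state scores 1.0, the maximum of every component. The tilted state scores about 0.78: stability drops to 0.45, efficiency to 0.9, and the time-based term stays at 1.0 because the pole is still inside the limits.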

==================================================

Stability Reward Function Explanation:
Certainly! This stability reward function is designed to encourage the cart-pole system to maintain a stable, upright position. Let's break down the components and explain how they contribute to the goal of stability:

1. Input:
   - The function takes two parameters: `observation` and `action`.
   - `observation` is unpacked into four variables: `x`, `xDot`, `angle`, and `angleDot`.
   - `x`: Position of the cart
   - `xDot`: Velocity of the cart
   - `angle`: Angle of the pole from vertical
   - `angleDot`: Angular velocity of the pole

2. Angle Stability Component:
   ```python
   angle_stability = 1.0 - abs(angle) / 0.209
   ```
   - This component focuses on keeping the pole as close to vertical as possible.
   - `abs(angle)` gives the absolute value of the angle, ensuring symmetrical treatment for both positive and negative angles.
   - 0.209 radians is approximately 12 degrees, which is often used as a termination condition in cart-pole problems.
   - The subtraction from 1.0 inverts the scale, so smaller angles result in higher rewards.
   - This component will be 1.0 when the pole is perfectly vertical (angle = 0) and approach 0 as the angle nears ±0.209 radians.

3. Angular Velocity Component:
   ```python
   angular_velocity_component = -abs(angleDot) / 10.0
   ```
   - This component discourages rapid changes in the pole's angle.
   - `abs(angleDot)` gives the absolute magnitude of the angular velocity.
   - The negative sign ensures that faster angular velocities result in larger penalties (smaller rewards).
   - Division by 10.0 scales this component to be in a similar range as the angle stability component.

4. Combination:
   ```python
   return float(angle_stability + angular_velocity_component)
   ```
   - The function returns the sum of these two components.
   - This combination encourages both a vertical pole position and minimal angular velocity.

5. Focus on Stability:
   - The function ignores the cart's position (`x`) and velocity (`xDot`), focusing solely on the pole's angle and angular velocity.
   - This design prioritizes keeping the pole upright over the cart's position on the track.

6. Why it works for stability:
   - Highest rewards are given when the pole is vertical (angle near 0) and not moving quickly (low angular velocity).
   - As the pole deviates from vertical or starts moving faster, the reward decreases.
   - This incentivizes actions that keep the pole upright and still, which is the essence of stability in the cart-pole system.

In summary, this reward function is tailored specifically for stability by rewarding a vertical pole position and penalizing both angle deviations and rapid angular movements. It disregards the cart's horizontal motion, emphasizing the pole's stability above all else.
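
To make the relative scale of the two terms concrete, the sketch below splits `stabilityReward` into its components and compares them. The helper name `stability_terms` and the sample values are assumptions made for illustration.

```python
import math

ANGLE_LIMIT = 0.209     # rad, the divisor used for the angle term
VELOCITY_SCALE = 10.0   # divisor applied to |angleDot|

def stability_terms(angle, angleDot):
    """Return the two terms of stabilityReward separately."""
    angle_stability = 1.0 - abs(angle) / ANGLE_LIMIT
    angular_velocity_component = -abs(angleDot) / VELOCITY_SCALE
    return angle_stability, angular_velocity_component

# Penalty incurred by one degree of tilt, and the angular speed that costs the same
one_degree = math.radians(1.0)
tilt_penalty = one_degree / ANGLE_LIMIT
equivalent_speed = tilt_penalty * VELOCITY_SCALE

print(f"1 degree of tilt costs {tilt_penalty:.3f} reward")
print(f"the same cost as an angular speed of {equivalent_speed:.2f} rad/s")

# The sum is not bounded below: at the angle limit, any spin makes it negative
a, v = stability_terms(ANGLE_LIMIT, 2.0)
print(f"at the limit with angleDot=2.0: {a + v:.2f}")
```

Under these numbers, one degree of tilt costs roughly as much as 0.84 rad/s of angular speed, so the angle term dominates at low speeds. The reward is also not confined to [0, 1], since the angular velocity penalty has no lower bound.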

==================================================

In [1]:
import os

import anthropic
import gymnasium as gym
from cartPoleShared import *

# API configuration: read the key from the environment instead of hardcoding it
apiKey = os.environ["ANTHROPIC_API_KEY"]
modelName = "claude-3-5-sonnet-20240620"

def createCompositeCode():
    stabilityFunc = """
def stabilityReward(observation, action):
    x, xDot, angle, angleDot = observation
    # Focus purely on angle stability
    angle_stability = 1.0 - abs(angle) / 0.209  # 0.209 radians is about 12 degrees
    angular_velocity_component = -abs(angleDot) / 10.0  # Penalize fast angle changes
    return float(angle_stability + angular_velocity_component)
"""

    efficiencyFunc = """
def energyEfficiencyReward(observation, action):
    x, xDot, angle, angleDot = observation
    # Focus purely on minimizing movement and energy use
    cart_movement_penalty = -abs(xDot) / 5.0  # Penalize cart velocity
    angular_movement_penalty = -abs(angleDot) / 5.0  # Penalize pole angular velocity
    return float(1.0 + cart_movement_penalty + angular_movement_penalty)
"""

    timeFunc = """
def timeBasedReward(observation, action):
    x, xDot, angle, angleDot = observation
    # Simple time-based reward that encourages survival
    base_reward = 1.0
    # Add small penalties for extreme positions/angles to prevent gaming
    if abs(angle) > 0.209 or abs(x) > 2.4:  # If about to fail
        base_reward = 0.0
    return float(base_reward)
"""

    dynamicFunc = """
def dynamicRewardFunction(observation, action):
    x, xDot, angle, angleDot = observation
    
    # Get individual rewards
    stability = stabilityReward(observation, action)
    efficiency = energyEfficiencyReward(observation, action)
    timeReward = timeBasedReward(observation, action)
    
    # Combine rewards with equal weights
    return (stability + efficiency + timeReward) / 3.0
"""
    return stabilityFunc + efficiencyFunc + timeFunc + dynamicFunc

def main():
    # Get the composite code
    compositeCode = createCompositeCode()
    
    print("Composite Reward Function with Focused Components:")
    print(compositeCode)
    print("\n" + "="*50 + "\n")
    
    # Get explanation for stability reward
    stabilityExplanationMessage = [
        {
            "role": "user",
            "content": """Please explain the logic behind this focused stability reward function:

            def stabilityReward(observation, action):
                x, xDot, angle, angleDot = observation
                angle_stability = 1.0 - abs(angle) / 0.209
                angular_velocity_component = -abs(angleDot) / 10.0
                return float(angle_stability + angular_velocity_component)

            Explain why this function focuses specifically on stability and how each component contributes to this goal."""
        }
    ]
    
    stabilityExplanation = queryAnthropicApi(apiKey, modelName, stabilityExplanationMessage)
    print("Stability Reward Function Explanation:")
    print(stabilityExplanation)
    print("\n" + "="*50 + "\n")
    
    # Create namespace and execute all functions
    namespace = {}
    exec(compositeCode, globals(), namespace)
    
    # Set up environment
    env = gym.make("CartPole-v1", render_mode="human")
    env = CustomCartPoleEnv(env)
    
    # Update reward function with the namespace
    for name, func in namespace.items():
        globals()[name] = func
    
    env.setRewardFunction(namespace['dynamicRewardFunction'])
    
    # Initialize and train agent
    stateSize = env.observation_space.shape[0]
    actionSize = env.action_space.n
    agent = DQLearningAgent(env, stateSize, actionSize, device)
    
    # Training loop
    numEpisodes = 100
    for episode in range(numEpisodes):
        observation = env.reset()[0]
        totalReward = 0
        done = False
        
        while not done:
            action = agent.chooseAction(observation)
            nextObservation, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            totalReward += reward
            
            agent.remember(observation, action, reward, nextObservation, done)
            observation = nextObservation
            
        agent.replay(32)
        print(f"Episode {episode + 1}/{numEpisodes}, Total Reward: {totalReward}")
    
    env.close()

if __name__ == "__main__":
    main()


*Is reward a good proxy for the effectiveness of the reward function?*

### **Dynamically Updating the Reward Function**


Attempting reward function update at episode 100

Generating new reward function for component 1...

Proposed Function:
Here's a modified reward function with detailed inline comments explaining the changes:

```python
def reward_function_1(state):
    """
    Custom reward function focusing on stability (pole angle and angular velocity).
    
    Args:
    state (list): The current state [x, x_dot, theta, theta_dot]
    
    Returns:
    float: The calculated reward
    """
    # Unpack state variables
    x, x_dot, theta, theta_dot = state
    
    # Define target values (center position and upright)
    x_target, theta_target = 0.0, 0.0
    
    # Calculate errors
    x_error = abs(x - x_target)
    theta_error = abs(theta - theta_target)
    
    # Define weight factors (increased emphasis on angle and angular velocity)
    w_x = 0.2  # Reduced from 1.0 to lower importance of cart position
    w_x_dot = 0.1  # Reduced from 0.5 to lower importance of cart velocity
    w_theta = 2.0  # Increased from 1.0 to emphasize pole angle
    w_theta_dot = 1.5  # Increased from 0.5 to emphasize angular velocity

    # Calculate component rewards (using gaussian-like functions for smoother transitions)
    r_x = np.exp(-0.5 * (x_error / 0.5)**2)  # Widened gaussian (0.5 instead of 0.1) for more lenient position reward
    r_x_dot = np.exp(-0.5 * (x_dot / 2.0)**2)  # Slightly widened gaussian for velocity
    r_theta = np.exp(-0.5 * (theta_error / 0.2)**2)  # Narrowed gaussian (0.2 instead of 0.5) for stricter angle reward
    r_theta_dot = np.exp(-0.5 * (theta_dot / 1.0)**2)  # Narrowed gaussian for stricter angular velocity reward

    # Combine component rewards with weights
    reward = w_x * r_x + w_x_dot * r_x_dot + w_theta * r_theta + w_theta_dot * r_theta_dot
    
    # Normalize reward to [0, 1] range
    reward = reward / (w_x + w_x_dot + w_theta + w_theta_dot)
    
    return reward
```

Key changes and explanations:

1. Increased emphasis on stability:
   - Increased weight for theta (w_theta) from 1.0 to 2.0
   - Increased weight for theta_dot (w_theta_dot) from 0.5 to 1.5
   - Decreased weights for x and x_dot to shift focus to pole stability

2. Modified gaussian functions for smoother rewards:
   - Widened the gaussian for x (0.5 instead of 0.1) to be more lenient on cart position
   - Narrowed the gaussian for theta (0.2 instead of 0.5) to be stricter on pole angle
   - Adjusted gaussians for x_dot and theta_dot for better balance

3. Normalized the final reward to ensure it stays in the [0, 1] range, making it easier to interpret and use in learning algorithms.

These changes should result in a reward function that places higher importance on keeping the pole upright and stable, while being more forgiving of the cart's exact position. This should lead to improved stability in the pole balancing task.

Waiting 10 seconds before critic evaluation...

Getting critic's evaluation...

Critic Response:
1. The function is properly named 'reward_function_1', which is clear and distinct.

2. The reward calculations appear mathematically sound. The use of Gaussian-like functions for smoother transitions is a good approach, and the normalization of the final reward to the [0, 1] range is appropriate.

3. The function focuses appropriately on stability, particularly emphasizing the pole angle and angular velocity. This is achieved through increased weights for theta and theta_dot, and decreased weights for cart position and velocity.

4. Potential issues:
   - The function might be overly lenient on cart position, which could lead to the cart drifting too far from the center.
   - The narrowed Gaussian for theta might make the task too difficult initially, potentially slowing down learning.
   - The weights and Gaussian parameters are somewhat arbitrary and might need fine-tuning based on empirical results.
   - There's no explicit penalty for failure states (e.g., pole falling past a certain angle), which might be beneficial.

Overall, the function seems well-designed for its purpose, but may require some empirical testing and potential adjustments.

Decision: Yes

Critic Decision: Approved
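
The effect of the Gaussian widths discussed in the proposal and the critique can be checked numerically. This is a minimal sketch: it uses `math.exp` in place of `np.exp` to stay dependency-free, and the sample tilt of 0.1 rad is an assumption chosen for illustration.

```python
import math

def gaussian_reward(error, sigma):
    """Gaussian-shaped reward: 1.0 at zero error, decaying with width sigma."""
    return math.exp(-0.5 * (error / sigma) ** 2)

theta = 0.1  # rad, an illustrative tilt of roughly 5.7 degrees

old = gaussian_reward(theta, sigma=0.5)  # width the proposal says it replaced
new = gaussian_reward(theta, sigma=0.2)  # width used in reward_function_1

print(f"sigma=0.5 -> {old:.3f}")
print(f"sigma=0.2 -> {new:.3f}")

# Angle component at the usual CartPole termination angle (about 0.209 rad)
at_limit = gaussian_reward(0.209, sigma=0.2)
print(f"at the termination angle -> {at_limit:.3f}")
```

With the narrower width, the same tilt earns about 0.88 instead of 0.98, which is the stricter angle reward the proposal describes. The angle component is still about 0.58 at the termination angle, which is consistent with the critic's remark that there is no explicit penalty for failure states.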

In [2]:
import os

import anthropic
import gymnasium as gym
from cartPoleShared import *
import time
import matplotlib.pyplot as plt
import io
import base64
import numpy as np
from collections import deque

# API configuration: read the key from the environment instead of hardcoding it
apiKey = os.environ["ANTHROPIC_API_KEY"]
modelName = "claude-3-5-sonnet-20240620"

class RewardUpdateSystem:
    def __init__(self, apiKey: str, modelName: str, maxHistoryLength: int = 100, target_component: int = 1):
        self.apiKey = apiKey
        self.modelName = modelName
        self.rewardHistory = deque(maxlen=maxHistoryLength)
        self.episodeCount = 0
        self.lastUpdateEpisode = 0
        self.target_component = target_component
    
    def generateRewardGraph(self):
        """Generate a base64 encoded string of the reward history plot."""
        plt.figure(figsize=(10, 6))
        plt.plot(list(self.rewardHistory), label=f'Reward Component {self.target_component}')
        plt.title(f'Component {self.target_component} Reward History')
        plt.xlabel('Steps')
        plt.ylabel('Reward Value')
        plt.grid(True)
        
        buffer = io.BytesIO()
        plt.savefig(buffer, format='png')
        buffer.seek(0)
        image_png = buffer.getvalue()
        buffer.close()
        plt.close()
        
        return base64.b64encode(image_png).decode()

    def createUpdatePrompt(self, currentFunction: str, graph: str):
        return [
            {
                "role": "user",
                "content": f"""Analyze this reward component function and its performance graph, then suggest specific modifications.
                
                Current Function:
                {currentFunction}
                
                Performance Graph (base64 encoded):
                {graph}
                
                Requirements:
                1. The function MUST be named 'reward_function_{self.target_component}'
                2. Focus on stability (pole angle and angular velocity)
                3. Include detailed inline comments explaining each change
                4. Compare old vs new values in comments
                
                Output only the modified function with detailed inline comments explaining changes."""
            }
        ]

    def createCriticPrompt(self, proposedFunction: str):
        return [
            {
                "role": "user",
                "content": f"""Act as a reward function critic. Analyze this proposed reward component:

                {proposedFunction}

                Evaluate:
                1. Is the function properly named 'reward_function_{self.target_component}'?
                2. Are reward calculations mathematically sound?
                3. Does it focus appropriately on its specific aspect?
                4. Are there any potential issues?
                
                End your response with EXACTLY one line containing only "Decision: Yes" or "Decision: No"."""
            }
        ]
        
    def recordReward(self, info: dict):
        """Record the specific component reward value."""
        if 'component_rewards' in info:
            component_reward = info['component_rewards'].get(f'reward_function_{self.target_component}')
            if component_reward is not None:
                self.rewardHistory.append(component_reward)
        
    def validateAndUpdate(self, currentFunction: str):
        """Generate and validate updated reward function with delays between API calls."""
        try:
            # First API call for function update
            print(f"\nGenerating new reward function for component {self.target_component}...")
            graph = self.generateRewardGraph()
            updatePrompt = self.createUpdatePrompt(currentFunction, graph)
            proposedFunction = queryAnthropicApi(self.apiKey, self.modelName, updatePrompt)
            
            print("\nProposed Function:")
            print(proposedFunction)
            
            print("\nWaiting 10 seconds before critic evaluation...")
            time.sleep(10)
            
            print("\nGetting critic's evaluation...")
            criticPrompt = self.createCriticPrompt(proposedFunction)
            criticResponse = queryAnthropicApi(self.apiKey, self.modelName, criticPrompt)
            
            print("\nCritic Response:")
            print(criticResponse)
            
            approved = criticResponse.strip().endswith("Decision: Yes")
            print(f"\nCritic Decision: {'Approved' if approved else 'Rejected'}")
            
            if approved:
                return proposedFunction, True
            return currentFunction, False
            
        except Exception as e:
            print(f"\nError during update: {e}")
            return currentFunction, False

    def waitingTime(self, episode: int):
        """Determine if it's time to update the reward function."""
        UPDATE_INTERVAL = 100
        return episode - self.lastUpdateEpisode >= UPDATE_INTERVAL

def main():
    # Initialize with 3 reward components
    env = gym.make("CartPole-v1", render_mode="human")
    env = CustomCartPoleEnv(env, num_components=3)
    
    # Initialize the update system targeting first component
    updateSystem = RewardUpdateSystem(apiKey=apiKey, modelName=modelName, target_component=1)
    
    # Initialize agent
    stateSize = env.observation_space.shape[0]
    actionSize = env.action_space.n
    agent = DQLearningAgent(env, stateSize, actionSize, device)
    
    # Training loop
    numEpisodes = 500
    for episode in range(numEpisodes):
        observation = env.reset()[0]
        totalReward = 0
        done = False
        
        while not done:
            action = agent.chooseAction(observation)
            nextObservation, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
            totalReward += reward
            
            # Record the specific component reward
            updateSystem.recordReward(info)
            
            agent.remember(observation, action, reward, nextObservation, done)
            observation = nextObservation
        
        if updateSystem.waitingTime(episode):
            print(f"\nAttempting reward function update at episode {episode}")
            func_name = f'reward_function_{updateSystem.target_component}'
            current_func = env.reward_components[func_name]
            newFunction, updated = updateSystem.validateAndUpdate(current_func)
            
            if updated:
                print(f"\nUpdating reward component {updateSystem.target_component}...")
                env.set_component_reward(updateSystem.target_component, newFunction)
                updateSystem.lastUpdateEpisode = episode
            else:
                print("\nKeeping current reward function.")
        
        agent.replay(32)
        print(f"Episode {episode + 1}/{numEpisodes}, Total Reward: {totalReward}")
    
    env.close()

if __name__ == "__main__":
    main()
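
`createUpdatePrompt` pastes the base64 string of the plot into the prompt text, so the model receives it as a long run of characters rather than as a picture. The Anthropic Messages API accepts images as separate content blocks. The sketch below shows one way to build such a message. The helper name `build_update_message` is hypothetical, and whether `queryAnthropicApi` from `cartPoleShared` forwards list-style content unchanged is an assumption that would need checking.

```python
import base64

def build_update_message(current_function: str, png_bytes: bytes, component: int):
    """Build a Messages API user turn with the plot attached as an image block."""
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": base64.b64encode(png_bytes).decode(),
                    },
                },
                {
                    "type": "text",
                    "text": (
                        "Analyze this reward component function and the attached "
                        "performance graph, then suggest specific modifications.\n\n"
                        f"Current Function:\n{current_function}\n\n"
                        f"The function MUST be named 'reward_function_{component}'."
                    ),
                },
            ],
        }
    ]

# Placeholder bytes stand in for the PNG produced by generateRewardGraph
message = build_update_message("def reward_function_1(state): ...", b"\x89PNG", 1)
print(message[0]["content"][0]["source"]["media_type"])
```

Sending the plot this way also keeps the text prompt short, since the encoded image no longer counts as prose the model has to read through.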


Episode 1/500, Total Reward: 10.91313761472702
Episode 2/500, Total Reward: 24.778324127197266
Episode 3/500, Total Reward: 35.82992607355118
Episode 4/500, Total Reward: 21.90902191400528
Episode 5/500, Total Reward: 24.916518688201904
Episode 6/500, Total Reward: 25.92098104953766
Episode 7/500, Total Reward: 12.913588225841522
Episode 8/500, Total Reward: 19.904524862766266
Episode 9/500, Total Reward: 19.890467643737793
Episode 10/500, Total Reward: 11.924701571464539
Episode 11/500, Total Reward: 37.8979309797287
Episode 12/500, Total Reward: 13.880309343338013
Episode 13/500, Total Reward: 12.930569171905518
Episode 14/500, Total Reward: 12.916230142116547
Episode 15/500, Total Reward: 9.901115357875824
Episode 16/500, Total Reward: 11.923483490943909
Episode 17/500, Total Reward: 15.893583595752716
Episode 18/500, Total Reward: 10.895552575588226
Episode 19/500, Total Reward: 25.87209528684616
Episode 20/500, Total Reward: 24.85571175813675
Episode 21/500, Total Reward: 18.90034

TypeError: 'str' object is not callable
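
The traceback is cut off, but a plausible cause is that `validateAndUpdate` returns the raw model reply (prose plus a fenced code block) and `main` passes that string to `env.set_component_reward`, so the environment later tries to call a `str`. The sketch below extracts the code from a reply and turns it into a callable. The helper name `compileRewardFunction` is hypothetical, and the assumption that `set_component_reward` expects a callable has not been verified against `cartPoleShared`.

```python
import math
import re

FENCE = "`" * 3  # built at runtime to avoid literal triple backticks here

def compileRewardFunction(reply: str, funcName: str, extraGlobals=None):
    """Extract the first fenced Python block from a reply and return funcName as a callable."""
    pattern = FENCE + r"(?:python)?\s*\n(.*?)" + FENCE
    match = re.search(pattern, reply, re.DOTALL)
    source = match.group(1) if match else reply  # fall back to the whole reply
    namespace = dict(extraGlobals or {})
    exec(source, namespace)  # only run code that has been reviewed or sandboxed
    func = namespace.get(funcName)
    if not callable(func):
        raise ValueError(f"{funcName} was not defined by the reply")
    return func

# A shortened stand-in for a model reply: prose, a fenced function, more prose
reply = (
    "Here's a modified reward function:\n\n"
    + FENCE + "python\n"
    "def reward_function_1(state):\n"
    "    x, x_dot, theta, theta_dot = state\n"
    "    return math.exp(-0.5 * (theta / 0.2) ** 2)\n"
    + FENCE + "\n\nKey changes: ...\n"
)

rewardFn = compileRewardFunction(reply, "reward_function_1", {"math": math})
print(rewardFn([0.0, 0.0, 0.0, 0.0]))
```

In the training loop, the result of such a helper (called with `{"np": np}` so the proposed function can resolve `np`) would be passed to `env.set_component_reward` instead of `newFunction`. Because this executes model-generated code, the critic step alone is not a safety check.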