### **State-of-the-Art Reward Functions for Cart Pole**
https://github.com/openai/gym/wiki/Leaderboard


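Continuous Angle-Based Reward
Provides a continuous signal based on the angle of the pole. Note that the leaderboard snippet below is an action-selection rule (a policy), not a reward function: it pushes the cart right (action 1) when `3 * angle + angle_velocity` is positive and left (action 0) otherwise.
```
def decide(self, observation):
    position, velocity, angle, angle_velocity = observation
    # Weighted sum of pole angle and angular velocity decides the push direction
    action = int(3. * angle + angle_velocity > 0.)
    return action
```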
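For comparison, a reward (as opposed to an action rule) that varies continuously with the pole angle could be sketched as follows. The name `angle_based_reward` and the linear normalization by `THETA_MAX` are assumptions made for illustration and do not come from the leaderboard; `THETA_MAX` is the 12-degree pole angle at which CartPole-v1 ends an episode.

```python
import math

# Assumption: 12 degrees is the pole angle at which CartPole-v1 terminates.
THETA_MAX = math.radians(12)

def angle_based_reward(observation, action=None):
    """Hypothetical reward: 1.0 when upright, falling linearly to 0.0 at THETA_MAX."""
    _, _, angle, _ = observation
    return max(0.0, 1.0 - abs(angle) / THETA_MAX)
```

The `action` argument is unused; it is kept only so the signature matches the other reward functions in this notebook.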

Position-Angle Balanced Reward
Balances rewards between pole angle and cart position.
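No implementation is given for this variant. One possible reading, offered only as a sketch, normalizes the angle and position by their CartPole-v1 termination limits and mixes them with a weight. The function name, the `angle_weight` parameter, and its default of 0.5 are all assumptions.

```python
import math

THETA_MAX = math.radians(12)  # termination angle in CartPole-v1
X_MAX = 2.4                   # termination position in CartPole-v1

def position_angle_reward(observation, action=None, angle_weight=0.5):
    """Hypothetical reward: weighted mix of normalized angle and position terms."""
    x, _, angle, _ = observation
    # Each term is 1.0 at the ideal value and 0.0 at (or beyond) the limit
    angle_term = 1.0 - min(abs(angle) / THETA_MAX, 1.0)
    position_term = 1.0 - min(abs(x) / X_MAX, 1.0)
    return angle_weight * angle_term + (1.0 - angle_weight) * position_term
```

Normalizing both terms first means `angle_weight` expresses the intended balance directly, regardless of the different numeric ranges of the two quantities.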

Velocity-Aware Reward
Incorporates cart and pole velocities.
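No implementation is given for this variant either. As a rough sketch, a velocity-aware reward could subtract small quadratic penalties on the cart and pole velocities from a constant per-step bonus. The function name and both weights are assumptions chosen for illustration, not tuned values.

```python
def velocity_aware_reward(observation, action=None,
                          cart_vel_weight=0.01, pole_vel_weight=0.01):
    """Hypothetical reward: per-step bonus minus quadratic velocity penalties."""
    _, cart_velocity, _, pole_velocity = observation
    penalty = (cart_vel_weight * cart_velocity ** 2
               + pole_vel_weight * pole_velocity ** 2)
    return 1.0 - penalty
```

Unlike the other sketches, this one is not clipped at zero, so very fast motion yields a negative reward.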

### **Claude's Reward Functions**

```python
def LLMOutput(observation, action):
    x, _, angle, _ = observation
    reward = 1.0 - abs(angle) - 0.5 * abs(x)
    return max(0, reward)
```

1. Input:
   - The function takes two parameters: `observation` and `action` (although `action` is not used in this function).
   - `observation` is assumed to be a tuple or list containing four elements, but only the first (x) and third (angle) are used.

2. Unpacking the observation:
   - `x, _, angle, _ = observation`
   - This line extracts the position (x) and angle from the observation. The underscores (_) indicate that we're ignoring the second and fourth elements of the observation.

3. Calculating the reward:
   - `reward = 1.0 - abs(angle) - 0.5 * abs(x)`
   - The reward starts at 1.0 and is reduced based on two factors:
     a. The absolute value of the angle: This penalizes the pole for leaning away from vertical.
     b. Half the absolute value of the position: This penalizes the cart for being far from the center.

4. The logic behind this reward calculation:
   - The goal is to keep the pole upright (angle close to 0) and the cart near the center (x close to 0).
   - A larger angle or distance from center results in a lower reward.
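   - Per unit, the angle is weighted more heavily than the position (1.0 vs. 0.5). In CartPole-v1, however, the angle stays within about ±0.21 rad before the episode terminates while the position can reach ±2.4, so the position term can subtract far more (up to 1.2) than the angle term (up to about 0.21).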

5. Returning the reward:
   - `return max(0, reward)`
   - This ensures the reward is never negative. If the calculated reward is negative, 0 is returned instead.

In summary, this function calculates a reward based on how well the pole is balanced (angle) and how centered the cart is (x). The reward is highest (close to 1) when the pole is perfectly upright and the cart is in the center. It decreases as the pole leans more or the cart moves farther from the center, with the angle weighted twice as heavily per unit as the position.
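To make the relative size of the two penalties concrete, the sketch below evaluates the function at the edges of the valid state range. It assumes the CartPole-v1 termination limits of 12 degrees for the pole angle and 2.4 for the cart position; the names `THETA_MAX` and `X_MAX` are introduced here for illustration.

```python
import math

def LLMOutput(observation, action):
    x, _, angle, _ = observation
    reward = 1.0 - abs(angle) - 0.5 * abs(x)
    return max(0, reward)

# Assumed CartPole-v1 termination limits
THETA_MAX = math.radians(12)  # about 0.2095 rad
X_MAX = 2.4

max_angle_penalty = abs(THETA_MAX)       # largest angle penalty before termination
max_position_penalty = 0.5 * abs(X_MAX)  # largest position penalty before termination

print(f"max angle penalty:    {max_angle_penalty:.3f}")
print(f"max position penalty: {max_position_penalty:.3f}")
print(LLMOutput((0.0, 0.0, 0.0, 0.0), 0))        # upright and centered
print(LLMOutput((0.0, 0.0, THETA_MAX, 0.0), 0))  # pole at its limit, cart centered
print(LLMOutput((X_MAX, 0.0, 0.0, 0.0), 0))      # cart at the track edge, pole upright
```

Within these limits the position penalty can grow several times larger than the angle penalty, and the reward is already clipped to 0 once the cart is 2.0 units from the center with the pole upright.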



---


```python
def LLMOutput(observation, action):
    x, _, angle, _ = observation
    reward = 1.0 - abs(angle) - 0.5 * abs(x)
    return int(max(0, reward * 100))
```

1. Function Input:
   - The function takes two parameters: `observation` and `action`.
   - `observation` is likely a tuple or list containing information about the current state of the cart-pole system.

2. Unpacking the Observation:
   - `x, _, angle, _ = observation`
   - This line unpacks the `observation` into separate variables.
   - `x` represents the position of the cart.
   - `angle` represents the angle of the pole.
   - The `_` is used to ignore other values in the observation that are not needed for this calculation.

3. Reward Calculation:
   - `reward = 1.0 - abs(angle) - 0.5 * abs(x)`
   - This line calculates a reward based on the current state.
   - The reward starts at 1.0 and is reduced based on two factors:
     a. The absolute value of the angle: A larger angle (more tilt) reduces the reward more.
     b. Half of the absolute value of the position: Being further from the center (x=0) also reduces the reward, at half the per-unit weight of the angle.

4. Final Output:
   - `return int(max(0, reward * 100))`
   - The reward is multiplied by 100 to scale it up.
   - The `max` function ensures the result is never negative.
   - The `int` function converts the result to an integer.

Logic behind the reward:
- The goal is to keep the pole as upright as possible (angle close to 0) and the cart near the center (x close to 0).
- The reward is highest (close to 100) when both the angle and position are close to 0.
- As the pole tilts more or the cart moves further from the center, the reward decreases.
- Per unit, the angle has a larger weight than the position, reflecting that keeping the pole upright is the primary goal.

This reward function encourages actions that keep the pole balanced and the cart centered, which aligns with the objective of the cart-pole balancing task.
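A few hand-picked states illustrate the scaling and clipping. The sample values are chosen for illustration so that the arithmetic is exact in floating point; with arbitrary inputs, `int` truncates toward zero rather than rounding, so a product such as 84.999... would become 84.

```python
def LLMOutput(observation, action):
    x, _, angle, _ = observation
    reward = 1.0 - abs(angle) - 0.5 * abs(x)
    return int(max(0, reward * 100))

print(LLMOutput((0.0, 0.0, 0.0, 0.0), 0))   # 100: upright and centered
print(LLMOutput((0.5, 0.0, 0.25, 0.0), 0))  # 50: 1.0 - 0.25 - 0.25 = 0.5
print(LLMOutput((2.4, 0.0, 0.0, 0.0), 0))   # 0: negative reward clipped by max
```

If rounding to the nearest integer is preferred, `round(...)` could replace `int(...)`; that would be a change to the generated function, not something it currently does.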

In [1]:
```python
import gymnasium as gym
import numpy as np

class CustomCartPoleEnv(gym.Wrapper):
    """CartPole wrapper that replaces the built-in reward with a swappable function."""

    def __init__(self, env):
        super().__init__(env)  # gym.Wrapper already stores the wrapped env as self.env
        self.reward_function = self.angle_based_reward

    def step(self, action):
        # Discard the environment's own reward and compute a custom one instead
        observation, _, terminated, truncated, info = self.env.step(action)
        reward = self.reward_function(observation, action)
        return observation, reward, terminated, truncated, info

    def angle_based_reward(self, observation, action):
        _, _, angle, _ = observation
        return np.cos(angle)  # Higher reward when angle is closer to 0 (vertical)

    def LLMRewardFunction(self, function_string):
        local_namespace = {}
        # Execute the function string in the local namespace.
        # Note: exec runs arbitrary code, so only pass strings you trust.
        exec(function_string, globals(), local_namespace)

        # Use the first callable defined by the string
        new_function = None
        for item in local_namespace.values():
            if callable(item):
                new_function = item
                break

        if new_function is None:
            raise ValueError("No function found in the provided string")

        # Set the new function as the reward function
        self.set_reward_function(new_function)

    def set_reward_function(self, reward_function):
        self.reward_function = reward_function

def simple_angle_policy(observation):
    # Push right (1) when the pole leans right, otherwise push left (0)
    _, _, angle, _ = observation
    return 1 if angle > 0 else 0
```
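`LLMRewardFunction` above installs the first callable it finds, and `exec` will raise a `SyntaxError` if the model wraps its answer in a Markdown fence. A more defensive variant might strip fence lines and look the function up by name. This is a minimal sketch; `extract_function` and the sample `response` string are hypothetical and not part of the notebook.

```python
def extract_function(source_text, function_name):
    """Hypothetical helper: pull a named function out of LLM-generated text."""
    # Drop Markdown fence lines such as ```python or ``` if the model added them
    code = "\n".join(
        line for line in source_text.splitlines()
        if not line.strip().startswith("```")
    )
    namespace = {}
    exec(code, namespace)  # caution: this still runs untrusted code
    candidate = namespace.get(function_name)
    if not callable(candidate):
        raise ValueError(f"No function named {function_name!r} in the provided string")
    return candidate

# Made-up response text standing in for a real model reply
response = (
    "```python\n"
    "def dynamicRewardFunction(observation, action):\n"
    "    x, _, angle, _ = observation\n"
    "    return int(10 * (1.0 - abs(angle)))\n"
    "```"
)
reward_fn = extract_function(response, "dynamicRewardFunction")
print(reward_fn.__name__, reward_fn((0.0, 0.0, 0.0, 0.0), 0))
```

Looking the function up by name avoids accidentally picking a helper function or an imported callable that the generated code happens to define first.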



In [2]:
```python
import anthropic

# Create and wrap the environment
env = gym.make("CartPole-v1", render_mode="human")
env = CustomCartPoleEnv(env)

# API query
# The client reads the API key from the ANTHROPIC_API_KEY environment variable;
# keep keys out of the notebook source.
client = anthropic.Anthropic()
generatedRewardFunction = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    messages=[
        {"role": "user",
         "content": """You are a python code outputter. I want your output to be only python code and just one function, which returns an integer. No other text in the output.
         This function will be a reward function, named dynamicRewardFunction(), for a RL environment that follows the description below.
         The inputs are observation and action in that order, the input observation can be broken down as follows: x, _, angle, _ = observation:

         This environment is a pole balanced on a cart that can move from left to right,
         the idea is to keep the pole as upright as possible by moving the cart either left or right,
         the information you have available to you is the position(x) and the angle(angle)."""}
    ]
)
print(generatedRewardFunction.content[0].text)
print("\n ---------------------------------------------------- \n")

rewardFunctionExplanation = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    messages=[
        {"role": "user",
         "content": f"""I need you to explain the logic behind this following function:
        {generatedRewardFunction.content[0].text}

        This environment is a pole balanced on a cart that can move from left to right,
         the idea is to keep the pole as upright as possible by moving the cart either left or right,
         the information you have available to you is the position(x) and the angle(angle).
         """}
    ]
)

print(rewardFunctionExplanation.content[0].text)

# generatedRewardFunction = """def LLMOutput(observation, action):
#    x, _, angle, _ = observation
#    reward = 1.0 - abs(angle) - 0.5 * abs(x)
#    if abs(angle) > 0.2 or abs(x) > 2.4:
#        reward -= 10
#    return reward
# """#API call to Claude

# Set initial policy and reward function
# (this assumes the response is plain Python with no Markdown fences)
env.LLMRewardFunction(generatedRewardFunction.content[0].text)
print("\n ---------------------------------------------------- \n Reward Function Name: ", env.reward_function.__name__)
current_policy = simple_angle_policy  # Set the policy


# Main loop
observation, info = env.reset(seed=42)
for _ in range(1000):
    action = current_policy(observation)
    observation, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        observation, info = env.reset()

env.close()
```
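Because the reward function comes from a model, it may be worth checking it on a few sample observations before handing it to the environment. The sketch below is one way to do that; `check_reward_function` and the sample states are hypothetical and not part of the notebook.

```python
import math
import numbers

def check_reward_function(reward_function, sample_observations, action=0):
    """Hypothetical sanity check: every sample must yield a finite real number."""
    for observation in sample_observations:
        value = reward_function(observation, action)
        if isinstance(value, bool) or not isinstance(value, numbers.Real):
            raise TypeError(f"Reward for {observation} is not a number: {value!r}")
        if not math.isfinite(value):
            raise ValueError(f"Reward for {observation} is not finite: {value!r}")
    return True

# Sample states: centered, near the angle limit, near the track edge
samples = [(0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.2, 0.0), (2.3, 0.0, 0.0, 0.0)]
print(check_reward_function(lambda obs, act: 1.0 - abs(obs[2]), samples))
```

A check like this catches a function that returns `None` or a string, but it says nothing about whether the reward is well shaped for learning.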

```python
def dynamicRewardFunction(observation, action):
    x, _, angle, _ = observation
    reward = 1.0 - abs(angle) * 0.5 - abs(x) * 0.1
    if abs(angle) > 0.2 or abs(x) > 2.4:
        reward -= 10
    return int(reward * 10)
```
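Evaluating the generated function at a few hand-picked states shows the sharp drop once the angle threshold is crossed. The sample values are chosen for illustration so that the arithmetic is exact in floating point.

```python
def dynamicRewardFunction(observation, action):
    x, _, angle, _ = observation
    reward = 1.0 - abs(angle) * 0.5 - abs(x) * 0.1
    if abs(angle) > 0.2 or abs(x) > 2.4:
        reward -= 10
    return int(reward * 10)

print(dynamicRewardFunction((0.0, 0.0, 0.0, 0.0), 0))    # 10: upright and centered
print(dynamicRewardFunction((0.0, 0.0, 0.125, 0.0), 0))  # 9: int(9.375) truncates
print(dynamicRewardFunction((0.0, 0.0, 0.25, 0.0), 0))   # -91: int(-91.25) after the -10 penalty
```

Since CartPole-v1 ends the episode at a pole angle of about 0.2095 rad or a cart position of 2.4, the penalty branch can only fire in the narrow band of angles between 0.2 and the termination limit, or on the final step of an episode.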

 ---------------------------------------------------- 

Certainly! Let's break down the logic of this dynamic reward function:

1. Input:
   - `observation`: A tuple containing the current state of the environment. It includes:
     - `x`: The position of the cart
     - `angle`: The angle of the pole (in radians)
   - `action`: The action taken (not used in this function)

2. Reward Calculation:

   ```python
   reward = 1.0 - abs(angle) * 0.5 - abs(x) * 0.1
   ```

   - The reward starts at 1.0 (maximum reward)
   - It then subtracts penalties based on:
     - The absolute angle of the pole: `abs(angle) * 0.5`
       - This encourages keeping the pole upright
       - The larger the angle, the larger the penalty
     - The absolute position of the cart: `abs(x) * 0

In [None]:
```python
# Implementation with LLM Generated Policy
```

