### **State-of-the-Art Reward Functions for CartPole**
https://github.com/openai/gym/wiki/Leaderboard


Continuous Angle-Based Reward
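Provides a continuous reward based on the angle of the pole. The snippet below is a hand-written control policy rather than a reward function: it returns action 1 (push right) when `3 * angle + angle_velocity` is positive and action 0 (push left) otherwise.
```
def decide(self, observation):
    position, velocity, angle, angle_velocity = observation
    # Action 1 when the weighted sum of angle and angular velocity is positive, else 0
    action = int(3. * angle + angle_velocity > 0.)
    return action
```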

Position-Angle Balanced Reward
Balances rewards between pole angle and cart position.
```

```
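No code is recorded for this entry. As a minimal sketch of what such a reward could look like (not the leaderboard's implementation), the function below normalizes both terms by the standard CartPole termination limits and mixes them with equal weights. The name `position_angle_reward` and the weights `w_angle` and `w_position` are assumptions chosen for illustration.

```python
import math

X_LIMIT = 2.4                   # cart position at which CartPole terminates
ANGLE_LIMIT = math.radians(12)  # pole angle at which CartPole terminates (~0.2095 rad)

def position_angle_reward(observation, action, w_angle=0.5, w_position=0.5):
    """Hypothetical reward: weighted mix of normalized angle and position scores."""
    x, _, angle, _ = observation
    # Each score is 1.0 at the ideal state and falls linearly to 0.0 at the limit
    angle_score = max(0.0, 1.0 - abs(angle) / ANGLE_LIMIT)
    position_score = max(0.0, 1.0 - abs(x) / X_LIMIT)
    return w_angle * angle_score + w_position * position_score
```

Normalizing by the limits puts both terms on the same 0-to-1 scale, so the weights express the intended balance directly.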

Velocity-Aware Reward
Incorporates cart and pole velocities.
```

```
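No code is recorded for this entry either. One possible shape, offered only as a sketch, penalizes the pole angle together with both velocities and squashes the total penalty into the range (0, 1]. The name `velocity_aware_reward` and the coefficients `k_angle`, `k_cart_vel`, and `k_pole_vel` are hypothetical and untuned.

```python
import math

def velocity_aware_reward(observation, action,
                          k_angle=1.0, k_cart_vel=0.1, k_pole_vel=0.1):
    """Hypothetical reward that also discourages fast cart and pole motion."""
    _, cart_velocity, angle, pole_velocity = observation
    penalty = (k_angle * abs(angle)
               + k_cart_vel * abs(cart_velocity)
               + k_pole_vel * abs(pole_velocity))
    # exp(-penalty) equals 1.0 for a still, upright pole and decays toward 0
    return math.exp(-penalty)
```

The velocities in CartPole are unbounded, which is why this sketch uses an exponential squash rather than dividing by a fixed limit.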

### **Claude's Reward Functions**

```python
def LLMOutput(observation, action):
    x, _, angle, _ = observation
    reward = 1.0 - abs(angle) - 0.5 * abs(x)
    return max(0, reward)
```

1. Input:
   - The function takes two parameters: `observation` and `action` (although `action` is not used in this function).
   - `observation` is assumed to be a tuple or list containing four elements, but only the first (x) and third (angle) are used.

2. Unpacking the observation:
   - `x, _, angle, _ = observation`
   - This line extracts the position (x) and angle from the observation. The underscores (_) indicate that we're ignoring the second and fourth elements of the observation.

3. Calculating the reward:
   - `reward = 1.0 - abs(angle) - 0.5 * abs(x)`
   - The reward starts at 1.0 and is reduced based on two factors:
     a. The absolute value of the angle: This penalizes the pole for leaning away from vertical.
     b. Half the absolute value of the position: This penalizes the cart for being far from the center.

4. The logic behind this reward calculation:
   - The goal is to keep the pole upright (angle close to 0) and the cart near the center (x close to 0).
   - A larger angle or distance from center results in a lower reward.
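   - Per unit, the angle has a stronger impact on the reward than the position (full weight vs. half weight). In practice the cart position spans a much wider range (up to 2.4 before the episode terminates, versus about 0.21 rad for the angle), so the position term can still dominate.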

5. Returning the reward:
   - `return max(0, reward)`
   - This ensures the reward is never negative. If the calculated reward is negative, 0 is returned instead.

In summary, this function calculates a reward based on how well the pole is balanced (angle) and how centered the cart is (x). The reward is highest (close to 1) when the pole is perfectly upright and the cart is in the center. It decreases as the pole leans more or the cart moves farther from the center, with the angle having a stronger influence on the reward than the position.
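To make the behavior concrete, the function can be evaluated at a few hand-picked states. The sample observations below are illustrative assumptions, not data from a run.

```python
def LLMOutput(observation, action):
    x, _, angle, _ = observation
    reward = 1.0 - abs(angle) - 0.5 * abs(x)
    return max(0, reward)

# (x, cart velocity, angle, pole velocity); only x and angle affect the reward
samples = {
    "upright, centered": (0.0, 0.0, 0.0, 0.0),
    "tilted 0.125 rad, centered": (0.0, 0.0, 0.125, 0.0),
    "upright, x = 1.0": (1.0, 0.0, 0.0, 0.0),
    "upright, x = 2.0": (2.0, 0.0, 0.0, 0.0),
}

for label, observation in samples.items():
    print(f"{label}: {LLMOutput(observation, action=0)}")
```

With the pole upright, the reward already reaches zero at `x = 2.0`, which is inside the track limit of 2.4, so the clamp in `max(0, reward)` hides any further drift of the cart.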



---

```python
def dynamicRewardFunction(observation, action):
    x, _, angle, _ = observation
    reward = 1.0
    
    if abs(x) > 2.4 or abs(angle) > 0.2:
        reward = -10.0
    else:
        reward += 10.0 / (1.0 + abs(angle))
        reward += 5.0 / (1.0 + abs(x))
    
    return int(reward)
```

The logic of this dynamic reward function breaks down as follows:

1. Function Input:
   - The function takes two parameters: `observation` and `action`.
   - `observation` is unpacked into four variables, but only `x` and `angle` are used.
   - `x` represents the cart's position.
   - `angle` represents the pole's angle from vertical.

2. Initial Reward:
   - The reward starts at 1.0.
...
This reward function encourages the learning algorithm to find a balance between keeping the pole upright and keeping the cart centered, with a stronger emphasis on pole verticality.
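One detail worth checking is the effect of `int(reward)`, which truncates toward zero. The sketch below evaluates the function at a few assumed states chosen for illustration.

```python
def dynamicRewardFunction(observation, action):
    x, _, angle, _ = observation
    reward = 1.0

    if abs(x) > 2.4 or abs(angle) > 0.2:
        reward = -10.0
    else:
        reward += 10.0 / (1.0 + abs(angle))
        reward += 5.0 / (1.0 + abs(x))

    return int(reward)

best = dynamicRewardFunction((0.0, 0.0, 0.0, 0.0), 0)            # 1 + 10 + 5
small_tilt = dynamicRewardFunction((0.0, 0.0, 0.05, 0.0), 0)
larger_tilt = dynamicRewardFunction((0.0, 0.0, 0.1, 0.0), 0)
near_limits = dynamicRewardFunction((2.4, 0.0, 0.2, 0.0), 0)
failed = dynamicRewardFunction((3.0, 0.0, 0.0, 0.0), 0)

print(best, small_tilt, larger_tilt, near_limits, failed)
```

A tilt of 0.05 rad and a tilt of 0.1 rad produce the same integer reward, so truncation discards part of the gradient that the continuous terms provide. Non-failure states map to integers between 10 and 16, and failure states to -10.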



In [8]:
# CartPole LLM reward function implementation

import gymnasium as gym
import numpy as np

class CustomCartPoleEnv(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)
        self.env = env
        self.rewardFunction = self.angleBasedReward

    def step(self, action):
        observation, _, terminated, truncated, info = self.env.step(action)
        reward = self.rewardFunction(observation, action)
        return observation, reward, terminated, truncated, info

    def angleBasedReward(self, observation, action):
        _, _, angle, _ = observation
        return np.cos(angle)  #higher reward when angle is closer to 0 (vertical)
    
    def LLMRewardFunction(self, functionString):
        localNamespace = {}
        #execute the function string in the local namespace
        exec(functionString, globals(), localNamespace)
        
        new_function = None
        for item in localNamespace.values():
            if callable(item):
                new_function = item
                break
        
        if new_function is None:
            raise ValueError("Invalid Function")
        
        #Set the new function as the reward function
        self.setRewardFunction(new_function)

    def setRewardFunction(self, rewardFunction):
        self.rewardFunction = rewardFunction

def simpleAnglePolicy(observation):
    _, _, angle, _ = observation
    return 1 if angle > 0 else 0
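`LLMRewardFunction` above takes the first callable it finds and executes the reply with separate global and local namespaces. Two situations can trip this up: a reply wrapped in Markdown code fences, and a reply whose function relies on a top-level import or helper, which the function cannot see at call time when globals and locals differ. The sketch below is one possible, more defensive loader; `load_reward_function` is a hypothetical helper, not part of the notebook, and it still executes untrusted code.

```python
import re

FENCE = "`" * 3  # Markdown code fence, built this way to keep the literal out of the source

def load_reward_function(reply_text, function_name="dynamicRewardFunction"):
    """Hypothetical helper: strip Markdown fences and return the named function."""
    pattern = FENCE + r"(?:python)?\s*\n(.*?)" + FENCE
    match = re.search(pattern, reply_text, re.DOTALL)
    source = match.group(1) if match else reply_text

    # A single namespace lets the function see imports and helpers from the reply.
    namespace = {}
    exec(source, namespace)  # runs untrusted code; sandboxing is out of scope here

    candidate = namespace.get(function_name)
    if not callable(candidate):
        raise ValueError(f"No function named {function_name!r} in the reply")
    return candidate
```

Looking the function up by name avoids picking an unrelated callable when the reply defines more than one, and the result can be passed to `setRewardFunction` as before.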



In [14]:
import anthropic

env = gym.make("CartPole-v1", render_mode="human")
env = CustomCartPoleEnv(env)

#API Query
# The client reads the key from the ANTHROPIC_API_KEY environment variable;
# never hard-code a secret key in the notebook.
client = anthropic.Anthropic()
generatedRewardFunction = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    messages=[
        {"role": "user",
         "content": """You are a python code outputter. I want your output to only be python code and just be one function, which is an integer. No other text with it the output. 
         This function will be a reward function, named dynamicRewardFunction(), for a RL environment that follows the description below. 
         The inputs are observation and action in that order, the input observation can be broken down as follows: x, _, angle,_ = observation :
         
         This environment is a pole balanced on a cart that can move from left to right, 
         the idea is to keep the pole as upright as possible by moving the cart either left or right, 
         the information you have available to you is the position(x) and the angle(angle)."""}
    ]
)
print(generatedRewardFunction.content[0].text)
print("\n ---------------------------------------------------- \n")

rewardFunctionExplanation = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    messages=[
        {"role": "user",
         "content": f"""I need you to explain the logic behind this following function:
        {generatedRewardFunction.content[0].text} 

        This environment is a pole balanced on a cart that can move from left to right, 
         the idea is to keep the pole as upright as possible by moving the cart either left or right, 
         the information you have available to you is the position(x) and the angle(angle).
         """}
    ]
)

print(rewardFunctionExplanation.content[0].text)

# generatedRewardFunction = """def LLMOutput(observation, action):
#    x, _, angle, _ = observation
#    reward = 1.0 - abs(angle) - 0.5 * abs(x)
#    if abs(angle) > 0.2 or abs(x) > 2.4:
#        reward -= 10
#    return reward
# """#API call to Claude

# Set initial policy and reward function
env.LLMRewardFunction(generatedRewardFunction.content[0].text)
print("\n ---------------------------------------------------- \n Reward Function Name: ", env.rewardFunction.__name__)
current_policy = simpleAnglePolicy  # Set the policy 


#Main loop
observation, info = env.reset(seed=42)
for _ in range(1000):
    action = current_policy(observation)
    observation, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        observation, info = env.reset()

env.close()

def dynamicRewardFunction(observation, action):
    x, _, angle, _ = observation
    reward = 1.0
    
    if abs(x) > 2.4 or abs(angle) > 0.2:
        reward = -10.0
    else:
        reward += 10.0 / (1.0 + abs(angle))
        reward += 5.0 / (1.0 + abs(x))
    
    return int(reward)

 ---------------------------------------------------- 

Certainly! Let's break down the logic of this dynamic reward function:

1. Function Input:
   - The function takes two parameters: `observation` and `action`.
   - `observation` is unpacked into four variables, but only `x` and `angle` are used.
   - `x` represents the cart's position.
   - `angle` represents the pole's angle from vertical.

2. Initial Reward:
   - The reward starts at 1.0.

3. Failure Conditions:
   - If the absolute value of `x` is greater than 2.4 (cart too far from center) or
   - If the absolute value of `angle` is greater than 0.2 radians (pole too tilted),
   - Then the reward is set to -10.0, indicating a failure state.

4

In [None]:
# Possible alterations
# - I could alter the prompt
# - I could use other LLMs or fine-tune a model