### **State-of-the-Art Reward Functions for Cart Pole**
https://github.com/openai/gym/wiki/Leaderboard


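Continuous Angle-Based Reward
Intended to provide a continuous reward based on the angle of the pole. Note that the snippet below is a hand-coded policy rather than a reward function: it returns an action (0 or 1) computed from the pole angle and its angular velocity.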
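```python
def decide(self, observation):
    # Observation layout: cart position, cart velocity, pole angle, pole angular velocity
    position, velocity, angle, angle_velocity = observation
    # Push right (1) when the weighted sum of angle and angular velocity is positive, else push left (0)
    action = int(3. * angle + angle_velocity > 0.)
    return action
```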
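
The snippet above returns an action rather than a reward, so for comparison here is a minimal sketch of what a continuous angle-based reward could look like. The name `angleReward` and the normalisation by CartPole's 12-degree termination angle are assumptions for illustration; they are not taken from the leaderboard.

```python
import math

# Assumed: CartPole ends the episode once the pole tilts past 12 degrees.
ANGLE_LIMIT = math.radians(12)  # about 0.2095 rad


def angleReward(observation, action):
    """Hypothetical reward: 1.0 when upright, falling linearly to 0.0 at the limit."""
    _, _, angle, _ = observation
    return max(0.0, 1.0 - abs(angle) / ANGLE_LIMIT)


print(angleReward((0.0, 0.0, 0.0, 0.0), 0))              # upright pole
print(angleReward((0.0, 0.0, ANGLE_LIMIT / 2, 0.0), 1))  # halfway to the limit
```

Because the reward varies smoothly with the angle, the agent gets a learning signal on every step instead of only at failure.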

Position-Angle Balanced Reward
Balances rewards between pole angle and cart position.
```

```
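
The code block for this variant is empty in the notebook. As a rough illustration only, one way to balance the two terms is to normalise each by its termination limit and mix them with a weight. The name `balancedReward`, the `angleWeight` parameter, and the limits used are assumptions, not the author's implementation.

```python
import math

# Assumed CartPole termination limits: 2.4 units of track, 12 degrees of tilt.
X_LIMIT = 2.4
ANGLE_LIMIT = math.radians(12)


def balancedReward(observation, action, angleWeight=0.5):
    """Hypothetical reward mixing normalised angle and position terms."""
    x, _, angle, _ = observation
    angleTerm = max(0.0, 1.0 - abs(angle) / ANGLE_LIMIT)
    positionTerm = max(0.0, 1.0 - abs(x) / X_LIMIT)
    return angleWeight * angleTerm + (1.0 - angleWeight) * positionTerm


print(balancedReward((0.0, 0.0, 0.0, 0.0), 0))  # centred and upright
print(balancedReward((1.2, 0.0, 0.0, 0.0), 0))  # halfway to the track edge
```

Normalising both terms to the range 0 to 1 keeps the weight meaningful; without it, the wider range of `x` would swamp the angle term.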

Velocity-Aware Reward
Incorporates cart and pole velocities.
```

```
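
This code block is also empty in the notebook. A minimal sketch of a velocity-aware reward, under the assumption that fast cart or pole motion should be penalised, might look like the following. The name `velocityAwareReward`, the `velocityWeight` value, and the limits are hypothetical choices.

```python
import math

# Assumed CartPole termination limits: 2.4 units of track, 12 degrees of tilt.
X_LIMIT = 2.4
ANGLE_LIMIT = math.radians(12)


def velocityAwareReward(observation, action, velocityWeight=0.1):
    """Hypothetical reward: upright and centred is good, fast motion is penalised."""
    x, cartVelocity, angle, poleVelocity = observation
    base = 1.0 - abs(angle) / ANGLE_LIMIT - 0.5 * abs(x) / X_LIMIT
    penalty = velocityWeight * (abs(cartVelocity) + abs(poleVelocity))
    return max(0.0, base - penalty)


print(velocityAwareReward((0.0, 0.0, 0.0, 0.0), 0))  # at rest, upright
print(velocityAwareReward((0.0, 1.0, 0.0, 1.0), 0))  # same pose, but moving
```

Unlike the other functions in this notebook, this one uses all four observation values, so two states with the same pose can receive different rewards.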

### **Claude's Reward Functions**

```python
def LLMOutput(observation, action):
    x, _, angle, _ = observation
    reward = 1.0 - abs(angle) - 0.5 * abs(x)
    return max(0, reward)
```

1. Input:
   - The function takes two parameters: `observation` and `action` (although `action` is not used in this function).
   - `observation` is assumed to be a tuple or list containing four elements, but only the first (x) and third (angle) are used.

2. Unpacking the observation:
   - `x, _, angle, _ = observation`
   - This line extracts the position (x) and angle from the observation. The underscores (_) indicate that we're ignoring the second and fourth elements of the observation.

3. Calculating the reward:
   - `reward = 1.0 - abs(angle) - 0.5 * abs(x)`
   - The reward starts at 1.0 and is reduced based on two factors:
     a. The absolute value of the angle: This penalizes the pole for leaning away from vertical.
     b. Half the absolute value of the position: This penalizes the cart for being far from the center.

4. The logic behind this reward calculation:
   - The goal is to keep the pole upright (angle close to 0) and the cart near the center (x close to 0).
   - A larger angle or distance from center results in a lower reward.
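   - Per unit, the angle is weighted more heavily than the position (full weight vs. half weight). In practice the position term can dominate, because `x` can reach 2.4 before the episode ends while the angle only reaches about 0.2 radians.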

5. Returning the reward:
   - `return max(0, reward)`
   - This ensures the reward is never negative. If the calculated reward is negative, 0 is returned instead.

In summary, this function calculates a reward based on how well the pole is balanced (angle) and how centered the cart is (x). The reward peaks at exactly 1.0 when the pole is perfectly upright and the cart is in the center, and it decreases as the pole leans or the cart moves away from the center. The angle carries twice the per-unit weight of the position, and the result is clamped so it never drops below 0.
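
To make the breakdown above concrete, the sketch below evaluates `LLMOutput` on three sample states. The states are assumed values chosen for illustration, not observations recorded from a run.

```python
def LLMOutput(observation, action):
    x, _, angle, _ = observation
    reward = 1.0 - abs(angle) - 0.5 * abs(x)
    return max(0, reward)


# Assumed sample states: (x, cart velocity, angle in radians, pole velocity)
samples = {
    "centred and upright": (0.0, 0.0, 0.0, 0.0),
    "pole near its limit": (0.0, 0.0, 0.2, 0.0),
    "cart near the edge": (2.4, 0.0, 0.0, 0.0),
}

for label, state in samples.items():
    print(f"{label}: {LLMOutput(state, 0):.2f}")
```

This prints 1.00, 0.80, and 0.00 respectively. A pole at the edge of failure still earns 0.80, while a cart at the edge of the track earns nothing, so over the reachable state space the position term has the larger effect.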



---

```python
def dynamicRewardFunction(observation, action):
    x, _, angle, _ = observation
    reward = 1.0
    
    if abs(x) > 2.4 or abs(angle) > 0.2:
        reward = -10.0
    else:
        reward += 10.0 / (1.0 + abs(angle))
        reward += 5.0 / (1.0 + abs(x))
    
    return int(reward)
```

The logic of this dynamic reward function breaks down as follows:

1. Function Input:
   - The function takes two parameters: `observation` and `action`.
   - `observation` is unpacked into four variables, but only `x` and `angle` are used.
   - `x` represents the cart's position.
   - `angle` represents the pole's angle from vertical.

2. Initial Reward:
   - The reward starts at 1.0.
...
This reward function encourages the learning algorithm to find a balance between keeping the pole upright and keeping the cart centered, with a stronger emphasis on pole verticality.
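
As a quick check of how `dynamicRewardFunction` behaves, the sketch below evaluates it on a few assumed states (illustrative values, not taken from a training run).

```python
def dynamicRewardFunction(observation, action):
    x, _, angle, _ = observation
    reward = 1.0

    if abs(x) > 2.4 or abs(angle) > 0.2:
        reward = -10.0
    else:
        reward += 10.0 / (1.0 + abs(angle))
        reward += 5.0 / (1.0 + abs(x))

    return int(reward)


# Assumed sample states: (x, cart velocity, angle in radians, pole velocity)
print(dynamicRewardFunction((0.0, 0.0, 0.0, 0.0), 0))  # best case
print(dynamicRewardFunction((1.2, 0.0, 0.1, 0.0), 0))  # mid-range
print(dynamicRewardFunction((2.4, 0.0, 0.2, 0.0), 0))  # just inside both limits
print(dynamicRewardFunction((2.5, 0.0, 0.0, 0.0), 0))  # out of bounds
```

The four calls return 16, 12, 10, and -10. Inside the limits the reward only spans 10 to 16, and `int()` truncates it to a whole number, so small improvements in angle or position often leave the reward unchanged.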



In [1]:
import anthropic
from cartPoleShared import device
from cartPoleShared import *



#Dynamically adding the LLMGeneratedRewardFunction Function

def LLMRewardFunction(self, functionString):
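    localNamespace = {}
    # WARNING: exec runs whatever code the LLM returned; only use it on output you trust
    exec(functionString, globals(), localNamespace)
    
    # Use the first callable the generated code defined as the reward function
    new_function = None
    for item in localNamespace.values():
        if callable(item):
            new_function = item
            break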
    
    if new_function is None:
        raise ValueError("Invalid Function")
    
    # Set the new function as the reward function
    self.setRewardFunction(new_function)

    
def setRewardFunction(self, rewardFunction):
    self.rewardFunction = rewardFunction

#Added to CustomCartPoleEnv
CustomCartPoleEnv.LLMRewardFunction = LLMRewardFunction
CustomCartPoleEnv.setRewardFunction = setRewardFunction


#Initializing Class
env = gym.make("CartPole-v1", render_mode="human")
env = CustomCartPoleEnv(env)

# Determine stateSize and actionSize from the environment
stateSize = env.observation_space.shape[0]
actionSize = env.action_space.n

# Initialize the DQLearningAgent with the required parameters
agent = DQLearningAgent(env, stateSize=stateSize, actionSize=actionSize, device=device)


#API Query
# Read the API key from the environment instead of hardcoding a secret in the notebook
import os
apiKey = os.environ["ANTHROPIC_API_KEY"]
modelName = "claude-3-5-sonnet-20240620"


rewardFunctionMessage = [
    {
        "role": "user",
        "content": """You are a python code outputter. I want your output to only be python code and just be one function, which is an integer. No other text with it the output. 
        This function will be a reward function, named dynamicRewardFunction(), for a RL environment that follows the description below. 
        The inputs are observation and action in that order, the input observation can be broken down as follows: x, _, angle,_ = observation :
        
        This environment is a pole balanced on a cart that can move from left to right, 
        the idea is to keep the pole as upright as possible by moving the cart either left or right, 
        the information you have available to you is the position(x) and the angle(angle)."""
    }
]


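# NOTE: the function text in this prompt is hardcoded, so the explanation returned
# describes this text, not the function stored in generatedRewardFunction
explanationMessage = [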
    {
        "role": "user",
        "content": """I need you to explain the logic behind this following function:
        def dynamicRewardFunction(observation, action):
            x, _, angle, _ = observation
            reward = 1.0 - abs(angle) - 0.5 * abs(x)
            if abs(angle) > 0.2 or abs(x) > 2.4:
                reward -= 10
            return reward

        This environment is a pole balanced on a cart that can move from left to right, 
        the idea is to keep the pole as upright as possible by moving the cart either left or right."""
    }
]

# Query the API for the reward function using a custom message
generatedRewardFunction = queryAnthropicApi(apiKey, modelName, rewardFunctionMessage)
# print(type(generatedRewardFunction))

print(generatedRewardFunction)
print("\n ---------------------------------------------------- \n")

# Get the explanation for the generated function using another custom message
rewardExplanation = queryAnthropicExplanation(apiKey, modelName, explanationMessage)
# print(type(generatedRewardFunction))
# print(generatedRewardFunction)

print(rewardExplanation)


#Logging API Output

logClaudeCall(rewardFunctionMessage, explanationMessage, generatedRewardFunction, rewardExplanation)

# generatedRewardFunction = """def LLMOutput(observation, action):
#    x, _, angle, _ = observation
#    reward = 1.0 - abs(angle) - 0.5 * abs(x)
#    if abs(angle) > 0.2 or abs(x) > 2.4:
#        reward -= 10
#    return reward
# """#API call to Claude



# Set initial policy and reward function
env.LLMRewardFunction(generatedRewardFunction)

print("\n ---------------------------------------------------- \n Reward Function Name: ", env.rewardFunction.__name__)



# Main training loop
numEpisodes = 1000
batchSize = 32

for episode in range(numEpisodes):
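    # Seed only the first reset; reseeding every episode would replay the same start state
    observation, info = env.reset(seed=42 if episode == 0 else None)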
    state = observation

    done = False
    totalReward = 0

    while not done:
        # Choose an action using the agent's policy
        action = agent.chooseAction(state)

        # Perform the action in the environment
        nextObservation, reward, terminated, truncated, info = env.step(action)
        nextState = nextObservation

        # Remember the experience in the replay memory
        agent.remember(state, action, reward, nextState, terminated or truncated)

        # Update the state
        state = nextState

        # Update the total reward for this episode
        totalReward += reward

        # Check if the episode is done
        done = terminated or truncated

    # Train the agent using replay memory after each episode
    agent.replay(batchSize=batchSize)

    # Print progress
    print(f"Episode {episode + 1}/{numEpisodes}, Total Reward: {totalReward}")

env.close()


def dynamicRewardFunction(observation, action):
    x, _, angle, _ = observation
    reward = 1.0
    if abs(x) > 2.4 or abs(angle) > 0.2:
        reward = -10.0
    else:
        reward += (1.0 - abs(angle)) * 10
        reward -= abs(x) * 0.5
    return int(reward)

 ---------------------------------------------------- 

Certainly! Let's break down the logic behind this dynamic reward function:

1. Input:
   - `observation`: A tuple containing four values (x, _, angle, _)
   - `action`: The action taken (not used in this function)

2. Unpacking the observation:
   `x, _, angle, _ = observation`
   - `x`: The position of the cart
   - `angle`: The angle of the pole (in radians)
   - The underscores (_) represent unused values in the observation

3. Base reward calculation:
   `reward = 1.0 - abs(angle) - 0.5 * abs(x)`
   - Starts with a base reward of 1.0
   - Subtracts the absolute value of the angle (penalizes for tilting)
   - Subtracts half the absolute value of the x position (pe

KeyboardInterrupt: 
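
The `LLMRewardFunction` method in the cell above installs the first callable that `exec` produces. A minimal sketch of a stricter loader is shown below, assuming the generated function is expected to have a known name. The helper `loadRewardFunction` and its `expectedName` parameter are hypothetical and not part of `cartPoleShared`.

```python
def loadRewardFunction(functionString, expectedName="dynamicRewardFunction"):
    """Hypothetical helper: exec the source and return the function with the expected name."""
    namespace = {}
    # This still executes untrusted code; it is a sanity check, not a sandbox.
    exec(functionString, namespace)

    candidate = namespace.get(expectedName)
    if not callable(candidate):
        raise ValueError(f"No function named {expectedName!r} in generated code")

    # Smoke-test on a neutral state so a broken function fails before training starts.
    result = candidate((0.0, 0.0, 0.0, 0.0), 0)
    if not isinstance(result, (int, float)):
        raise ValueError("Reward function must return a number")
    return candidate


# Assumed example of generated source, for illustration only.
source = """def dynamicRewardFunction(observation, action):
    x, _, angle, _ = observation
    return int(10 * (1.0 - abs(angle)) - abs(x))
"""

rewardFunction = loadRewardFunction(source)
print(rewardFunction((0.0, 0.0, 0.0, 0.0), 0))
```

Looking the function up by name avoids accidentally picking up a helper or an imported callable, and running it once up front surfaces malformed output before the training loop begins.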

In [None]:
#Alterations
# - I could alter the prompt
# - I could use other LLMs or fine-tune one


# Could also implement a policy that isn't learned using Q-Learning (maybe A2C)