# Gym Reward Functions 

This tutorial shows how to write custom reward functions for an OpenAI Gym environment and use them to train a PyTorch agent. We will create a simple environment with the Gym library, replace its built-in reward with our own function, and train a policy network on the result.

## Prerequisites

Before we begin, make sure you have the following installed:

- Python 3.6 or later
- PyTorch
- OpenAI Gym

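You can install PyTorch and OpenAI Gym using pip. The code in this tutorial uses the classic Gym API, in which `env.reset()` returns only the observation and `env.step()` returns four values. Gym 0.26 changed both signatures, so install an earlier release:

```
pip install torch
pip install "gym<0.26"
```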


In [None]:
# Import the necessary libraries

import gym
import torch
import torch.nn as nn
import torch.optim as optim

print('Libraries imported!')

## Step 1: Create a Gym Environment

First, let's create a simple Gym environment. We will use the `CartPole-v0` environment, in which the agent must keep a pole balanced upright on a cart by pushing the cart left or right.

In [None]:
# Create the Gym environment

env = gym.make('CartPole-v0')
print(env)
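
The reward functions in Step 3 read the observation by index, so it helps to know what each position holds. The sketch below names the four components of a CartPole observation. `CartPoleObs` and `describe` are hypothetical helpers introduced only for illustration, and the sample values are made up rather than produced by the environment.

```python
from collections import namedtuple

# Hypothetical helper type; Gym itself returns a plain array of four floats.
CartPoleObs = namedtuple(
    "CartPoleObs",
    ["cart_position", "cart_velocity", "pole_angle", "pole_angular_velocity"],
)


def describe(observation):
    """Wrap a 4-element observation so its fields can be read by name."""
    return CartPoleObs(*observation)


# Made-up sample observation, not produced by the environment
obs = describe([0.5, -0.1, 0.05, 0.2])
print(obs.cart_position)  # same value as obs[0]
print(obs.pole_angle)     # same value as obs[2], in radians
```
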

## Step 2: Define the Neural Network

Next, we will define a simple feedforward neural network using PyTorch. This network will take the state of the environment as input and output the action probabilities.

In [None]:
# Define the neural network

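class Net(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # One hidden layer maps the state to hidden features
        self.fc1 = nn.Linear(input_size, hidden_size)
        # The output layer produces one score per action
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        # Softmax over the last dimension turns scores into probabilities
        x = torch.softmax(self.fc2(x), dim=-1)
        return x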

input_size = env.observation_space.shape[0]
hidden_size = 64
output_size = env.action_space.n
model = Net(input_size, hidden_size, output_size)
print(model)

## Step 3: Understand and Implement Different Reward Functions

In this step, we will dive deeper into reward functions and implement three different reward functions for the CartPole environment. The choice of reward function can significantly impact the agent's learning and performance.

### Reward Function 1: Basic Reward

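This is the simplest reward function: the agent receives +1 for every time step the pole stays balanced and -1 on the step that ends the episode. Note that `done` is also set when the episode reaches the environment's time limit (200 steps for `CartPole-v0`), so an episode that survives to the limit is penalized on its final step as well.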

In [None]:
def basic_reward(state, action, next_state, done):
    if done:
        return -1
    else:
        return 1

print('Basic reward function defined!')

### Reward Function 2: Pole Angle

In this reward function, the reward depends on the absolute pole angle, which is the third component of the observation (`next_state[2]`) and is measured in radians. The closer the pole is to upright, the higher the reward.

In [None]:
def angle_reward(state, action, next_state, done):
    pole_angle = abs(next_state[2])
    if done:
        return -1
    else:
        return 1 - pole_angle / (2 * 3.14159265)

print('Pole angle reward function defined!')
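
Because the angle is divided by 2π, the non-terminal reward above varies very little. Assuming CartPole's standard termination angle of 12 degrees, it stays between roughly 0.967 and 1. The sketch below checks that range and shows one possible alternative that divides by the termination angle instead. `rescaled_angle_reward` is a hypothetical variant for comparison, not part of the tutorial's reward functions.

```python
import math

# Assumption: CartPole ends the episode once |pole angle| exceeds 12 degrees
ANGLE_LIMIT = math.radians(12)


def angle_reward_value(pole_angle):
    """Non-terminal branch of angle_reward, written with math.pi."""
    return 1 - abs(pole_angle) / (2 * math.pi)


def rescaled_angle_reward(pole_angle, limit=ANGLE_LIMIT):
    """Hypothetical variant: 1 when upright, 0 at the termination angle."""
    return max(0.0, 1 - abs(pole_angle) / limit)


# Compare both rewards at the upright position and at the termination angle
print(angle_reward_value(0.0), angle_reward_value(ANGLE_LIMIT))
print(rescaled_angle_reward(0.0), rescaled_angle_reward(ANGLE_LIMIT))
```

A wider reward range gives the agent a stronger signal to distinguish a nearly upright pole from one that is about to fall, which is the motivation for trying a rescaled version.
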

### Reward Function 3: Pole Angle and Cart Position

In this reward function, the reward depends on both the absolute pole angle and the cart's position (`next_state[0]`). The position is divided by 2.4, the distance from the center at which the episode ends. The closer the pole is to upright and the closer the cart is to the center, the higher the reward.

In [None]:
def angle_position_reward(state, action, next_state, done):
    pole_angle = abs(next_state[2])
    cart_position = abs(next_state[0])
    if done:
        return -1
    else:
        return 1 - (pole_angle / (2 * 3.14159265) + cart_position / 2.4) / 2

print('Pole angle and cart position reward function defined!')

Now choose one of these reward functions and bind it to the name `get_reward`, which the loops in Steps 4 and 5 call (for example, `get_reward = angle_reward`). Swapping the function lets you see how different reward functions affect the agent's learning and performance.
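
One way to make the choice explicit is to keep the candidates in a dictionary and look one up by name. The following is a minimal sketch under that assumption: it restates the three functions compactly (using `math.pi`), and `REWARD_FUNCTIONS`, `select_reward`, and the sample transition are hypothetical names and values introduced for illustration.

```python
import math


def basic_reward(state, action, next_state, done):
    return -1 if done else 1


def angle_reward(state, action, next_state, done):
    if done:
        return -1
    return 1 - abs(next_state[2]) / (2 * math.pi)


def angle_position_reward(state, action, next_state, done):
    if done:
        return -1
    angle_term = abs(next_state[2]) / (2 * math.pi)
    position_term = abs(next_state[0]) / 2.4
    return 1 - (angle_term + position_term) / 2


# Hypothetical registry so the reward can be chosen by name
REWARD_FUNCTIONS = {
    "basic": basic_reward,
    "angle": angle_reward,
    "angle_position": angle_position_reward,
}


def select_reward(name):
    """Look up a reward function; raises KeyError for unknown names."""
    return REWARD_FUNCTIONS[name]


get_reward = select_reward("angle_position")

# Made-up transition: cart halfway to the edge, pole tilted 0.1 rad
sample_next_state = [1.2, 0.0, 0.1, 0.0]
for name, fn in REWARD_FUNCTIONS.items():
    print(name, fn(None, 0, sample_next_state, False))
```

On this sample transition the position term dominates, so `angle_position_reward` returns a noticeably lower value than the other two.
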

## Step 4: Train the Agent

Now, we will train the agent using the neural network, reward function, and Gym environment. We will use the REINFORCE algorithm for training.
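
As a reference for the training loop in this step, the quantity it minimizes for one episode of length $T$ can be written as follows. The symbols $G_t$ (discounted return from step $t$), $\pi_\theta$ (the policy network), and $r_k$ (the value returned by the reward function at step $k$) are notation introduced here, with $\gamma$ corresponding to `gamma` in the code:

$$
G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_k,
\qquad
L(\theta) = -\sum_{t=0}^{T-1} G_t \,\log \pi_\theta(a_t \mid s_t)
$$

Minimizing $L(\theta)$ raises the log-probability of actions that were followed by high returns and lowers it for actions followed by low returns.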

In [None]:
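# Training parameters
num_episodes = 1000
learning_rate = 0.01
gamma = 0.99  # discount factor

# Reward function used by the training and evaluation loops;
# swap in angle_reward or angle_position_reward to compare
get_reward = basic_reward

optimizer = optim.Adam(model.parameters(), lr=learning_rate)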

# Train the agent
for episode in range(num_episodes):
    state = env.reset()
    rewards = []
    log_probs = []
    done = False
    
    while not done:
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        action_probs = model(state_tensor)
        action = torch.multinomial(action_probs, 1).item()
        log_prob = torch.log(action_probs[0, action])
        
        next_state, _, done, _ = env.step(action)
        reward = get_reward(state, action, next_state, done)
        
        rewards.append(reward)
        log_probs.append(log_prob)

        state = next_state
    
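    # Update the model
    R = 0
    policy_loss = []
    # Walk the episode backwards to accumulate the discounted return R
    for r, log_prob in zip(reversed(rewards), reversed(log_probs)):
        R = r + gamma * R
        policy_loss.append(-log_prob * R)
    # Each term is a 0-dim tensor, so stack (not cat) before summing
    policy_loss = torch.stack(policy_loss).sum()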
    
    optimizer.zero_grad()
    policy_loss.backward()
    optimizer.step()
    
    if episode % 50 == 0:
        print(f'Episode {episode}, Loss: {policy_loss.item()}')
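
The update above computes the discounted return inline while it builds the loss. The same recursion can be sketched on its own in plain Python, which makes it easy to check by hand. `discounted_returns` is a hypothetical helper name, and the reward list is made up for the example.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step t of one episode."""
    returns = []
    running = 0.0
    # Walk backwards so each return reuses the one computed after it
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    # Put the returns back in chronological order
    returns.reverse()
    return returns


# Three-step episode ending with the -1 penalty
print(discounted_returns([1, 1, -1], gamma=0.5))  # [1.25, 0.5, -1.0]
```

With a final reward of -1, the penalty propagates backwards and reduces the return of the steps just before the failure more than the return of earlier steps.
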

## Step 5: Evaluate the Trained Agent

After training, we can evaluate the agent by running it in the environment and observing its performance.

In [None]:
# Evaluate the agent
num_eval_episodes = 10

for episode in range(num_eval_episodes):
    state = env.reset()
    done = False
    total_reward = 0
    
    while not done:
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        # No gradients are needed during evaluation
        with torch.no_grad():
            action_probs = model(state_tensor)
        # Act greedily: pick the most probable action
        action = torch.argmax(action_probs, 1).item()
        
        next_state, _, done, _ = env.step(action)
        reward = get_reward(state, action, next_state, done)
        total_reward += reward

        state = next_state
    
    print(f'Episode {episode + 1}, Total Reward: {total_reward}')
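
Training and evaluation select actions differently: training samples from the action probabilities (`torch.multinomial`), while evaluation takes the most probable action (`torch.argmax`). The sketch below illustrates the difference using only the standard library. `sample_action`, `greedy_action`, and the probability values are hypothetical and stand in for the network output.

```python
import random


def sample_action(action_probs, rng=random):
    """Draw an action index in proportion to its probability (training)."""
    return rng.choices(range(len(action_probs)), weights=action_probs, k=1)[0]


def greedy_action(action_probs):
    """Pick the most probable action index (evaluation)."""
    return max(range(len(action_probs)), key=lambda i: action_probs[i])


probs = [0.3, 0.7]  # made-up policy output for a two-action environment
rng = random.Random(0)  # seeded generator so the run is repeatable
samples = [sample_action(probs, rng) for _ in range(1000)]

print(greedy_action(probs))         # always the index of the largest probability
print(sum(samples) / len(samples))  # fraction of sampled actions equal to 1
```

Sampling keeps the agent exploring while it learns, whereas the greedy choice reports how well the learned policy performs without that randomness.
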

## Next Steps

In this tutorial, we learned about Gym reward functions and their implementation in the context of PyTorch. We used the OpenAI Gym library to create a simple environment, defined a neural network using PyTorch, and trained an agent using the REINFORCE algorithm.

Next, you can explore more complex environments and other reinforcement learning algorithms, such as Deep Q-Networks (DQN), Proximal Policy Optimization (PPO), and actor-critic methods.