# Deep Reinforcement Learning (DRL) and Weight Agnostic Networks (WAN) on the Pendulum Environment

In this notebook, we will explore two different approaches to controlling an environment with continuous actions:

1. **Deep Reinforcement Learning (DRL)**: an agent learns a policy with a neural network that is trained to maximize cumulative reward through interaction with the environment.

2. **Weight Agnostic Neural Networks (WAN)**: a method that does not train weights. It relies on the network's structure and activation functions, with fixed (typically shared) weight values, rather than on extensive training.

Our test environment will be the classic **Pendulum** problem, where the goal is to balance a pendulum upright by applying continuous torque.



In [1]:
import math
import random

import numpy as np
import torch

from game import PendulumEnv

pygame 2.6.1 (SDL 2.28.4, Python 3.10.9)
Hello from the pygame community. https://www.pygame.org/contribute.html


# Deep Reinforcement Learning (DRL)

This class implements a simple Deep Reinforcement Learning (DRL) agent using a neural network built with PyTorch. The agent learns how to choose actions based on the current state of the environment to maximize future rewards.

## Key Components

- **Neural Network (`self.mapping`)**  
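  A feedforward network with two hidden layers and ReLU activations. It takes the current state as input and, with `output_size=1`, produces a single scalar. The code uses this scalar in two roles: `explore` applies it directly as the torque, while `rethink` treats it as a Q-value.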

- **Weights Initialization**  
  The linear layers’ weights are initialized with a normal distribution (mean = 0, std = 0.1), and biases are set to zero to help the training start smoothly.

- **Epsilon-Greedy Exploration (`explore` method)**  
  To balance exploring new actions and using what it has learned, the agent picks actions using an epsilon-greedy method:  
  - With a probability `epsilon` (which decreases over time), it selects a **random action** to explore new possibilities.  
  - Otherwise, it selects the **best action** predicted by the neural network.

- **Experience Replay Buffer (`remember` method)**  
  The agent stores its experiences — tuples of `(state, action, reward, next_state)` — in a fixed-size buffer. This helps the agent learn from a diverse set of past experiences and reduces the problem of learning from highly correlated data.

- **Training Step (`rethink` method)**  
  When enough experiences are collected, the agent samples a random batch from the buffer and updates the neural network by minimizing the difference between predicted and target Q-values:  
  - **Target Q-values (`y_true`):** Calculated as the immediate reward plus the discounted best future reward predicted from the next state (`gamma` is the discount factor).  
  - **Predicted Q-values (`y_pred`):** The neural network’s output for the sampled states.  
  The agent uses Mean Squared Error loss and the Adam optimizer to update the network weights.
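
The decay of `epsilon` described above can be sketched in isolation. This is a minimal illustration that reuses the constants from the class (`epsi_low = 0.05`, `epsi_high = 0.9`, `decay = 200`); the helper name `epsilon_at` is hypothetical and not part of the notebook.

```python
import math

EPSI_LOW = 0.05   # floor on the exploration rate (value from the DRL class)
EPSI_HIGH = 0.9   # starting exploration rate (value from the DRL class)
DECAY = 200       # decay constant in steps (value from the DRL class)


def epsilon_at(step: int) -> float:
    """Hypothetical helper: exploration probability after `step` calls to explore."""
    return EPSI_LOW + (EPSI_HIGH - EPSI_LOW) * math.exp(-1.0 * step / DECAY)


# Print the schedule at a few step counts to see how quickly it falls.
for step in (1, 200, 1000):
    print(f"step={step:4d}  epsilon={epsilon_at(step):.3f}")
```

With a decay constant of 200, the exploration rate is already close to its floor after roughly 1000 calls, so most later actions come from the network.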

---

**In short:**  
This DRL agent learns how to map states to actions by interacting with the environment, exploring with some randomness, and improving its decisions based on past experiences. The epsilon-greedy approach balances trying new actions and using known good ones, while experience replay and neural network training steadily improve the agent’s performance.
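
The fixed-size replay buffer can be illustrated on its own. The sketch below is not the notebook's implementation: it swaps the plain list for `collections.deque` with `maxlen`, which gives the same drop-the-oldest behavior, and it uses a tiny capacity and placeholder transitions purely for demonstration.

```python
import random
from collections import deque

CAPACITY = 5  # tiny capacity for illustration; the class uses 10000

# deque(maxlen=...) drops the oldest item automatically when full,
# which mirrors the pop(0)-then-append logic in `remember`.
buffer = deque(maxlen=CAPACITY)

for t in range(8):
    # Placeholder transition: (state, action, reward, next_state)
    buffer.append((t, 0.0, -1.0, t + 1))

oldest_state = buffer[0][0]
batch = random.sample(list(buffer), 3)  # uniform sampling without replacement
print(len(buffer), oldest_state, len(batch))
```

Sampling uniformly from the buffer, rather than training on consecutive steps, is what breaks up the correlation between successive transitions.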


In [2]:
class DRL(torch.nn.Module):
    def __init__(self, input_size, hidden_size, output_size, learning_rate=0.001):
        super(DRL, self).__init__()
        self.input_size = input_size
        self.output_size = output_size
        self.mapping = torch.nn.Sequential(torch.nn.Linear(input_size, hidden_size), torch.nn.ReLU(), torch.nn.Linear(hidden_size, hidden_size), torch.nn.ReLU(), torch.nn.Linear(hidden_size, output_size))
        self.apply(self.__class__.weights_init)  #TPJ
        self.optimizer = torch.optim.Adam(self.parameters(), lr=learning_rate)
        self.criterion = torch.nn.MSELoss()
        self.steps = 0
        self.buffer = []
        self.epsi_low = 0.05
        self.epsi_high = 0.9
        self.gamma = 0.8
        self.decay = 200
        self.capacity = 10000
        self.batch_size = 64

    @staticmethod
    def weights_init(m):
        # Applied to every submodule via self.apply; only Linear layers are touched
        if isinstance(m, torch.nn.Linear):
            torch.nn.init.normal_(m.weight, mean=0.0, std=0.1)
            torch.nn.init.constant_(m.bias, val=0.0)

    def explore(self, state):
        self.steps += 1
        epsilon = self.epsi_low + (self.epsi_high - self.epsi_low) * math.exp(-1.0 * self.steps / self.decay)
        if random.random() < epsilon:
            # Explore: sample a random torque uniformly from [-2.0, 2.0]
            action = random.uniform(-2.0, 2.0)
        else:
            # Exploit: use the network's scalar output as the torque (not clipped to the action range)
            state = torch.tensor(state, dtype=torch.float).view(1, -1)
            with torch.no_grad():
                action = self.mapping(state).item()
        return action

    def remember(self, *transition):
        if len(self.buffer) == self.capacity:
            self.buffer.pop(0)
        self.buffer.append(transition)

    def rethink(self):
        if len(self.buffer) >= self.batch_size:
            state_old, action_now, reward_now, state_new = zip(*random.sample(self.buffer, self.batch_size))

            state_old = torch.tensor(np.array(state_old), dtype=torch.float)
            action_now = torch.tensor(np.array(action_now), dtype=torch.float).view(self.batch_size, -1)
            reward_now = torch.tensor(np.array(reward_now), dtype=torch.float).view(self.batch_size, -1)
            state_new = torch.tensor(np.array(state_new), dtype=torch.float)

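            # TD target: reward plus the discounted maximum next-state output (detached, so no gradient flows through it)
            y_true = reward_now + self.gamma * torch.max(self.mapping(state_new).detach(), dim=1)[0].view(self.batch_size, -1)

            # The network has a single output, so y_pred depends on the state only; action_now is not used here
            y_pred = self.mapping(state_old)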

            loss = self.criterion(y_pred, y_true)

            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
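
The target computed in `rethink` follows the one-step rule `y_true = reward + gamma * max(Q(next_state))`. The sketch below works through that arithmetic in plain Python for a single transition. The Q-value numbers are made up for illustration, and `td_target` and `squared_error` are hypothetical helpers, not part of the notebook. With the notebook's single-output network the maximum is taken over one value, so it reduces to that value.

```python
GAMMA = 0.8  # discount factor used by the DRL class


def td_target(reward, next_q_values, gamma=GAMMA):
    """Hypothetical helper: one-step target from a reward and next-state Q-values."""
    return reward + gamma * max(next_q_values)


def squared_error(prediction, target):
    """Per-sample squared error, the quantity that MSELoss averages over a batch."""
    return (prediction - target) ** 2


# Illustrative numbers only (not taken from a real run).
target = td_target(reward=-1.0, next_q_values=[-5.0])
loss = squared_error(prediction=-4.0, target=target)
print(target, loss)
```

Here the target is `-1.0 + 0.8 * (-5.0)`, and the loss measures how far the current prediction is from it.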


In [3]:
def drl(environment):
    # Named `agent` so the local variable does not shadow this function's name
    agent = DRL(environment.observation_space, 256, 1, learning_rate=0.001)
    exploration_steps = 1000  # transitions collected with purely random actions
    for epoch in range(10):
        state_old = environment.reset()
        rewards = 0
        max_steps = 2000
        steps = 0

        while steps < max_steps:
            steps += 1
            environment.render()

            if len(agent.buffer) < exploration_steps:
                action_now = random.uniform(*environment.action_space)
            else:
                action_now = agent.explore(state_old)

            state_new, reward_now, done, _ = environment.step(action_now)
            agent.remember(state_old, action_now, reward_now, state_new)
            rewards += reward_now
            state_old = state_new

            # rethink() is a no-op until the buffer holds at least batch_size transitions
            if len(agent.buffer) >= agent.batch_size:
                agent.rethink()

            if done:
                break

        print(f'epoch={epoch:04d}, rewards={rewards:.2f}, step={steps}')


In [9]:
env = PendulumEnv()
print("DRL")
drl(env)
env.close()

DRL
epoch=0000, rewards=-11812.68, step=1067
epoch=0001, rewards=-10494.94, step=312
epoch=0002, rewards=-638.35, step=1241
epoch=0003, rewards=-10174.40, step=86
epoch=0004, rewards=-200.50, step=736
epoch=0005, rewards=-918.91, step=1435
epoch=0006, rewards=-86.76, step=267
epoch=0007, rewards=-260.19, step=506
epoch=0008, rewards=-623.31, step=618
epoch=0009, rewards=-10179.22, step=86


## Summary

- **Setup:**  
  The DRL agent uses a neural network with two hidden layers to predict actions based on the current pendulum state. It learns through interaction with the environment by balancing exploration and exploitation.

- **Training Process:**  
  - Initially takes uniformly random actions until the replay buffer holds 1000 transitions (exploration phase).
  - Samples batches of 64 past transitions from the experience replay buffer to train the network.
  - Then follows an epsilon-greedy strategy in which the probability of a random action decays exponentially from 0.9 toward 0.05.

- **Results:**  
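  - Rewards vary widely across epochs, from -86.76 (epoch 6) to -11812.68 (epoch 0). Four epochs (0, 1, 3, and 9) end below -10000, reflecting the large penalty applied when the pendulum falls below a threshold.
  - Training is noisy and unstable over 10 epochs: the best rewards appear in the middle epochs, but epoch 9 ends at -10179.22 after only 86 steps, so there is no steady improvement.
  - No episode reaches the 2000-step cap. The longest runs 1435 steps (epoch 5), which indicates only partial success in controlling the pendulum.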
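
The printed training log can be summarized with a few lines of Python, which makes the noise in the results easier to read. The numbers below are copied from the output above; the `-10000` cutoff for a "heavy penalty" episode is an assumption chosen only to separate the two visible clusters of rewards.

```python
# (epoch, total reward, steps) copied from the printed training log
log = [
    (0, -11812.68, 1067),
    (1, -10494.94, 312),
    (2, -638.35, 1241),
    (3, -10174.40, 86),
    (4, -200.50, 736),
    (5, -918.91, 1435),
    (6, -86.76, 267),
    (7, -260.19, 506),
    (8, -623.31, 618),
    (9, -10179.22, 86),
]
MAX_STEPS = 2000          # per-episode step cap used in the training loop
PENALTY_CUTOFF = -10000   # assumed cutoff for "heavy penalty" episodes

best_epoch, best_reward, _ = max(log, key=lambda row: row[1])
longest_steps = max(steps for _, _, steps in log)
heavy_penalty_epochs = [epoch for epoch, reward, _ in log if reward < PENALTY_CUTOFF]
hit_cap = any(steps >= MAX_STEPS for _, _, steps in log)

print(f"best epoch: {best_epoch} (reward {best_reward:.2f})")
print(f"longest episode: {longest_steps} steps, cap reached: {hit_cap}")
print(f"heavy-penalty epochs: {heavy_penalty_epochs}")
```

A single 10-epoch run is a small sample, so these figures describe this run only and say little about how the agent would behave with a different random seed.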