# Training Agents using Upside-Down Reinforcement Learning

... TODO: Intro ...

In [17]:
import numpy as np
import torch
from helpers import make_episode

## Upside-Down RL Hyperparameters

<img src="images/hyperparams.png" />

In [15]:
batch_size = 128
horizon_scale = 0.02
last_few = 100
learning_rate = 0.002
n_episodes_per_iter = 100
n_updates_per_iter = 15
n_warm_up_episodes = 50
replay_size = 700
return_scale = 0.02

## 2.3.1 Replay Buffer

UDRL does not explicitly maximize returns, but instead relies on exploration to continually discover higher-return trajectories so that the behavior function can be trained on them. To drive learning progress, we found it helpful to use a replay buffer containing a fixed maximum number of trajectories with the highest returns seen so far, sorted in increasing order by return. The maximum buffer size is a hyperparameter. Since the agent starts learning with zero experience, an initial set of trajectories is generated by executing random actions in the environment. The trajectories are added to the replay buffer and used to start training the agent’s behavior function.

In [8]:
# TODO: I guess we need to get samples

class ReplayBuffer():
    def __init__(self, max_size):
        self.max_size = max_size
        self.buffer = []
        
    def add(self, episode):
        self.buffer.append(episode)
    
    def sort_and_trim(self):
        # Sort ascending by return so the best episodes sit at the end,
        # then keep only the `max_size` highest-return episodes.
        self.buffer = sorted(self.buffer,
                             key=lambda episode: episode.total_return)[-self.max_size:]
    
    def __len__(self):
        return len(self.buffer)
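
The TODO above asks how samples are drawn from the buffer. A minimal sketch of one common choice follows: pick a random episode and a random time step, then use the return and the number of steps remaining until the end of that episode as the command for the step. The `Episode` tuple and the function name `sample_training_example` are hypothetical stand-ins; the real episode object comes from `make_episode`, whose fields are assumed here. Episodes are assumed to be non-empty.

```python
import random
from collections import namedtuple

# Hypothetical stand-in for the object returned by `make_episode`.
Episode = namedtuple("Episode",
                     ["states", "actions", "rewards", "total_return", "length"])

def sample_training_example(episodes, rng=random):
    """Return (state, [desired_return, desired_horizon], action) for one random step."""
    episode = rng.choice(episodes)
    t1 = rng.randrange(episode.length)   # random start step
    t2 = episode.length                  # assumption: segment runs to the episode's end
    desired_return = sum(episode.rewards[t1:t2])
    desired_horizon = t2 - t1
    return episode.states[t1], [desired_return, desired_horizon], episode.actions[t1]

episode = Episode(states=["s0", "s1", "s2"], actions=[0, 1, 0],
                  rewards=[1.0, 2.0, 3.0], total_return=6.0, length=3)
state, command, action = sample_training_example([episode])
```

The behavior function would then be trained to predict `action` from `state` and `command`, which is ordinary supervised learning on the buffer contents.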

## 2.3.4 Sampling Exploratory Commands

After each training phase, the agent can attempt to generate new, previously infeasible behavior, potentially achieving higher returns. To profit from such exploration through generalization, one must first create a set of new initial commands c<sub>0</sub> to be used in Algorithm 2. We use the following procedure to sample commands:

1. A number of episodes from the end of the replay buffer (i.e., with the highest returns) are selected. This number is a hyperparameter and remains fixed during training.

2. The exploratory desired horizon d<sup>h</sup><sub>0</sub> is set to the mean of the lengths of the selected episodes.

3. The exploratory desired returns d<sup>r</sup><sub>0</sub> are sampled from the uniform distribution U\[M, M + S\] where M is the mean and S is the standard deviation of the selected episodic returns.

In [18]:
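def sample_command(buffer, num):
    # With no experience yet, fall back to a default command.
    if len(buffer) == 0:
        return [1, 1]

    # 1. Select the `num` highest-return episodes. This assumes
    #    `buffer.sort_and_trim()` has already sorted the buffer in increasing order.
    episodes = buffer.buffer[-num:]

    # 2. Desired horizon: mean length of the selected episodes.
    lengths = [episode.length for episode in episodes]
    desired_horizon = np.mean(lengths)

    # 3. Desired return: sampled uniformly from [M, M + S].
    returns = [episode.total_return for episode in episodes]
    mean_return, std_return = np.mean(returns), np.std(returns)
    desired_return = np.random.uniform(mean_return, mean_return + std_return)

    return [desired_return, desired_horizon]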
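
To make the three steps concrete, here is a small stand-alone check on toy numbers, using only the standard library. The function name `exploratory_command` and the values are illustrative assumptions. `statistics.pstdev` is the population standard deviation, which matches the default behavior of `np.std`.

```python
import random
import statistics

def exploratory_command(returns, lengths, rng=random):
    """Sample [desired_return, desired_horizon] from the selected top episodes."""
    desired_horizon = statistics.mean(lengths)
    mean_return = statistics.mean(returns)
    std_return = statistics.pstdev(returns)
    desired_return = rng.uniform(mean_return, mean_return + std_return)
    return [desired_return, desired_horizon]

# Toy values for the three highest-return episodes in a buffer.
command = exploratory_command(returns=[10.0, 20.0, 30.0], lengths=[5, 7, 9])
```

Here M = 20 and S is about 8.16, so the desired return falls between 20 and roughly 28.16, while the desired horizon is 7.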

## 3.2 Setup

All agents were implemented using artificial neural networks. The behavior function for UDRL agents was implemented using fully-connected feed-forward networks for LunarLander-v2, and convolutional neural networks (CNNs; 16) for TakeCover-v0. The command inputs were scaled by a fixed scaling factor, transformed by a fully-connected sigmoidal layer, and then multiplied element-wise with an embedding of the observed inputs (after the first layer for fully-connected networks; after all convolutional layers for CNNs). Apart from this small modification regarding UDRL command inputs, the network architectures were identical for all algorithms.

In [1]:
import torch

class Behavior(torch.nn.Module):
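    def __init__(self, 
                 state_size, 
                 action_size, 
                 hidden_size, 
                 command_scale=(1, 1)):
        super().__init__()
        
        # Registered as a buffer so it follows the module across devices
        # and is saved with the state dict, without being trained.
        self.register_buffer("command_scale",
                             torch.tensor(command_scale, dtype=torch.float32))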
        
        self.state_fc = torch.nn.Sequential(torch.nn.Linear(state_size, 
                                                            hidden_size), 
                                            torch.nn.Sigmoid())
        
        self.command_fc = torch.nn.Sequential(torch.nn.Linear(2, hidden_size), 
                                              torch.nn.Sigmoid())
        
        self.output_fc = torch.nn.Sequential(torch.nn.Linear(hidden_size, 
                                                             hidden_size), 
                                             torch.nn.ReLU(), 
                                             torch.nn.Linear(hidden_size, 
                                                             action_size))
        
    
    def forward(self, state, command):
        state_output = self.state_fc(state)
        command_output = self.command_fc(command * self.command_scale)
        embedding = torch.mul(state_output, command_output)
        return self.output_fc(embedding)
    
    def action(self, state, command):
        logits = self.forward(state, command)
        probs = torch.softmax(logits, dim=-1)
        dist = torch.distributions.Categorical(probs)
        return dist.sample()

## Algorithm 1: Upside-Down Reinforcement Learning: High-level Description

<img src="images/udrl_algo1.png" />

In [2]:
def initialize_replay_buffer(behavior, buffer, num_episodes):
    pass

def initialize_behavior_function():
    pass

def stopping_criteria():
    pass

# TODO
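
The stubs above are still unfilled. As a rough sketch of how the pieces could fit together (not a transcription of Algorithm 1), the loop below takes each step as a callable argument. The function `run_udrl` and its callable parameters are hypothetical; only the hyperparameter names and default values come from the hyperparameter cell in this notebook.

```python
def run_udrl(warm_up, train, explore, should_stop,
             n_warm_up_episodes=50, n_updates_per_iter=15, n_episodes_per_iter=100):
    """Alternate between training the behavior function and collecting episodes."""
    warm_up(n_warm_up_episodes)        # fill the buffer with random-action episodes
    iterations = 0
    while not should_stop(iterations):
        for _ in range(n_updates_per_iter):
            train()                    # one supervised update on buffer samples
        for _ in range(n_episodes_per_iter):
            explore()                  # one episode with a sampled exploratory command
        iterations += 1
    return iterations

# Count calls with stand-in callables to show the control flow.
calls = {"warm_up": 0, "train": 0, "explore": 0}

def count(name, amount=1):
    calls[name] += amount

iterations = run_udrl(
    warm_up=lambda n: count("warm_up", n),
    train=lambda: count("train"),
    explore=lambda: count("explore"),
    should_stop=lambda i: i >= 2,
)
```

Passing the steps in as callables keeps the sketch independent of how the buffer, behavior function, and environment are eventually wired up.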

## Algorithm 2: Generates an Episode using the Behavior Function

<img src="images/udrl_algo2.png" />

In [None]:
def generate_episode(behavior, state, command = [1, 1]):
    states = []
    actions = []
    rewards = []
    steps = 0
    
    # TODO
    
    return make_episode(states, actions, rewards, sum(rewards), steps)
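
The loop body above is still a TODO. One way it could look is sketched below, assuming an environment with `reset()` and `step(action)` returning `(next_state, reward, done)`, and a `policy(state, command)` callable. `CountdownEnv` and `rollout` are hypothetical names used only for illustration. After every step, the desired return is reduced by the reward received and the desired horizon by one; clamping the horizon at 1 is an assumption of this sketch.

```python
class CountdownEnv:
    """Hypothetical toy environment: reward 1.0 per step, ends after `n_steps`."""
    def __init__(self, n_steps):
        self.n_steps = n_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= self.n_steps

def rollout(env, policy, command):
    desired_return, desired_horizon = command
    states, actions, rewards = [], [], []
    state, done = env.reset(), False
    while not done:
        action = policy(state, [desired_return, desired_horizon])
        next_state, reward, done = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        # Update the command: less return left to collect, one step fewer to do it in.
        desired_return -= reward
        desired_horizon = max(desired_horizon - 1, 1)  # assumption: keep horizon >= 1
        state = next_state
    return states, actions, rewards, [desired_return, desired_horizon]

states, actions, rewards, final_command = rollout(
    CountdownEnv(n_steps=3), policy=lambda state, command: 0, command=[3.0, 3])
```

In the notebook, `policy` would wrap `Behavior.action`, and the collected lists would be passed to `make_episode`.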