# Training Agents using Upside-Down Reinforcement Learning

... TODO: Intro ...

In [3]:
import torch
from collections import namedtuple

### Hyperparameters

In [13]:
max_return_achievable = 1
num_warm_up_episodes = 50
horizon_scale = 0.02
return_scale = 0.02
buffer_size = 1000
learning_rate = 0.002
# TODO: find out more

### Helper functions:

In [7]:
make_episode = namedtuple('Episode', 
                          field_names=['states', 
                                       'actions', 
                                       'rewards', 
                                       'total_return', 
                                       'length'])

## 2.3.1 Replay Buffer

UDRL does not explicitly maximize returns, but instead relies on exploration to continually discover higher return trajectories so that the behavior function can be trained on them. To drive learning progress, we found it helpful to use a replay buffer containing a fixed maximum number of trajectories with the highest returns seen so far, sorted in increasing order by return. The maximum buffer size is a hyperparameter. Since the agent starts learning with zero experience, an initial set of trajectories is generated by executing random actions in the environment. The trajectories are added to the replay buffer and used to start training the agent’s behavior function.

In [8]:
# TODO: I guess we need to get samples

class ReplayBuffer():
    def __init__(self, max_size):
        self.max_size = max_size
        self.buffer = []
        
    def add(self, episode):
        self.buffer.append(episode)
    
    def sort_and_trim(self):
        # Keep only the max_size highest-return episodes, 
        # sorted in increasing order of return.
        self.buffer = sorted(self.buffer, 
                             key=lambda episode: episode.total_return)[-self.max_size:]
    
    def __len__(self):
        return len(self.buffer)
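
The TODO above mentions drawing samples from the buffer. The following is a minimal standalone sketch of the sort-and-trim idea together with one possible sampling helper. `keep_best` and `sample_episodes` are hypothetical names, and uniform sampling with replacement is an assumption, not something the quoted text prescribes.

```python
import random
from collections import namedtuple

# Same fields as the Episode tuple defined in the helper cell.
Episode = namedtuple('Episode',
                     ['states', 'actions', 'rewards', 'total_return', 'length'])

def keep_best(episodes, max_size):
    """Return the max_size highest-return episodes, in increasing order of return."""
    return sorted(episodes, key=lambda e: e.total_return)[-max_size:]

def sample_episodes(episodes, batch_size, rng=random):
    """Hypothetical helper: draw batch_size episodes uniformly, with replacement."""
    return [rng.choice(episodes) for _ in range(batch_size)]

# Toy data: five one-step episodes with returns 3, 0, 4, 1, 2.
toy = [Episode([0], [0], [float(r)], float(r), 1) for r in [3, 0, 4, 1, 2]]

best = keep_best(toy, max_size=3)
best_returns = [e.total_return for e in best]   # the three highest returns
batch = sample_episodes(best, batch_size=8)
```

Sampling with replacement keeps the helper usable even when the buffer holds fewer episodes than the requested batch size, which is the situation right after warm-up.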

## 3.2 Setup

All agents were implemented using artificial neural networks. The behavior function for UDRL agents was implemented using fully-connected feed-forward networks for LunarLander-v2, and convolutional neural networks (CNNs; 16) for TakeCover-v0. The command inputs were scaled by a fixed scaling factor, transformed by a fully-connected sigmoidal layer, and then multiplied element-wise with an embedding of the observed inputs (after the first layer for fully-connected networks; after all convolutional layers for CNNs). Apart from this small modification regarding UDRL command inputs, the network architectures were identical for all algorithms.

In [1]:
import torch

class Behavior(torch.nn.Module):
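    def __init__(self, 
                 state_size, 
                 action_size, 
                 hidden_size, 
                 command_scale=(1, 1)):
        super().__init__()
        
        # Registered as a buffer so it follows the module across devices 
        # and is saved in the state dict without being trained.
        self.register_buffer('command_scale', 
                             torch.tensor(command_scale, dtype=torch.float32))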
        
        self.state_fc = torch.nn.Sequential(torch.nn.Linear(state_size, 
                                                            hidden_size), 
                                            torch.nn.Sigmoid())
        
        self.command_fc = torch.nn.Sequential(torch.nn.Linear(2, hidden_size), 
                                              torch.nn.Sigmoid())
        
        self.output_fc = torch.nn.Sequential(torch.nn.Linear(hidden_size, 
                                                             hidden_size), 
                                             torch.nn.ReLU(), 
                                             torch.nn.Linear(hidden_size, 
                                                             action_size))
        
    
    def forward(self, state, command):
        state_output = self.state_fc(state)
        command_output = self.command_fc(command * self.command_scale)
        embedding = torch.mul(state_output, command_output)
        return self.output_fc(embedding)
    
    def action(self, state, command):
        # Sample an action from the categorical distribution defined by the logits.
        logits = self.forward(state, command)
        dist = torch.distributions.Categorical(logits=logits)
        return dist.sample()
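
To make the gating step concrete, here is a small standalone sketch of how a scaled command modulates the state embedding for a batch of inputs. The sizes, the random observations, and the command layout `[desired return, desired horizon]` are assumptions chosen for illustration. The scale values reuse `return_scale` and `horizon_scale` from the hyperparameter cell.

```python
import torch

torch.manual_seed(0)

state_size, hidden_size, batch_size = 8, 16, 4

# Stand-ins for the two embedding layers of the behavior network.
state_fc = torch.nn.Sequential(torch.nn.Linear(state_size, hidden_size),
                               torch.nn.Sigmoid())
command_fc = torch.nn.Sequential(torch.nn.Linear(2, hidden_size),
                                 torch.nn.Sigmoid())

# Assumed command layout: [desired return, desired horizon].
command_scale = torch.tensor([0.02, 0.02])

states = torch.randn(batch_size, state_size)                   # made-up observations
commands = torch.tensor([[200.0, 100.0]]).repeat(batch_size, 1)

gate = command_fc(commands * command_scale)   # sigmoid output, shape (4, 16)
embedding = state_fc(states) * gate           # element-wise gating, shape (4, 16)
```

Because the gate comes out of a sigmoid, each hidden unit of the state embedding is scaled by a factor between 0 and 1 that depends only on the command.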

## Algorithm 1: Upside-Down Reinforcement Learning: High-level Description

<img src="images/udrl_algo1.png" />

In [2]:
def initialize_replay_buffer(behavior, buffer, num_episodes):
    pass

def initialize_behavior_function():
    pass

def stopping_criteria():
    pass

# TODO
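
As a rough guide for filling in the stubs above, the high-level loop might be wired together as sketched below. This is a skeleton only: `udrl_loop` and all of its callable arguments are hypothetical, and the training and command-sampling steps are assumptions about what the algorithm figure describes, not a transcription of it.

```python
def udrl_loop(init_buffer, init_behavior, train, sample_command,
              generate_episode, should_stop, episodes_per_iter=2):
    """Hypothetical high-level loop; every step is injected as a callable."""
    buffer = init_buffer()        # warm-up episodes, e.g. from random actions
    behavior = init_behavior()
    iteration = 0
    while not should_stop(iteration, buffer):
        # Assumed step: improve the behavior function on buffered episodes.
        behavior = train(behavior, buffer)
        for _ in range(episodes_per_iter):
            # Assumed step: pick an exploratory command from the buffer.
            command = sample_command(buffer)
            buffer.append(generate_episode(behavior, command))
        iteration += 1
    return behavior, buffer

# Toy wiring with stand-in callables; "episodes" are plain floats here.
behavior, buffer = udrl_loop(
    init_buffer=lambda: [0.0, 1.0],
    init_behavior=lambda: "untrained",
    train=lambda behavior, buffer: "trained",
    sample_command=lambda buffer: max(buffer),
    generate_episode=lambda behavior, command: command + 1.0,
    should_stop=lambda iteration, buffer: iteration >= 3,
)
```

Passing each step in as a callable keeps the loop testable before the real buffer, network, and environment are in place.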

## Algorithm 2: Generates an Episode using the Behavior Function

<img src="images/udrl_algo2.png" />

In [None]:
def generate_episode(behavior, state, command=(1, 1)):
    pass
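
Until the stub is filled in, the sketch below shows one way an episode rollout with a running command could look. It is not the notebook's implementation. `CountdownEnv`, `rollout_sketch`, and the policy signature are hypothetical. The command layout `[desired return, desired horizon]` and the update rule (subtract the reward, decrement the horizon, keep it at 1 or more) are assumptions.

```python
from collections import namedtuple

Episode = namedtuple('Episode',
                     ['states', 'actions', 'rewards', 'total_return', 'length'])

class CountdownEnv:
    """Hypothetical toy environment: ends after `steps` steps, reward 1.0 for action 1."""
    def __init__(self, steps=5):
        self.steps = steps
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        return self.t, reward, self.t >= self.steps

def rollout_sketch(env, policy, command):
    """Run one episode; policy(state, command) returns an action."""
    desired_return, desired_horizon = command
    state = env.reset()
    states, actions, rewards = [], [], []
    done = False
    while not done:
        action = policy(state, [desired_return, desired_horizon])
        next_state, reward, done = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        # Assumed command update after each step.
        desired_return -= reward
        desired_horizon = max(desired_horizon - 1, 1)
        state = next_state
    return Episode(states, actions, rewards, sum(rewards), len(rewards))

# Stand-in policy that ignores the command and always picks action 1.
episode = rollout_sketch(CountdownEnv(steps=5),
                         lambda state, command: 1,
                         command=[5.0, 5])
```

The returned tuple has the same fields as the `Episode` helper, so it could be passed straight to `ReplayBuffer.add`.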