## What is ⅂ᴚ?

So far, we have modelled the policy, value and Q-functions. In this session, we will use an approach to RL called Upside-Down RL (⅂ᴚ), in which we model the behaviour function instead.

This was outlined in Schmidhuber's December 2019 paper [Reinforcement Learning Upside Down: Don’t Predict Rewards - Just Map Them to Actions](https://arxiv.org/pdf/1912.02875.pdf). <br>
The specific implementation we are following is outlined in the following paper: [Training Agents using Upside-Down Reinforcement Learning](https://arxiv.org/pdf/1912.02877.pdf).

### The Behaviour Function
The behaviour function takes the current state and a command as input, and is trained to output a probability distribution over the actions that lead to that command being fulfilled. In this implementation the command consists of two scalars: a desired reward and a time horizon over which to achieve it.

![](images/udrl_q_vs_b.jpg)

![](images/udrl_training.jpg)

![](images/udrl_optimal_b.jpg)

![](images/udrl_algorithm1.jpg)

![](images/udrl_algorithm2.jpg)

### Implementation details mentioned in the paper

#### Behaviour Function
There is a specific architectural choice used in the paper. The input state and command are transformed by a linear layer and activated using tanh and sigmoid respectively. Then they are multiplied element-wise before being passed on to the next layer in the network.

Rupesh, one of the authors, commented that "This is a simple form of gating used in LSTMs (or more broadly, Fast Weights) because we want contextual processing of the state."
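The gating described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names (`linear`, `sigmoid`, `gated_embedding`), the toy weights and the 3-dimensional embedding are all made up for the example, and the command is assumed to be already scaled.

```python
import math

def linear(x, weights, bias):
    """Affine map: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_embedding(state, command, w_s, b_s, w_c, b_c):
    """Element-wise product of a tanh state embedding and a sigmoid command gate."""
    state_emb = [math.tanh(z) for z in linear(state, w_s, b_s)]  # values in (-1, 1)
    gate = [sigmoid(z) for z in linear(command, w_c, b_c)]       # values in (0, 1)
    return [s * g for s, g in zip(state_emb, gate)]

# Toy example (hypothetical numbers): 2-dim state, 2-dim command, 3-dim embedding.
w_s = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
b_s = [0.0, 0.1, -0.1]
w_c = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_c = [0.0, 0.0, 0.0]

embedding = gated_embedding([0.2, -0.1], [2.0, 2.0], w_s, b_s, w_c, b_c)
closed = gated_embedding([0.2, -0.1], [-100.0, -100.0], w_s, b_s, w_c, b_c)
print(embedding)
print(closed)
```

Because the gate lies in (0, 1), the command can only attenuate each state feature. With the second, strongly negative command every gate is nearly shut, so the embedding collapses towards zero.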

We also multiply the desired reward and horizon used for the command by a hyper-parameter called *command_scale* to scale them down.

#### Replay Buffer
Instead of a normal replay buffer holding the most recent *replay_size* episodes, we store the *replay_size* episodes with the highest returns seen so far. *replay_size* is a hyper-parameter that sets the size of the replay buffer.
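A minimal sketch of this "keep the best episodes" buffer, assuming each episode is a dict with a `'return'` key and that `replay_size` is at least 1. The function name `add_and_truncate` is hypothetical and only used for illustration.

```python
def add_and_truncate(replay_buffer, new_episodes, replay_size):
    """Merge new episodes into the buffer, keeping the replay_size highest returns.

    The result is sorted by return in ascending order, so the best
    episodes sit at the end of the list.
    """
    merged = replay_buffer + new_episodes
    merged.sort(key=lambda ep: ep["return"])
    return merged[-replay_size:]

buffer = [{"return": 12.0}, {"return": 56.0}, {"return": 27.0}]
new = [{"return": 64.0}, {"return": 9.0}]
buffer = add_and_truncate(buffer, new, replay_size=3)
kept_returns = [ep["return"] for ep in buffer]
print(kept_returns)
```

Keeping the list sorted in ascending order means the highest-return episodes can be read off the end with a simple slice.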

#### Training
When sampling values of t1 and t2 for calculating the cost and performing gradient descent on our behaviour function, we randomly sample t1 but set t2 = T, where T is the final time step.
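As a rough sketch of how one supervised training example could be built from a stored episode under this rule: the function name `sample_training_example` and the toy episode are assumptions for illustration, and the episode is assumed to be a dict of equal-length `'observation'`, `'action'` and `'reward'` lists.

```python
import random

def sample_training_example(episode, rng):
    """Return (observation, command, action_label) for a random t1 with t2 = T."""
    T = len(episode["reward"])
    t1 = rng.randrange(T)          # random start index in [0, T)
    t2 = T                         # always the end of the episode
    horizon = t2 - t1              # steps remaining from t1
    desired_return = sum(episode["reward"][t1:t2])  # reward actually obtained from t1 on
    command = [desired_return, horizon]
    return episode["observation"][t1], command, episode["action"][t1]

# Toy 5-step episode with reward 1 per step (as in CartPole).
episode = {
    "observation": [[0.0], [0.1], [0.2], [0.3], [0.4]],
    "action": [0, 1, 1, 0, 1],
    "reward": [1.0, 1.0, 1.0, 1.0, 1.0],
}
rng = random.Random(0)
obs, command, label = sample_training_example(episode, rng)
print(obs, command, label)
```

The action taken at t1 becomes the classification label: given this observation and this command, the episode shows an action that did fulfil the command.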

#### Sampling exploratory commands
When sampling exploratory commands to generate episodes for training, the following procedure is used:
1. The *last_few* episodes from the end of the replay buffer (i.e., those with the highest returns) are selected. *last_few* is a hyper-parameter and remains fixed during training.
2. The desired horizon is set to the mean length of the selected episodes.
3. The desired return is sampled from the uniform distribution U[M, M+S], where M is the mean and S is the standard deviation of the returns of the selected episodes.
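The three steps above can be sketched with the standard library alone. The function name `sample_exploratory_command` and the toy buffer are hypothetical, and the population standard deviation (`statistics.pstdev`) is an assumption made for this sketch.

```python
import random
import statistics

def sample_exploratory_command(sorted_buffer, last_few, rng):
    """Sample [desired_return, desired_horizon] from the best last_few episodes.

    Assumes sorted_buffer is sorted by return in ascending order.
    """
    selected = sorted_buffer[-last_few:]                   # step 1: best episodes
    returns = [ep["return"] for ep in selected]
    lengths = [ep["episode_len"] for ep in selected]
    desired_horizon = statistics.mean(lengths)             # step 2: mean length
    mean_return = statistics.mean(returns)
    std_return = statistics.pstdev(returns)                # population std (assumption)
    desired_return = rng.uniform(mean_return, mean_return + std_return)  # step 3
    return [desired_return, desired_horizon]

toy_buffer = [
    {"return": 10.0, "episode_len": 10},
    {"return": 20.0, "episode_len": 20},
    {"return": 40.0, "episode_len": 40},
    {"return": 60.0, "episode_len": 60},
]
rng = random.Random(0)
command = sample_exploratory_command(toy_buffer, last_few=3, rng=rng)
print(command)
```

Sampling at or above the mean of the best episodes asks the agent for slightly more than it has typically achieved, which is what makes the command exploratory.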

In [31]:
import time
from copy import deepcopy
import gym
import numpy as np
import torch
import torch.nn.functional as F

In [32]:
env = gym.make('CartPole-v0')

In [33]:
#command takes the form [desired reward, desired horizon]
def random_policy(obs, command):
    return np.random.randint(env.action_space.n)

In [34]:
#Visualise agent function
def visualise_agent(policy, command, n=5):
    try:
        for trial_i in range(n):
            current_command = deepcopy(command)
            observation = env.reset()
            done=False
            t=0
            episode_return=0
            while not done:
                env.render()
                #pass current_command (updated each step below), not the initial command
                action = policy(torch.tensor([observation]).double(), torch.tensor([current_command]).double())
                observation, reward, done, info = env.step(action)
                episode_return+=reward
                current_command[0]-= reward
                current_command[1] = max(1, current_command[1]-1)
                t+=1
            env.render()
            time.sleep(1.5)
            print("Episode {} finished after {} timesteps. Return = {}".format(trial_i, t, episode_return))
        env.close()
    except KeyboardInterrupt:
        env.close()

In [35]:
visualise_agent(random_policy, command=[500, 500], n=3)

Episode 0 finished after 9 timesteps. Return = 9.0
Episode 1 finished after 14 timesteps. Return = 14.0
Episode 2 finished after 20 timesteps. Return = 20.0


In [36]:
class FCNN_AGENT(torch.nn.Module):
    def __init__(self, command_scale):
        super().__init__()
        embedding_size=32
        self.command_scale=command_scale
        self.observation_embedding = torch.nn.Sequential(
            torch.nn.Linear(np.prod(env.observation_space.shape), embedding_size), #linear transformation to embedding_size
            torch.nn.Tanh() #tanh activation
        )
        self.command_embedding = torch.nn.Sequential(
            torch.nn.Linear(2, embedding_size), #linear transformation to embedding_size
            torch.nn.Sigmoid() #sigmoid activation
        )
        self.to_output = torch.nn.Sequential(
            torch.nn.Linear(embedding_size, 64), #hidden architecture
            torch.nn.ReLU(),
            torch.nn.Linear(64, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, env.action_space.n) #linear layer to compute output probability logits
        )
    
    def forward(self, observation, command): #takes in observation and command
        obs_embedding = self.observation_embedding(observation) #compute observation embedding
        cmd_embedding = self.command_embedding(command*self.command_scale) #compute command embedding from the scaled command
        embedding = torch.mul(obs_embedding, cmd_embedding) #compute element-wise multiplication of observation and command embedding
        action_prob_logits = self.to_output(embedding) #compute output from embedding
        return action_prob_logits
    
    def create_optimizer(self, lr):
        self.optimizer = torch.optim.Adam(self.parameters(), lr=lr) #create an optimizer object for this network

In [37]:
#Algorithm 2
def collect_experience(policy, replay_buffer, replay_size, last_few, n_episodes=100, log_to_tensorboard=True):
    global i_episode
    init_replay_buffer = deepcopy(replay_buffer) #make copy of initial replay buffer so we can use it to sample commands
    try:
        for _ in range(n_episodes):
            command = sample_command(init_replay_buffer, last_few) #sample exploratory command
            if log_to_tensorboard: #only log when a SummaryWriter has been created
                writer.add_scalar('Command desired reward/Episode', command[0], i_episode)    # write desired reward to a graph
                writer.add_scalar('Command horizon/Episode', command[1], i_episode)    # write desired horizon to a graph
            observation = env.reset()
            episode_mem = {'observation':[],
                           'action':[],
                           'reward':[]} #initialize episode memory
            done=False
            while not done:
                action = policy(torch.tensor([observation]).double(), torch.tensor([command]).double()) #get action from policy
                new_observation, reward, done, info = env.step(action) #step in environment
                
                #append transition to episode memory
                episode_mem['observation'].append(observation)
                episode_mem['action'].append(action)
                episode_mem['reward'].append(reward)
                
                
                observation=new_observation
                command[0]-= reward #reduce command reward by reward received
                command[1] = max(1, command[1]-1) #reduce command horizon by one as we just took a step
            episode_mem['return']=sum(episode_mem['reward']) #store return for current episode to make sorting easier
            episode_mem['episode_len']=len(episode_mem['observation']) #store length of memory
            replay_buffer.append(episode_mem) #add memory to replay buffer
            i_episode+=1
            if log_to_tensorboard: writer.add_scalar('Return/Episode', sum(episode_mem['reward']), i_episode)    # write return to a graph
            print("Episode {} finished after {} timesteps. Return = {}".format(i_episode, len(episode_mem['observation']), sum(episode_mem['reward'])))
        env.close()
    except KeyboardInterrupt:
        env.close()
    replay_buffer = sorted(replay_buffer, key=lambda x:x['return'])[-replay_size:] #sort replay_buffer by return and truncate to replay_size
    return replay_buffer

def sample_command(replay_buffer, last_few):
    if len(replay_buffer)==0:
        return [1, 1]
    else:
        command_samples = replay_buffer[-last_few:] #select the last_few memories
        lengths = [mem['episode_len'] for mem in command_samples] #get lengths of selected episodes
        returns = [mem['return'] for mem in command_samples] #get returns of selected episodes
        mean_return, std_return = np.mean(returns), np.std(returns) #calculate mean and standard deviation of returns
        command_horizon = np.mean(lengths) #calculate mean length of episodes
        desired_reward = np.random.uniform(mean_return, mean_return+std_return) #sample desired reward from uniform distribution
        return [desired_reward, command_horizon]

In [43]:
def train_net(behaviour_func, replay_buffer, n_updates=100, batch_size=64, log_to_tensorboard=True):
    global i_updates
    all_costs = []
    for i in range(n_updates): #for each update, build a batch and take one gradient step
        batch_observations = np.zeros((batch_size, np.prod(env.observation_space.shape))) #create empty input observations array of the correct shape
        batch_commands = np.zeros((batch_size, 2)) #create empty input commands array of the correct shape
        batch_label = np.zeros((batch_size)) #create empty labels array of the correct shape
        for b in range(batch_size): #add items to the batch sampled from replay buffer
            sample_episode = np.random.randint(0, len(replay_buffer)) #sample episode index
            sample_t1 = np.random.randint(0, replay_buffer[sample_episode]['episode_len']) #sample t1
            sample_t2 = replay_buffer[sample_episode]['episode_len'] #set t2 = length of the episode
            sample_horizon = sample_t2-sample_t1 #calculate horizon from t1 and t2
            sample_obs = replay_buffer[sample_episode]['observation'][sample_t1] #sample observation
            sample_desired_return = sum(replay_buffer[sample_episode]['reward'][sample_t1:sample_t2]) #sample desired return
            label = replay_buffer[sample_episode]['action'][sample_t1] #get label (action)
            batch_observations[b] = sample_obs #set the bth batch item observation to the sampled observation
            batch_commands[b] = [sample_desired_return, sample_horizon] #set the bth batch item command to the sampled command
            batch_label[b] = label #set the bth batch item label to the sampled label
        batch_observations = torch.tensor(batch_observations).double() #convert sampled batch observation to double tensor
        batch_commands = torch.tensor(batch_commands).double() #convert sampled batch commands to double tensor
        batch_label = torch.tensor(batch_label).long() #convert sampled batch label to long tensor
        pred = behaviour_func(batch_observations, batch_commands) #make prediction over action distribution using behaviour function
        cost = F.cross_entropy(pred, batch_label) #calculate cross entropy loss
        if log_to_tensorboard: writer.add_scalar('Cost/NN update', cost.item() , i_updates)    # write loss to a graph
        all_costs.append(cost.item()) #append current cost to all_costs
        cost.backward() #calculate gradient of cost wrt to weights
        behaviour_func.optimizer.step() #take gradient step to update weights
        behaviour_func.optimizer.zero_grad() #reset stored gradient to zero
        i_updates+=1
    return np.mean(all_costs)

In [44]:
def create_greedy_policy(behaviour_func):
    def policy(obs, command):
        action_logits = behaviour_func(obs, command) #get action logits from network
        action = np.argmax(action_logits.detach().numpy()) #choose action with highest probability
        return action
    return policy

def create_stochastic_policy(behaviour_func):
    def policy(obs, command):
        action_logits = behaviour_func(obs, command) #get action logits from network
        action_probs = F.softmax(action_logits, dim=-1) #perform softmax on logits to get action probabilities
        action = torch.distributions.Categorical(action_probs).sample().item() #sample from our action distribution
        return action
    return policy

In [45]:
i_episode=0
i_updates=0 #number of parameter updates to the neural network
replay_buffer = [] #initialize replay buffer
log_to_tensorboard = True 

replay_size = 600 #replay buffer size
last_few = 75 #last_few to use when sampling exploratory commands
batch_size = 32 #batch size when training network
n_warm_up_episodes = 50 #number of warm up episodes
n_episodes_per_iter = 50 #numbers of episodes per iteration of algorithm 1
n_updates_per_iter = 300 #number of gradient updates per iteration of algorithm 1
command_scale = 0.01 #number to multiply command by to scale it down
lr = 0.001 #learning rate for neural network

behaviour_func = FCNN_AGENT(command_scale).double() #initialize behaviour function
behaviour_func.create_optimizer(lr) #create behaviour function optimizer

stochastic_policy = create_stochastic_policy(behaviour_func) #create stochastic policy
greedy_policy = create_greedy_policy(behaviour_func) #create greedy policy

In [46]:
# SET UP TRAINING VISUALISATION
if log_to_tensorboard:
    from torch.utils.tensorboard import SummaryWriter
    writer = SummaryWriter() # we will use this to show our model's performance on a graph using tensorboard

In [47]:
replay_buffer = collect_experience(random_policy, replay_buffer, replay_size, last_few, n_warm_up_episodes, log_to_tensorboard)#collect experience from warm up episodes with a random policy
train_net(behaviour_func, replay_buffer, n_updates_per_iter, batch_size, log_to_tensorboard) #train the network with the warm up episodes

Episode 1 finished after 15 timesteps. Return = 15.0
Episode 2 finished after 12 timesteps. Return = 12.0
Episode 3 finished after 56 timesteps. Return = 56.0
Episode 4 finished after 27 timesteps. Return = 27.0
Episode 5 finished after 18 timesteps. Return = 18.0
Episode 6 finished after 14 timesteps. Return = 14.0
Episode 7 finished after 17 timesteps. Return = 17.0
Episode 8 finished after 25 timesteps. Return = 25.0
Episode 9 finished after 11 timesteps. Return = 11.0
Episode 10 finished after 38 timesteps. Return = 38.0
Episode 11 finished after 64 timesteps. Return = 64.0
Episode 12 finished after 32 timesteps. Return = 32.0
Episode 13 finished after 22 timesteps. Return = 22.0
Episode 14 finished after 17 timesteps. Return = 17.0
Episode 15 finished after 20 timesteps. Return = 20.0
Episode 16 finished after 42 timesteps. Return = 42.0
Episode 17 finished after 15 timesteps. Return = 15.0
Episode 18 finished after 36 timesteps. Return = 36.0
Episode 19 finished after 64 timestep

0.6888335075134364

In [None]:
n_iters=1000 #number of iterations
for i in range(n_iters):
    replay_buffer = collect_experience(stochastic_policy, replay_buffer, replay_size, last_few, n_episodes_per_iter, log_to_tensorboard) #collect experience using behaviour function policy
    train_net(behaviour_func, replay_buffer, n_updates_per_iter, batch_size, log_to_tensorboard) #train the network with the collected experience

In [None]:
#torch.save(agent.state_dict(), 'checkpoints/lunar_lander_64x64_checkpoint_0.pt')
#agent.load_state_dict(torch.load('checkpoints/lunar_lander_32x32_checkpoint_0.pt'))

In [49]:
visualise_agent(greedy_policy, command=[200, 200], n=5)

Episode 0 finished after 200 timesteps. Return = 200.0
Episode 1 finished after 200 timesteps. Return = 200.0
Episode 2 finished after 200 timesteps. Return = 200.0
Episode 3 finished after 200 timesteps. Return = 200.0
Episode 4 finished after 200 timesteps. Return = 200.0


In [73]:
visualise_agent(stochastic_policy, command=[150, 400], n=5)

Episode 0 finished after 101 timesteps. Return = -131.8027246567254
