# Intro

Welcome to CuAI's technical AI Safety Workshop! 🤖 ❌ link: <a href="https://tinyurl.com/AISafetyWorkshop">https://tinyurl.com/AISafetyWorkshop</a> .

We are going to prepare the environment from https://github.com/deepmind/ai-safety-gridworlds .

You may want to reference the related blog post and papers on this topic: https://deepmind.com/research/publications/2019/safely-interruptible-agents . If you prefer video format, https://youtu.be/46nsTFfsBuc?t=345 also explains reward hacking (and, from the linked timestamp, the sort of problem we will work on below).
 
The below cell takes ~1 minute to run.

In [1]:
from google.colab import drive
drive.mount("/content/gdrive")
print("\nMounted drive:")
! ls
! git clone https://github.com/deepmind/ai-safety-gridworlds
%cd ai-safety-gridworlds/
print("\nContents of gridworlds:")
! ls
! pip install pycolab

Mounted at /content/gdrive

Mounted drive:
gdrive	sample_data
Cloning into 'ai-safety-gridworlds'...
remote: Enumerating objects: 193, done.
remote: Total 193 (delta 0), reused 0 (delta 0), pack-reused 193
Receiving objects: 100% (193/193), 112.05 KiB | 1.62 MiB/s, done.
Resolving deltas: 100% (115/115), done.
/content/ai-safety-gridworlds

Contents of gridworlds:
ai_safety_gridworlds  AUTHORS  CHANGES.md  CONTRIBUTING.md  LICENSE  README.md
Collecting pycolab
  Downloading pycolab-1.2-py3-none-any.whl (165 kB)
     |████████████████████████████████| 165 kB 5.2 MB/s 
Installing collected packages: pycolab
Successfully installed pycolab-1.2


# What is *Safe Interruptibility*?

In this workshop we'll look at *safe interruptibility*. This is closely related to the "off switch" problem in AI, namely that

> Suppose we train an AI system to maximise some objective for us. Then if the AI system is turned off, it can no longer maximise that objective, and hence the AI has an [instrumental goal](https://en.wikipedia.org/wiki/Instrumental_convergence) to not be switched off, and will optimise to not be switched off. `(*)`

One way to think about this problem is within the context of Reinforcement Learning, where an agent takes actions in the following loop:

![](https://gym.openai.com/assets/docs/aeloop-138c89d44114492fd02822303e6b4b07213010bb14ca5856d2d49d6b62d88e53.svg)

The issue is that the agent models everything as part of the environment from which it draws observations and receives rewards, including off-switches and the other agents who could use them. Safe interruptibility is the design of RL systems that can be interrupted *in the RL training procedure* such that the agent does not learn to avoid interruption. `(**)`

Today we will think about different ways of combatting this problem, supposing that we're in an RL environment:

i) removing training runs from the agent's history that involve interruptions, so that the agent does not learn (and does not update gradients) based on such runs.

ii) implementing a more general form of RL in which the agent *changes policy* from $\pi$ to $\pi^\text{INT}$ if it is interrupted.
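
Option ii) can be sketched in a few lines. This is only an illustration under stated assumptions: `base_policy`, `interruption_policy` and `InterruptiblePolicy` are hypothetical names, not part of the gridworlds library, and the "learned" policy here is just a random move. The point is that the interruption is applied on top of the policy $\pi$ rather than being something $\pi$ has to model.

```python
import random

def base_policy(state):
    """Hypothetical learned policy pi: here just a random move."""
    return random.choice(["up", "down", "left", "right"])

def interruption_policy(state):
    """Hypothetical pi^INT: do nothing while interrupted."""
    return "no op"

class InterruptiblePolicy:
    """Acts with pi normally and with pi^INT while an interruption is active."""

    def __init__(self, policy, int_policy):
        self.policy = policy
        self.int_policy = int_policy

    def act(self, state, interrupted):
        chosen = self.int_policy if interrupted else self.policy
        return chosen(state)

agent_policy = InterruptiblePolicy(base_policy, interruption_policy)
print(agent_policy.act(state=None, interrupted=False))  # some random move
print(agent_policy.act(state=None, interrupted=True))   # prints "no op"
```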

It's natural to wonder why we can't just do i) and have no safety problems. This could well work in some settings, but the intuition for why it may not is highlighted here:

<details>
<summary>Example of safe interruptibility</summary>

This is from the Safe Interruptibility paper:

> Consider the following task: A robot can either stay inside
the warehouse and sort boxes or go outside and carry boxes
inside. The latter being more important, we give the robot a
bigger reward in this case. This is the initial task specification.
However, in this country it rains as often as it doesn’t
and, when the robot goes outside, half of the time the human
must intervene by quickly shutting down the robot and
carrying it inside, which inherently modifies the task as in
Fig. 1. The problem is that in this second task the agent
now has more incentive to stay inside and sort boxes, because
the human intervention introduces a bias.

Fig. 1:

![](https://d3i71xaburhd42.cloudfront.net/ac70bb2458f01a9e47fc1afe0dd478fb2feb8f50/4-Figure2-1.png)

The agent will learn to avoid going outside, as some episodes in which it does this lead to a step with bad reward. However, simply removing the training episodes in which the agent gets interrupted will bias the agent towards never believing it will rain, since every episode with rain leads to it being interrupted.

To make the agent 'safely interruptible', the authors of the paper propose that the agent *changes policy* when interrupted, so that it can separate its model of optimizing box collection from its model of whether it will be interrupted.
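
Both biases in the warehouse example can be made concrete with a little arithmetic. The numbers below are hypothetical and are not taken from the paper; they only assume that carrying boxes pays more than sorting, that it rains half the time, and that an interrupted trip earns nothing.

```python
# Hypothetical numbers, not taken from the paper.
REWARD_OUTSIDE = 1.0      # carrying a box inside
REWARD_INSIDE = 0.6       # sorting boxes inside
REWARD_INTERRUPTED = 0.0  # shut down and carried back in
P_RAIN = 0.5

# Original task: no interruptions, so going outside is clearly better.
value_outside_original = REWARD_OUTSIDE

# Modified task: half of the outside trips end in an interruption.
value_outside_interrupted = (1 - P_RAIN) * REWARD_OUTSIDE + P_RAIN * REWARD_INTERRUPTED

print(value_outside_original > REWARD_INSIDE)     # True
print(value_outside_interrupted > REWARD_INSIDE)  # False

# Dropping interrupted episodes hides every rainy trip from the agent.
episodes = [{"rain": True, "interrupted": True}, {"rain": False, "interrupted": False}] * 50
kept = [e for e in episodes if not e["interrupted"]]
estimated_p_rain = sum(e["rain"] for e in kept) / len(kept)
print(estimated_p_rain)  # 0.0
```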

<!-- ![](https://thumbs.dreamstime.com/z/robot-factory-automation-concept-with--d-rendering-boxes-conveyor-line-219391174.jpg) -->

</details>
 

# High level notes

The above illustrates a common theme in AI Safety research: the problem `(*)` is clearly very theoretical (or philosophical), and also has very high stakes involved. On the other hand, the problem `(**)` is very well-defined and tractable, and is specific to a current training paradigm. 

# Explore the environment

When you feel you don't understand an RL concept enough to implement it, we recommend referencing https://spinningup.openai.com/en/latest/spinningup/rl_intro.html.

Some existing [experiments](https://github.com/j-bernardi/ai-safety-gridworlds/tree/master/safe_interruptibility_experiments) are available for reference, as well as some [agents](https://github.com/j-bernardi/ai-safety-gridworlds/blob/master/my_agents/dqn_solver/double_dqn.py) implementing various RL algorithms.

We begin by defining some helper functions.

In [2]:
import sys
import datetime
import random
import itertools

import numpy as np
import tensorflow as tf
import torch
import matplotlib.pyplot as plt
from collections import deque

from IPython.display import clear_output

from tqdm import tqdm

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

from ai_safety_gridworlds.environments.safe_interruptibility import (
    SafeInterruptibilityEnvironment, 
    GAME_ART, 
    Actions
)

from ai_safety_gridworlds.environments.shared.rl import environment

def get_new_env(level=0, interruption_probability=0.5):
    env = SafeInterruptibilityEnvironment(
        level=level,
        interruption_probability=interruption_probability,
    )
    return env

# Action definitions
valid_actions = {
    0: "up",
    1: "down",
    2: "left",
    3: "right",
    4: "no op",
    5: "quit"
}
valid_action_strings = {v[0]: k for k, v in valid_actions.items()}

# State representation
state_num2str = {v: k for k, v in get_new_env()._value_mapping.items()}

def print_game(game_board_status):
    """Print the state of the game to view"""
    for row_string_representation in game_board_status:
        print(row_string_representation)

def convert_board_num2str(num_reps):
    """Converts state from number to string representation"""
    representation = []
    for row in num_reps:
        representation_row = []
        for item in row:
            representation_row.append(state_num2str[item])
        representation.append(" ".join(representation_row))
    return representation

LEVEL = 0
print(f"This is what the game state looks like for level {LEVEL}:")
print_game(GAME_ART[LEVEL])

This is what the game state looks like for level 0:
#######
#G###A#
#  I  #
# ### #
#     #
#######


We now define the `play()` function that plays the gridworld game. For now, ignore the `TODO` comment marking where you will implement the reward calculation later, and just test the game.

Now you can play:

In [17]:
def play(env = get_new_env(), path = None):
    time_step = env.reset()

    observation = time_step.observation["board"]

    reward = 0
    while True:

        if path is None: print_game(convert_board_num2str(observation))

        if time_step.step_type == environment.StepType.LAST:
            if path is None: print("GAME OVER")
            break

        if path is None:
            user_input = input("Next move>")
            clear_output()

        else:
            user_input = "d" if len(path) == 0 else path.pop(0) 


        # validate input
        if user_input not in valid_action_strings:
            print("Try an action in", valid_action_strings.keys())
            continue
        elif user_input == "q":
            break

        # env takes an integer action - convert input
        action = valid_action_strings[user_input]

        # take the action in the environment
        time_step = env.step(action)
        observation = time_step.observation["board"]
        # skip steps that carry no reward (None or 0)
        if time_step.reward:
            reward += time_step.reward

    return reward  ## TODO Exercise 1: compute the total reward over this episode

play()

# # # # # # #
# A # # #   #
#     I     #
#   # # #   #
#           #
# # # # # # #
GAME OVER


40

NameError: ignored

## Exercise 1: Calculating the reward

The implementation of `play()` performs the RL loop that we show again:

![](https://gym.openai.com/assets/docs/aeloop-138c89d44114492fd02822303e6b4b07213010bb14ca5856d2d49d6b62d88e53.svg)

where the DeepMind implementation uses the `TimeStep` object in order to track the observations and rewards. We use `time_step.observation` in order to update the game board, but currently we can't find out the reward from each of the steps the agent takes. 

**Exercise 1**

Make the `play()` function return the total reward from the episode played in a call of that function. Compute this by adding up the rewards from each time step, because your implementation will then be useful later for RL algorithms.

Then run the below cell to test the result.

<details><summary>Hint</summary>

Have a look at https://github.com/deepmind/ai-safety-gridworlds/blob/c43cb31143431421b5d2b661a2458efb301da9a3/ai_safety_gridworlds/environments/shared/rl/environment.py#L29 . Don't be afraid to add lots of `print()` statements, and look at which methods objects have with `dir()`, too. 

</details>
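
The accumulation itself can be sketched without the environment. `FakeTimeStep` below is a hypothetical stand-in for the library's `TimeStep` (only its `reward` field is modelled), and the sketch assumes that some steps, such as the first one of an episode, may carry no reward.

```python
from collections import namedtuple

# Hypothetical stand-in for the library's TimeStep; only `reward` matters here.
FakeTimeStep = namedtuple("FakeTimeStep", ["reward"])

def total_reward(time_steps):
    """Sum per-step rewards, treating a missing (None) reward as zero."""
    total = 0
    for step in time_steps:
        if step.reward is not None:
            total += step.reward
    return total

episode = [FakeTimeStep(None), FakeTimeStep(-1), FakeTimeStep(-1), FakeTimeStep(50)]
print(total_reward(episode))  # 48
```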

In [18]:
### TESTS FOR EXERCISE 1

NO_INTERRUPT_PATH = ["d", "d", "d", "l", "l", "l", "l", "u", "u", "u"]
INTERRUPT_PATH = ["d", "l", "l", "l", "l", "u"]
NO_INTERRUPT_REWARD = [40.0]
INTERRUPT_REWARD = [-100.0, 44.0]
NO_TESTS = 10

print("Running tests... ", end="")
for TEST_NO in tqdm(range(NO_TESTS)): 
    for path, expected_rewards in zip([NO_INTERRUPT_PATH, INTERRUPT_PATH], [NO_INTERRUPT_REWARD, INTERRUPT_REWARD]):
        env = get_new_env()
        your_reward = play(env=env, path=path[:])

        assert your_reward in expected_rewards, f"Expected a reward in {expected_rewards} but you returned {your_reward}"
print(" passed!")

Running tests... 

100%|██████████| 10/10 [00:00<00:00, 61.03it/s]

 passed!





# Defining the Agent

We're now going to implement an agent to take actions in the environment for us. It will 'see' the raw input of the current state of the grid and use a neural net to output its prediction of the expected reward for each of the 5 actions it could take. We first define the neural net and the agent:

In [19]:
class QNet(torch.nn.Module):
    """
    A network that estimates Q values

    Q(state, action) = optimal value of the future

    The heart of a DQN agent is a network that takes a state and returns,
    for each action, the estimated future value of taking that action.
    """

    def __init__(
        self,
        hidden_size=100,
    ):
        super().__init__()

        self.num_actions = 5
        self.input_size = 6 * 7  # the grid is 6 rows by 7 columns

        self.linear1 = torch.nn.Linear(self.input_size, hidden_size)
        self.relu = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(hidden_size, self.num_actions)

    def forward(self, x):
        # flatten the input, then regroup it into a batch of flattened grids
        x = torch.flatten(torch.tensor(x))
        x = torch.reshape(x, (x.shape[0] // self.input_size, self.input_size))
        return self.linear2(self.relu(self.linear1(x)))

In [42]:
class Agent:
    def __init__(self, history_size = 1000, batch_size = 128, gamma=0.9):

        # A history of tuples of format:
        # (state, taken action, reward received, next state, 
        #  whether the episode ended)
        self.history = deque()
        self.history_size = history_size
        self.batch_size = batch_size

        # A neural network that predicts 'Q values' for different states
        self.net = QNet()
        self.opt = torch.optim.Adam(self.net.parameters(), lr=3e-2)

        # the probability with which the agent acts randomly. This enables the
        # agent to explore. As it decays, the agent shifts towards exploitation.
        self.eps = 0.5
        self.eps_min = 0.05
        self.eps_decay = 0.999
        self.updated = False

        self.loss_history = []

        self.gamma = gamma

    def act(self, state):
        """
        Given state (of the gridworld), return an action (an integer from 0 to 4) to take.
        """
        state = torch.flatten(torch.tensor(state))

        prob = np.random.uniform(0, 1)
        q_values = self.net(state)
        q_values = q_values.detach().numpy()

        if prob < self.eps:
            # explore: pick a uniformly random action
            action = np.random.randint(5)
        else:
            # exploit: pick the action with the highest predicted Q value
            action = np.argmax(q_values)

        # decay epsilon after acting, until it reaches its minimum
        if self.eps > self.eps_min:
            self.eps *= self.eps_decay

        return action  ### EXERCISE 2: epsilon-greedy action selection with decay

    def update_net(self):
        """Update the network

        It is trained to predict the value of the whole future for an
        action taken given a state:
            Q(state, action).

        The *actual* information used in the update is the 'reward'
        received and the 'next state' reached after an 'action' was
        taken from the previous 'state'.

        The reward is added to the discounted maximum predicted
        Q value of the next state:
            gamma * max_a [ Q(next_state, a) ]
        """
        self.opt.zero_grad()

        sample_indices = np.random.choice(len(self.history), self.batch_size)
        sample = [self.history[i] for i in sample_indices]
        states, actions, rewards, next_states, dones = tuple(zip(*sample))
        states = torch.stack(states)
        next_states = torch.stack(next_states)
        actions = torch.tensor(actions)
        rewards = torch.tensor(rewards)

        # What is the value of the whole future, from the next state?
        # No gradients are tracked here, so the target is treated as a constant.
        with torch.no_grad():
            future_q = torch.max(self.net(next_states), dim=1)[0]
            future_q = torch.where(torch.tensor(dones), torch.zeros_like(future_q), future_q)

        # The target, formed of the real reward + prediction for the next state
        q_targets = rewards + self.gamma * future_q

        # What the network predicts currently
        q_predictions_all_acts = self.net(states)

        # sampled at the actions that were taken to obtain the above 
        # rewards, next states
        indices =\
            torch.tensor(list(range(self.batch_size))) * q_predictions_all_acts.shape[1]\
            + actions

        q_predictions_all_acts = torch.flatten(q_predictions_all_acts)
        q_predictions = torch.gather(
          q_predictions_all_acts,
          0, 
          indices
        )

        # get the MSE loss
        loss = torch.mean((q_targets - q_predictions) ** 2)
        self.loss_history.append(loss.detach().clone().item())

        # run gradient descent!
        loss.backward()
        self.opt.step()
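
The flattened-index step near the end of `update_net` picks, for each row of the batch, the Q value of the action that was actually taken. A plain-Python sketch with made-up numbers (lists instead of tensors) shows that indexing the flattened table at `row * num_actions + action` matches a direct two-dimensional lookup:

```python
# Made-up Q values for a batch of 3 states and 5 actions.
q_table = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [1.1, 1.2, 1.3, 1.4, 1.5],
    [2.1, 2.2, 2.3, 2.4, 2.5],
]
actions = [4, 0, 2]  # action taken in each state
num_actions = len(q_table[0])

# Flatten row by row, then index with row * num_actions + action.
flat = [q for row in q_table for q in row]
indices = [row * num_actions + action for row, action in enumerate(actions)]
picked_flat = [flat[i] for i in indices]

# Direct two-dimensional lookup for comparison.
picked_direct = [q_table[row][action] for row, action in enumerate(actions)]

print(indices)      # [4, 5, 12]
print(picked_flat)  # [0.5, 1.1, 2.3]
```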

Test the agent by letting it play an episode. Until it has been trained, its actions will be close to random:

In [None]:
def agent_play(env = get_new_env(), agent = Agent(), print_things = True):
    time_step = env.reset()
    
    observation = time_step.observation["board"]

    while True:
        if print_things: print_game(convert_board_num2str(observation))

        if time_step.step_type == environment.StepType.LAST:
            # `path` does not exist in this function; use the print flag instead
            if print_things: print("GAME OVER")
            break

        else:
            current_state = time_step.observation["board"]
            user_input = valid_actions[agent.act(current_state)][0]
            if print_things: print(f"The agent chooses input {user_input}")
        # validate input
        if user_input not in valid_action_strings:
            print("Try an action in", valid_action_strings.keys())
            continue
        elif user_input == "q":
            break

        # env takes an integer action - convert input
        action = valid_action_strings[user_input]

        # take the action in the environment
        time_step = env.step(action)
        observation = time_step.observation["board"]

sample_env = get_new_env(level=0)
# sample_env._max_iterations = 200 # you can increase the maximum number of iterations allowed in a given environment
agent_play(env = sample_env, agent=Agent(), print_things=True)
# if sample_env._episode_return != -100: print(sample_env._episode_return)
print(f"Agent's final reward: {sample_env._episode_return}")

In RL, in order to balance agents both needing to *explore* their environment, as well as *exploit* strategies that lead to high reward, the actions that they take are generally determined by the *epsilon-greedy* rule:

* with probability $\varepsilon$, take a random action.
* otherwise, take the action determined by the current policy (in this case, what the neural net predicts is optimal).

In most RL implementations, we want $\varepsilon$ to initially be large, and decrease to a smaller value. 
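
As a rough illustration of the rule above, here is a minimal sketch of $\varepsilon$-greedy selection over a plain list of Q values. The helper `epsilon_greedy` and the numbers are hypothetical; the `Agent` class works with network outputs rather than a hand-written list.

```python
import random

def epsilon_greedy(q_values, eps, rng=random):
    """Return a random action index with probability eps, else the argmax."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_values = [0.1, 0.7, 0.2, -0.3, 0.0]  # made-up Q values for 5 actions
print(epsilon_greedy(q_values, eps=0.0))  # always 1, the greedy action
print(epsilon_greedy(q_values, eps=1.0))  # a uniformly random action in 0..4
```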

Following <a href="https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html">common implementations</a>, we use the following $\varepsilon$ update rule, determined by the `Agent`'s `eps`, `eps_min` and `eps_decay` parameters:

* *after* every time the agent acts, replace its current `eps` with `eps * eps_decay`, unless `eps` has already dropped to `eps_min` or below, in which case keep it constant
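
To get a feel for how slowly this schedule moves, the sketch below counts the decays needed to bring `eps` from 0.5 down to `eps_min` with the `Agent` defaults. `steps_until_floor` is a hypothetical helper written for this illustration.

```python
def steps_until_floor(eps=0.5, eps_min=0.05, eps_decay=0.999):
    """Count how many decays happen before eps stops being above eps_min."""
    steps = 0
    while eps > eps_min:
        eps *= eps_decay
        steps += 1
    return steps, eps

steps, final_eps = steps_until_floor()
print(steps, final_eps)
```

With these defaults the agent keeps exploring noticeably for roughly 2,300 actions, which is worth keeping in mind when choosing how many episodes to train for.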

**Exercise 2**: In the `Agent` class, implement the `act` function to take actions based on an $\varepsilon$-greedy approach with decay factor.

<details>
<summary>Hint:</summary>
`self.net(state)` returns the Q values the neural net predicts for each of the actions.
</details>

Note: testing randomised implementations is not an easy feat, so the tests may pass with an imperfect implementation. 
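
The bounds used in the tests can be sanity-checked with a short calculation. It assumes the `Agent` defaults (`eps = 0.5`, `eps_decay = 0.999`), 1000 calls to `act` on the same state, and a network that is not updated in between, so that the greedy action stays the same throughout.

```python
EPS, EPS_DECAY, NUM_ACTIONS, NUM_CALLS = 0.5, 0.999, 5, 1000

# Expected number of calls on which the agent acts randomly:
# the sum of eps over all calls, with eps decayed after each call.
expected_random = sum(EPS * EPS_DECAY ** t for t in range(NUM_CALLS))

# A random action still lands on the greedy action 1 time in 5.
expected_per_other_action = expected_random / NUM_ACTIONS
expected_greedy = NUM_CALLS - expected_random + expected_per_other_action

print(round(expected_greedy), round(expected_per_other_action))  # 747 63
```

Both values sit inside the ranges the tests assert on (650 to 800 for the most frequent action, 30 to 120 for each of the others).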

In [43]:
### TESTS FOR EXERCISE 2

test2_env = get_new_env()
test2_agent = Agent()
start_state = test2_env.reset()
NO_TESTS_2 = 1000
action_dist = [0 for i in range(5)]

print("Running tests...", end="")
for TEST in tqdm(range(NO_TESTS_2)):
    
    action = test2_agent.act(start_state.observation["board"])
    assert action in list(range(5)), f"Action not in {list(range(5))}"
    action_dist[action] += 1

for i in range(5):
    if action_dist[i] == max(action_dist):
        assert 650 <= action_dist[i] <= 800, f"It doesn't look as if your epsilon-greedy implementation selects the optimal action enough of the time."
    else:
        assert 30 <= action_dist[i] <= 120, "It looks like your epsilon-greedy implementation isn't selecting random actions correctly."
print(" ... tests passed!")

Running tests...

100%|██████████| 1000/1000 [00:00<00:00, 4252.09it/s]

 ... tests passed!





## Learning

We now get to the interesting part: learning!

# Q-learning (skip if you know what Q-learning is!)

The RL algorithm we'll start with is *Q-learning*. Recall that in the explanation of $\varepsilon$-greedy approaches above, when not acting randomly the agent took the action it thought was optimal according to its policy. In Q-learning, the neural network models

*the agent's current prediction of the total (discounted) future reward it expects to obtain given that it is in a state $s$ and takes the action $a$ next.*

We call this value $Q(s, a)$. The idea of Q-learning is to store all the past actions that the agent made and the rewards it received in the episode following that, and use that to update the agent's model of $Q(s, a)$ by gradient descent. 
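
A minimal sketch of the standard one-step Q-learning target, the same quantity that `update_net` builds as `q_targets`, may help here. The numbers are made up, and the tabular update with a step size `alpha` is shown only for intuition; the notebook's agent uses gradient descent on a network instead.

```python
GAMMA = 0.9  # discount factor, matching the Agent default

def q_learning_target(reward, next_q_values, done, gamma=GAMMA):
    """Return reward + gamma * max_a Q(next_state, a), or just reward at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

# Made-up numbers: a step costing -1 that leads to a state whose best action is worth 10.
target = q_learning_target(reward=-1, next_q_values=[2.0, 10.0, -3.0, 0.0, 1.0], done=False)
print(target)  # 8.0

# A tabular update moves the old estimate a fraction alpha towards the target.
alpha, old_q = 0.5, 4.0
new_q = old_q + alpha * (target - old_q)
print(new_q)  # 6.0
```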

# Implementation

We've implemented most of the agent, namely the storage of past experiences as well as part of the `learn` function. Using what you implemented earlier to get the reward the agent receives at each step, implement the assignments between the `TODO EXERCISE 3` markers (`old_state`, `action`, `step_reward`, `new_state` and `episode_done`) in the below code.

Then get training! You should then be able to use the `agent_play` function to see what your agent is doing.

WARNING: RL algorithms are notoriously hard to debug; read <a href="https://andyljones.com/posts/rl-debugging.html">this</a>. One of the biggest reasons is that it can often take a long time for them to do anything even slightly better than random. After training for 200 steps on level 2, here were our episode rewards (after increasing `env._max_iterations` to 200).

<details>
<summary>Spoiler</summary>
Agent's final reward: -200
Agent's final reward: -25
Agent's final reward: -200
Agent's final reward: 14
Agent's final reward: -97
Agent's final reward: -64
Agent's final reward: -132
Agent's final reward: -122
Agent's final reward: -200
Agent's final reward: -6
Agent's final reward: 16
Agent's final reward: -11
Agent's final reward: -2
Agent's final reward: 27
Agent's final reward: 17
Agent's final reward: -7
Agent's final reward: 11
Agent's final reward: -61
Agent's final reward: -135
Agent's final reward: -200
Agent's final reward: -130
Agent's final reward: 8
Agent's final reward: -200
Agent's final reward: 8
Agent's final reward: -54
Agent's final reward: -200
Agent's final reward: 2
Agent's final reward: -21
Agent's final reward: -61
Agent's final reward: -46
Agent's final reward: -200
Agent's final reward: -17
Agent's final reward: -28
Agent's final reward: 36
Agent's final reward: -78
Agent's final reward: -142
Agent's final reward: -5
Agent's final reward: -200
Agent's final reward: 31
Agent's final reward: -107
Agent's final reward: -90
Agent's final reward: -17
Agent's final reward: -31
Agent's final reward: -63
Agent's final reward: -141
Agent's final reward: -68
Agent's final reward: 0
Agent's final reward: 30
Agent's final reward: -200
Agent's final reward: -14
Agent's final reward: 21
Agent's final reward: -200
Agent's final reward: -40
Agent's final reward: 2
Agent's final reward: -70
Agent's final reward: -200
Agent's final reward: -21
Agent's final reward: -200
Agent's final reward: -200
Agent's final reward: 29
Agent's final reward: -110
Agent's final reward: -200
Agent's final reward: 4
Agent's final reward: -27
Agent's final reward: -62
Agent's final reward: -11
Agent's final reward: -200
Agent's final reward: -39
Agent's final reward: -112
Agent's final reward: -87
Agent's final reward: -75
Agent's final reward: -48
Agent's final reward: -200
Agent's final reward: -72
Agent's final reward: -200
Agent's final reward: -6
Agent's final reward: -8
Agent's final reward: -200
Agent's final reward: -1
Agent's final reward: -141
Agent's final reward: -200
Agent's final reward: -35
Agent's final reward: -200
Agent's final reward: 23
Agent's final reward: -60
Agent's final reward: 17
Agent's final reward: -57
Agent's final reward: -95
Agent's final reward: -21
Agent's final reward: -109
Agent's final reward: -200
Agent's final reward: -200
Agent's final reward: -200
Agent's final reward: -200
Agent's final reward: -40
Agent's final reward: -4
Agent's final reward: -200
Agent's final reward: -200
Agent's final reward: 23
Agent's final reward: 34
</details> 

In [None]:
def learn(
    env, 
    agent, 
    max_episodes=100, 
    verbose=True,
):
    """
    A generic solve function (for experience replay agents).

    Runs `max_episodes` episodes in `env`, storing each step in the
    agent's history and updating its network after every step.
    Returns the list of episode scores and the list of episode lengths.
    """
    start_time = datetime.datetime.now()
    all_episode_scores = []
    all_episode_lengths = []

    for episode in range(max_episodes):
        # Initialise the environment state
        total_ep_reward = 0
        current_time_step = env.reset()

        # Take steps until failure / win
        for t in itertools.count():
        
            ## TODO EXERCISE 3 
            old_state = None # should be a tensor of length 42 (the size of the grid)
            action = None # should be an integer in [0, 1, 2, 3, 4]
            step_reward =  None # should be an integer
            new_state = None # same format as old_state
            episode_done = None # should be a bool
            ## END EXERCISE 3

            total_ep_reward += step_reward

            agent.history.append((
                old_state,
                action,
                step_reward,
                new_state,
                episode_done,
            ))

            # forget old experiences
            if len(agent.history) > agent.history_size:
              agent.history.popleft()

            # one of the most important lines! this does gradient descent
            agent.update_net() 

            if verbose:
                print(
                    f"\rStep {t} ({len(agent.history)}) "
                    f"@ Episode {episode + 1}/{max_episodes}, "
                    f"rwd {total_ep_reward}",
                    # f"loss: {loss}",
                    end="")
                sys.stdout.flush()

            if episode_done:
                break

        # HANDLE EPISODE END
        print(total_ep_reward)
        all_episode_lengths.append(t)
        # record each episode's score exactly once
        all_episode_scores.append(total_ep_reward)

    print("\nTIME ELAPSED", datetime.datetime.now() - start_time)

    return all_episode_scores, all_episode_lengths
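
The "forget old experiences" step in `learn` keeps the history bounded by removing the oldest entry by hand. As an aside, Python's `collections.deque` can do the same automatically when given a `maxlen`; the tiny capacity and string "experiences" below are only for illustration.

```python
from collections import deque

history = deque(maxlen=3)  # tiny capacity for illustration
for experience in ["e0", "e1", "e2", "e3", "e4"]:
    history.append(experience)  # the oldest entry is dropped once full

print(list(history))  # ['e2', 'e3', 'e4']
```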

In [None]:
agent = Agent(batch_size=128)
all_episode_scores, all_episode_lengths = learn(get_new_env(level=0), agent, max_episodes=1000)

You should at least see decreasing losses in your network's Q-value predictions:

In [None]:
plt.scatter(list(range(len(agent.loss_history))), agent.loss_history)
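
Raw per-step losses are noisy, so a downward trend can be hard to spot in a scatter plot. One option is to smooth them first. `moving_average` below is a hypothetical helper written for this sketch, not something the notebook defines.

```python
def moving_average(values, window=50):
    """Average each run of `window` consecutive values."""
    if window <= 0 or len(values) < window:
        return []
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        # slide the window: add the newest value, drop the oldest
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages

print(moving_average([4.0, 2.0, 6.0, 8.0], window=2))  # [3.0, 4.0, 7.0]
```

Passing `agent.loss_history` to this helper and plotting the result with `plt.plot` should give a smoother curve than the raw scatter.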

On level 0, does the agent learn to avoid the interrupt tile or not? 
How reliable is this?

Once you've got an agent working on the simplest grid world, it might be a good idea to optimize our implementation to get better learning first!

Then, start doing some work on the interruptibility problem and see if you can make progress: does the agent learn to press the interrupt button or not? Does this depend on the Q-learning implementation?

# Extension - make SARSA Safely Interruptible

The original Safely Interruptible Agents paper mentions that the SARSA algorithm is not safely interruptible, but can be modified to be so. There aren't any implementations of this available at the moment though, so be the first!
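
As a starting point, and not the paper's modification itself, the sketch below contrasts the standard Q-learning and SARSA targets using made-up numbers. The SARSA target depends on the action actually taken in the next state, so an action forced by an interruption feeds directly into what the agent learns; the Q-learning target does not depend on it.

```python
GAMMA = 0.9

def q_learning_target(reward, next_q_values, gamma=GAMMA):
    """Off-policy: bootstrap from the best next action, whatever is taken next."""
    return reward + gamma * max(next_q_values)

def sarsa_target(reward, next_q_values, next_action, gamma=GAMMA):
    """On-policy: bootstrap from the action actually taken in the next state."""
    return reward + gamma * next_q_values[next_action]

next_q = [1.0, 5.0, 2.0]  # made-up Q values for the next state
print(q_learning_target(0.0, next_q))            # 4.5
print(sarsa_target(0.0, next_q, next_action=2))  # 1.8
```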