# [3.4] LLM Agent Evaluations

# Setup (don't read just run)

In [110]:
import json
import os
# os.chdir("c:\\Users\\styme\\OneDrive\\Documents\\AI STUFF\\Model Written Evals\\Code Replication\\ARENA_evals\\curriculum")
import wikipedia
from wikipedia import WikipediaPage
from wikipedia import DisambiguationError, PageError
from openai import OpenAI
from openai.types.chat.chat_completion_message_tool_call import ChatCompletionMessageToolCall
from openai.types.chat.chat_completion_message import ChatCompletionMessage
from anthropic import Anthropic
from utils import establish_client_OpenAI
from utils import retry_with_exponential_backoff
from pprint import pprint
from typing import Literal, Optional, Dict, List, Any
from abc import ABC, abstractmethod
import math
from inspect_ai.model import ChatMessageUser, ChatMessageAssistant, ChatMessageSystem
import re
from utils import countrylist
from utils import evaluate_expression



# 1️⃣ Intro to LLM Agents

**Resources**:

- [OpenAI Function Calling Guide](https://platform.openai.com/docs/guides/function-calling)

- [Anthropic Function Calling Guide](https://docs.anthropic.com/en/docs/build-with-claude/tool-use)

- [Evaluating Language-Model Agents on Realistic Autonomous Tasks](https://evals.alignment.org/Evaluating_LMAs_Realistic_Tasks.pdf) (Kinniment et al., ARC Evaluations Team (now METR), 2023)

- [Large Language Models can Strategically Deceive their Users when Put Under Pressure](https://arxiv.org/pdf/2311.07590) (Scheurer et al., Apollo Research, ICLR 2024)

- [LLM Powered Autonomous Agents](https://lilianweng.github.io/posts/2023-06-23-agent/) (Lilian Weng, OpenAI Safety Team, 2023)

- [AXRP Episode 34 - AI Evaluations with Beth Barnes](https://www.alignmentforum.org/posts/vACr4DExfeRMaCoo7/axrp-episode-34-ai-evaluations-with-beth-barnes) (Daniel Filan, 2024)

- [Reflexion: Language Agents with Verbal Reinforcement Learning](https://arxiv.org/pdf/2303.11366) (Shinn et al., 2023)

- [Answering Questions by Meta-Reasoning over Multiple Chains of Thought](https://arxiv.org/pdf/2304.13007) (Yoran et al., 2024)

- [Toolformer: Language Models Can Teach Themselves to Use Tools](https://arxiv.org/pdf/2302.04761) (Schick et al., Meta AI Research, 2023)


## What is an LLM agent?

Points to make:
- "LLM agent" is a glorified for-loop
    - Define scaffolding
- More schematic breakdown of possible scaffolding: "tools" (describe what this means, what "tool calling" is), "memory"
- Examples of prominent LLM agents:
    - Minecraft agent
    - ...


LLM agents are AI systems that use a language model to decide on and carry out actions in pursuit of a task.

An LLM agent interacts with tasks through a cycle of:

1. Receiving input (task description, current state, etc.)
2. Processing the input using the language model
3. Deciding on an action
4. Executing the action
5. Observing the results
6. Repeating the cycle until the task is complete
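
The cycle above can be sketched as a plain loop. This is only an illustration: `run_agent_loop`, `toy_decide` and `toy_execute` are hypothetical names, and the "model" here is a hard-coded function standing in for a real LLM call.

```python
from typing import Callable, List


def run_agent_loop(
    decide: Callable[[List[str]], str],
    execute: Callable[[str], str],
    task_description: str,
    max_steps: int = 5,
) -> List[str]:
    """Run the receive -> decide -> act -> observe cycle until 'DONE' or max_steps."""
    observations = [task_description]      # 1. receive input
    for _ in range(max_steps):
        action = decide(observations)      # 2-3. process the input and decide on an action
        if action == "DONE":               # 6. stop once the task is complete
            break
        result = execute(action)           # 4. execute the action
        observations.append(result)        # 5. observe the result
    return observations


# Stand-in for the language model: keep incrementing until the last observation is "3".
def toy_decide(observations: List[str]) -> str:
    return "DONE" if observations[-1] == "3" else "increment"


counter = {"value": 0}


def toy_execute(action: str) -> str:
    counter["value"] += 1
    return str(counter["value"])


trace = run_agent_loop(toy_decide, toy_execute, "Count to 3")
print(trace)  # ['Count to 3', '1', '2', '3']
```

In a real agent, `decide` would wrap an API call to the model and `execute` would dispatch to tools, but the control flow stays the same.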


An LLM agent consists of four main components:

- A 'reasoner' or 'reasoning engine.' (Some people also call this a 'world model'). For LLM agents this is a large language model.

- Tools which allow the agent to act in the environment.

- Memory so that the agent can recall prior actions. This can either be:

    - Short-term memory: In the context of LLM agents this is generally the context window

    - Long-term memory: There are many cases where context-windows are too short, and we will need to give the agent high-level information about actions it took a long time ago. There are many methods to store this 'long-term memory' for agents (see some methods [here])

- Scaffolding: This is essentially any structure which we provide to the 'reasoning engine' in order to help it to reason better, such as:

    - Prompting frameworks.

    - The ability to trial plans into the future.

    - Using subagents to take care of subtasks.

    - Subgoal decomposition.
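
As a rough illustration of the short-term memory point above, one simple (and lossy) way to keep a chat history inside the context window is to keep the system messages plus only the most recent messages. The function name `trim_history` and the message-count budget are assumptions made for this sketch; a real agent would more likely budget by tokens, and would need to take care not to separate tool results from the tool calls they answer.

```python
from typing import Dict, List


def trim_history(history: List[Dict[str, str]], max_messages: int) -> List[Dict[str, str]]:
    """Keep all system messages plus the most recent `max_messages` other messages."""
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    if max_messages <= 0:
        return system
    return system + rest[-max_messages:]


history = [
    {"role": "system", "content": "You are a helpful agent."},
    {"role": "user", "content": "a"},
    {"role": "assistant", "content": "b"},
    {"role": "user", "content": "c"},
]
trimmed = trim_history(history, max_messages=2)
print([m["content"] for m in trimmed])  # ['You are a helpful agent.', 'b', 'c']
```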

EXCALIDRAW!

## Why Evaluate LLM Agents?

Points to make - "Why evaluate LLM agents":
- Models often fail in easily-fixable ways
    - examples
- How much capability can be unleashed by simple fixes? We want to gather evidence against the possibility that, with simple tricks, a relatively weak AI model like ChatGPT can do far more than we previously thought. Maybe we are only using a small fraction of the model's capabilities, and the model could achieve a difficult hacking task if it could effectively direct all of its capabilities at it. If this were true, it would significantly change our assessment of what models are capable of and which models are safe. LLM agent evals scaffold the model to overcome simple failures and to elicit the maximum possible capability from it. If we tried "very hard" to build LLM agents and elicit capabilities for "a long time" and failed, that would be evidence against this possibility. However, it is also true that we can never be certain. More importantly, quantify 
    - Apollo "Science of Evals" 
    - GDM quantify bits


Many of our threat models for the future harms of AI systems go through agentic behavior. If we knew that chatbots were only ever capable of simulating continuations of text in their context window, we'd still be worried about them, but significantly less so. However, we know today that this is not the case. In fact, since the release of ChatGPT, the use of LLMs as reasoning engines for agentic systems has proliferated significantly. See [AutoGPT](https://autogpt.net/) and [LangChain](https://www.langchain.com/). These agents started off rather disappointingly when they were based on GPT-3.5; however, as more powerful LLMs come out, and AI companies ensure their LLMs are better at tool use, these agents will continue to improve. 

There are three main concerns for LLM agents that we want to mitigate:

- Their capabilities may be significantly greater than those of the base LLM (especially when augmented with tool use).

- There are many possible improvements for increased performance from LLM agents, and these improvement methods are often significantly cheaper and easier to implement than training the base model.

- Current fine-tuning and RLHF/Constitutional AI methods are mostly targeted towards chatbot-style text output. We aren't as confident about how such methods will generalize to agentic scenarios.

The first two issues here relate to the **capabilities** of LLM agents, and the last issue relates to the **alignment** properties of LLM agents. The agent we'll be building will be testing for the **capability** properties of agents.


# 2️⃣ Build a simple LLM agent

We will start by building a simple LLM agent that solves arithmetic problems. LLMs struggle with arithmetic, but we can drastically improve their performance by providing a simple calculation tool. We'll try the model with and without tools on this task, and see how significantly performance improves.

To build this, we will implement three things:
- The `ArithmeticTask` class handles arithmetic problem generation and solution verification.
- The `SimpleAgent` class handles interacting with the LLM API, doing the calculation, and keeping track of the overall task progress.
- The `agent_loop()` function defines the interaction loop between the task and the agent to execute the task.

In general, ... [description of how to think about designing task and agent in generation, include decision factors]



## Defining the Task

### Exercise - Build a simple arithmetic problem
```c
Difficulty: 🔴🔴🔴⚪⚪
Importance: 🔵🔵⚪⚪⚪

You should spend up to 20-25 minutes on this exercise.
```

In LLM agent eval, there will usually be a `Task` class, which interacts with the `Agent`. In general, this:
- prepares and gives the task instruction (and necessary files, functions etc) to the agent
- formats/processes and scores the agent's output
- updates the task state accordingly (e.g. proceeds onto the next step of the task, ends the task)

We will build a toy task called `ArithmeticTask`. This task takes in two numbers and creates a list of arithmetic calculation problems with these two numbers, using the arithmetic operations defined in `operations`. It should have methods to do the following:
- get the current problem
- update the current problem
- check if a given answer is correct
- check if all problems have been solved
- a `calculate()` method that takes in a string (e.g. "5 + 3") and outputs the result


<details><summary>Aside: Handling calculations</summary><br> When we handle the calculations for the model, technically we could use Python's <code>eval()</code> function (this is what <a href = "https://github.com/anthropics/anthropic-cookbook/blob/main/tool_use/calculator_tool.ipynb">Anthropic did</a>(!)). However, this function evaluates an arbitrary string expression, and so allows AI models to run arbitrary code. In the long-run, we're trying to do these evaluations on models which we suspect of being dangerous; so even though we could probably trust the current suite of language models offered by OpenAI and Anthropic, we should get into good habits of not running arbitrary code outputted by language models (except in very carefully set-up environments). To this end, we've implemented an <code>evaluate_expression</code> function for you to use instead.</details>
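
The provided `evaluate_expression` is not shown in this notebook. To make the idea concrete, here is a minimal sketch of how a restricted arithmetic evaluator could be written with Python's `ast` module. The name `safe_arithmetic_eval` and the whitelist of operators are assumptions for illustration, not the course's actual implementation.

```python
import ast
import operator

# Whitelist of permitted operators; any other syntax is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
    ast.FloorDiv: operator.floordiv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_arithmetic_eval(expression: str):
    """Evaluate an arithmetic expression without executing arbitrary code."""

    def _eval(node):
        # Plain numeric literals (bools are excluded even though they subclass int)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)) \
                and not isinstance(node.value, bool):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, function calls, attribute access etc. all end up here
        raise ValueError(f"Unsupported expression element: {type(node).__name__}")

    tree = ast.parse(expression, mode="eval")
    return _eval(tree.body)


print(safe_arithmetic_eval("1500 * 1091"))  # 1636500
print(safe_arithmetic_eval("10 // 15"))     # 0
```

With this approach an input such as `__import__('os')` is parsed but never executed: it is a function call node, which is not on the whitelist, so the sketch raises `ValueError`.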


In [43]:
class ArithmeticTask:
    def __init__(self, num1: int | float, num2: int | float):
        self.num1 = num1
        self.num2 = num2
        self.operations: List[str] = ["+", "-", "*", "/", "%", "//"]
        self.correct_answers: Dict[str, float] = self._generate_answers()
        self.is_solved: Dict[str, bool] = {expr: False for expr in self.correct_answers}
        self.current_task_number = 0

    def _generate_answers(self) -> Dict[str, float]:
        return {
            f"{self.num1} {op} {self.num2}": eval(f"{self.num1} {op} {self.num2}")
            for op in self.operations
        }

    def get_current_task(self) -> str:
        """
        Returns the current task as a string
        """
        return f"{str(self.num1)} {self.operations[self.current_task_number]} {str(self.num2)}"

    def get_instructions(self) -> str:
        """
        Returns a string containing initial task instructions for the agent.
        """
        return f"Calculate the result of the following expression: {self.get_current_task()}. Use tools and give your final answer in the format: <answer>NUMBER</answer>, where NUMBER is a numerical value"

    def check_solved(self) -> bool:
        """
        Returns True if all tasks are solved, False otherwise
        """
        return all(self.is_solved.values())

    def check_answer(self, model_answer: str) -> bool:
        """
        Returns True if the model_answer is correct, False otherwise
        """

        correct_answer = self.correct_answers[self.get_current_task()]
        return math.isclose(float(model_answer), correct_answer, rel_tol=1e-5, abs_tol=1e-8)
        

    def update_current_task(self) -> None:
        """
        Sets the current task as solved and updates the current_task_number by one
        """
        self.is_solved[self.get_current_task()] = True
        self.current_task_number = (self.current_task_number + 1) % len(self.operations)


    @staticmethod
    def calculate(expression: str) -> str:
        """
        Evaluates the string expression in Python using `evaluate_expression()` and returns the result as a string
        """
        print("Expression received:", expression)
        try:
            return str(evaluate_expression(expression))
        except (SyntaxError, NameError, ZeroDivisionError) as e:
            return f"Error: {str(e)}"


x = ArithmeticTask(10, 15)
for problem, answer in x.correct_answers.items():
    print(f"{problem} = {answer}") 

10 + 15 = 25
10 - 15 = -5
10 * 15 = 150
10 / 15 = 0.6666666666666666
10 % 15 = 10
10 // 15 = 0


<details><summary> Aside - What is a @staticmethod?</summary>

The `@staticmethod` decorator in Python is used to define a static method within a class. It does the following:
1. It tells Python that the method doesn't need access to the instance or class.
2. It allows you to call the method on the class itself, rather than on an instance of the class.
3. It can be called on instances too, but it doesn't have access to the instance's data.

Comparing this to regular instance methods:

1. Regular instance methods:
   - These are the most common type of methods in a class.
   - They automatically receive the instance (usually called `self`) as the first argument.
   - They can access and modify the instance's attributes.

2. Static methods:
   - These are methods that belong to the class rather than any specific instance.
   - They don't receive the instance (`self`) or the class (`cls`) as an implicit first argument.
   - They don't access or modify instance-specific data.

In our `ArithmeticTask` class, the `calculate` method is defined as a static method:

```python
@staticmethod
def calculate(expression: str) -> str:
    """Evaluates the string expression and returns the result as a string."""
    try:
        return str(evaluate_expression(expression))
    except (SyntaxError, NameError, ZeroDivisionError) as e:
        return f"Error: {str(e)}"
```

This method doesn't need access to any instance-specific data. It takes a string expression, evaluates it, and returns the result. It doesn't use `self` or any instance attributes, so it makes sense to define it as a static method.

You can call this method in two ways:

1. On the class itself:
   ```python
   result = ArithmeticTask.calculate("2 + 3")
   ```

2. On an instance of the class:
   ```python
   problem = ArithmeticTask(10, 15)
   result = problem.calculate("2 + 3")
   ```

Both calls will work the same way because the method doesn't depend on any instance-specific data.

Using `@staticmethod` in this case:
1. Makes the code's intent clearer (this method doesn't need instance data).
2. Avoids passing an unused `self` argument to the method.
3. Allows the method to be used without creating an instance of the class.

</details>

## Function Calling

**Function calling** is a feature of LLM Chat APIs that allows the LLM to use external "tools" (i.e. Python functions, APIs) by simply receiving and outputting text. There are five simple steps to function calling:

1. Pick a function in your codebase that the model should be able to call (in this case, we will pick the function `calculate()` from our task class)
2. Describe your function to the model (following the syntax of the model's API) so it knows how to call it
3. Pass your function definitions as available “tools” to the model, along with the messages (following the syntax of the model's API)
4. Receive and handle the model response
5. Provide the function call result back to the model 

Chat models like ChatGPT and Claude are fine-tuned to recognize and respond to `tool` descriptions appropriately (just like `user` and `system` messages). In this way, you can make LLMs do complex actions like running code, making calls to other APIs, or manipulating files: you parse their text output, execute the requested actions yourself, and feed the results back into the model so it can reason about them and take the next step. This loop is the simplest version of an LLM agent, and more advanced LLM agents follow the same logic (with more advanced tools, more complex class structures to take more actions autonomously, etc.). We can see now why an LLM agent is in essence a "glorified for-loop". 
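
Steps 4 and 5 can be tried offline, without calling any API. The sketch below is illustrative only: `fake_tool_call` is a hand-built stand-in that mimics the shape of a tool call object (an `id` plus a `function` with a `name` and JSON-encoded `arguments`), and the toy `calculate` only handles additions of the form `a + b`.

```python
import json
from types import SimpleNamespace


def calculate(expression: str) -> str:
    # Toy stand-in tool: only handles "a + b" so the sketch stays self-contained.
    left, right = expression.split("+")
    return str(float(left) + float(right))


# Explicit registry of the functions the model is allowed to call.
TOOLS = {"calculate": calculate}

# Hypothetical stand-in for the tool call object found in an API response.
fake_tool_call = SimpleNamespace(
    id="call_123",
    function=SimpleNamespace(name="calculate", arguments='{"expression": "2 + 3"}'),
)


def handle_tool_call(tool_call) -> dict:
    """Step 4: run the requested function. Step 5: format the result as a tool message."""
    func = TOOLS[tool_call.function.name]                 # look up the requested tool
    arguments = json.loads(tool_call.function.arguments)  # arguments arrive as a JSON string
    result = func(**arguments)
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "name": tool_call.function.name,
        "content": result,
    }


tool_message = handle_tool_call(fake_tool_call)
print(tool_message["content"])  # 5.0
```

Looking tools up in an explicit dictionary, rather than by attribute name on an object, is one way to make sure the model can only trigger functions you have deliberately exposed.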

[DIAGRAM]


### Exercise - Write a tool description for function calling
```c
Difficulty: 🔴🔴⚪⚪⚪
Importance: 🔵🔵🔵⚪⚪

You should spend up to 10-15 minutes on this exercise.
```

We will write a description of the `calculate()` function (step 2), so we can give it to our LLM agent as a tool via function calling. The syntax may differ between APIs (e.g. OpenAI API has a different syntax than Anthropic API). Read OpenAI's [function calling guide](https://platform.openai.com/docs/guides/function-calling) to learn the syntax. Before writing the tool description, first read Anthropic's examples of what good and bad tool calling looks like [here](https://docs.anthropic.com/en/docs/build-with-claude/tool-use#example-of-a-good-tool-description). The key as always with LLMs is to be explicit and give examples.

In [114]:
tool_list = [
    {
        "type" : "function",
        "function" : {
            "name" : "calculate",
            "description" : "Calculates the result of an arithmetic expression. For example, you could provide an input in the form \"2+3\" and the function would return 5. Or you could provide an expression like \"10/3\" and the function would return 3.3333333333333335.",
            "parameters" : {
                "type" : "object",
                "properties" : {
                    "expression" : {
                        "type" : "string",
                        "description" : "The arithmetic expression that you want to be evaluated."
                    }
                },
                "required" : ["expression"],
                "additionalProperties" : False
            }
        }
    },
]

In [None]:
class Tool(ABC):
    
    @staticmethod
    @abstractmethod
    def execute(input: str) -> str:
        ...

    def description(self) -> str:
        ...

class CalculateTool(Tool):
    definition = {
        "type" : "function",
        "function" : {
            "name" : "calculate",
            "description" : "Calculates the result of an arithmetic expression. For example, you could provide an input in the form \"2+3\" and the function would return 5. Or you could provide an expression like \"10/3\" and the function would return 3.3333333333333335.",
            "parameters" : {
                "type" : "object",
                "properties" : {
                    "expression" : {
                        "type" : "string",
                        "description" : "The arithmetic expression that you want to be evaluated."
                    }
                },
                "required" : ["expression"],
                "additionalProperties" : False
            }
        }
    }

    @staticmethod
    def execute(expression: str) -> str:
        """
        Evaluates the string expression in Python using `evaluate_expression()` and returns the result as a string
        """
        print("Expression received:", expression)
        try:
            return str(evaluate_expression(expression))
        except (SyntaxError, NameError, ZeroDivisionError) as e:
            return f"Error: {str(e)}"

Here are three helper functions that will be useful when designing agents (TODO: add to Day 2 Intro to API calls). They take message content (and, for tool results, the `tool_call` object) and return a message dict which can be appended directly to the chat history. If you're unsure about any aspect of them, refer back to OpenAI's API documentation.

In [34]:
def apply_tool_call_format(tool_call : ChatCompletionMessageToolCall, content : str) -> dict:
    return {
        "role" : "tool",
        "tool_call_id" : tool_call.id,
        "name" : tool_call.function.name,
        "content" : content
    }

def apply_user_format(content : str) -> dict:
    return {
        "role" : "user",
        "content" : content
    }

def apply_assistant_format(content : str) -> dict:
    return {
        "role" : "assistant",
        "content" : content
    }

### Exercise - Implement tool call and handle model response

### Exercise - Return tool call results to the model

In [None]:
# Test your tool here

# 


## Building the Agent


Most LLM agents share these core components:
1. LLM API interface: A basic function (e.g. `get_response()`) that makes the API calls to the LLM and returns its responses.
2. Actions: A set of actions (i.e. functions) the agent can take.
3. Task State Management: Keeping track of the current state of the task and any relevant context.
4. Memory: A system for storing and retrieving relevant information from past interactions (i.e. chat history). The simplest implementation is usually a `self.history` attribute that stores a list of past chat messages.
5. Observation Parser: Functions to parse and interpret the results of actions and update the state.
6. Decision/Execution Logic: The rules or algorithms used to choose actions based on the current state and LLM output.
7. Task-Specific Information: Any additional information or rules specific to the task at hand.



### Exercise - Implement a simple LLM agent
```c
Difficulty: 🔴🔴🔴🔴⚪
Importance: 🔵🔵🔵🔵🔵

You should spend up to 20-25 minutes on this exercise.
```

Build out the following simple agent class by filling in `get_response()` and `execute_tool_calls()` functions.

In [254]:
class SimpleAgent(ABC):
    
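    def __init__(self, model: Literal['gpt-4o-mini'] = "gpt-4o-mini", tools: Optional[List[dict]] = None, history: Optional[List[dict]] = None, task = None):
        self.model = model
        self.tool_descriptions = tools
        self.client = OpenAI()
        self.task = task
        # Create a fresh list per instance; a mutable default argument would be shared across agents
        self.history = history if history is not None else []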

    @retry_with_exponential_backoff
    def get_response(self, use_tool: bool = True) -> ChatCompletionMessage:
        """
        Get the response from the model via an API call, with the option of tool calling.
        """
        response = self.client.chat.completions.create(
            model=self.model,
            messages=self.history,
            tools=self.tool_descriptions if use_tool else None,
            tool_choice="auto" if use_tool else None
        )
        return response.choices[0].message

    def execute_tool_calls(self, message: ChatCompletionMessage) -> List[str]:
        """
        Execute the tool calls in the message and return a list of tool_responses.
        """
        tool_calls = message.tool_calls
        print("\nModel tool calls:\n", tool_calls)
        
        tool_responses = []
        for tool_call in tool_calls:
            if not self.task:
                raise ValueError("Task is not set. Cannot execute tool calls.")
            func = getattr(self.task, tool_call.function.name)
            arguments = json.loads(tool_call.function.arguments)
            tool_response = func(**arguments)
            tool_responses.append(tool_response)
            
        return tool_responses

    def run(self, task: Any, with_tool: bool = True):
        """
        Default implementation of run method.
        This can be overridden in subclasses for specific behavior.
        """
        self.task = task
        print(f"Running SimpleAgent with task: {task}")

In [132]:
my_simple_agent = SimpleAgent()
my_simple_agent.run(ArithmeticTask(10, 15))

# Try execute the tool calls



Running SimpleAgent with task: <__main__.ArithmeticTask object at 0x136ec76e0>


In [250]:
class ArithmeticAgent(SimpleAgent):
    """
    ArithmeticAgent class for doing simple arithmetic tasks.

    Inherits from SimpleAgent and includes the following attributes and methods:

    Attributes:
        model (str): The model used for generating responses (inherited)
        tool_descriptions (List[dict]): List of tool descriptions (inherited)
        client (OpenAI): OpenAI client for API calls (inherited)
        task (Any): The current task being executed (inherited)
        num_tries (int): Number of tries allowed for each task (set in this class)
        history (List[dict]): History of interactions (inherited)

    Methods:
        get_response(use_tool: bool = True) -> ChatCompletionMessage: 
            Get response from the model (inherited)
        execute_tool_calls(message: ChatCompletionMessage) -> List[str]: 
            Execute tool calls from the model's response (inherited)
        run(task: ArithmeticTask, tools: list, with_tool: bool) -> None: 
            Run one loop of the arithmetic agent
    """

    def __init__(self, model: Literal['gpt-4o-mini'] = "gpt-4o-mini", tools: list = tool_list):
        super().__init__(model=model, tools=tools)

        # Add any ArithmeticTask-specific initialization here
        self.num_tries = 3
        
    
    def run(self, task: ArithmeticTask, tools: list, with_tool: bool):
        """Run one loop of the agent, which involves:
            - getting a task
            - getting a response from the model
            - handling the model response, including tool calls, refusals, no tool calls, parsing and checking final answers, errors.
            - managing memory: storing the history of messages to self.history
            - managing task state: staying on the same task or moving to the next task at the end of the loop
            """

        self.task = task
        
        # Get a task instruction
        instruction = self.task.get_instructions()
        print("\nInstruction:", instruction)
        self.history.append(apply_user_format(instruction))

        # Get the response from the model
        response = self.get_response(use_tool = with_tool)

        # Handle the response
        ## If tool calls, do the tool calls and return the response 
        if response.tool_calls:
            
            # Append the original function calls to the conversation
            self.history.append(response)

            # Execute the tool calls
            tool_responses = self.execute_tool_calls(response)

            # Handle tool responses
            for tool_call, tool_response in zip(response.tool_calls, tool_responses):
                self.history.append(apply_tool_call_format(tool_call, tool_response))
            
            # Call the model again so it can answer the question using the tool response
            response = self.get_response(use_tool = with_tool)
            self.history.append(apply_assistant_format(response.content))
            print("\nModel response:", response.content)
            
            # Check the answer
            try:
                model_answer = self.parse_answer(response)
                
                if self.task.check_answer(model_answer):
                    # Feedback comes from the user side of the conversation
                    self.history.append(apply_user_format("Correct."))
                    print("\nUser: Correct.")
                    # Update to the next task and reset the retry budget
                    self.task.update_current_task()
                    self.num_tries = 3
                else:
                    self.history.append(apply_user_format("Incorrect."))
                    print("\nUser: Incorrect.")
                    # Retry the task until the retry budget runs out
                    self.num_tries -= 1
                    if self.num_tries == 0:
                        # Give up on this task: move to the next one and reset the retry budget
                        self.task.update_current_task()
                        self.num_tries = 3

            # Ends the task if there's an error parsing the model answer
            except Exception as e:
                print("\nError parsing model answer:", e)
                raise             
        
        ## If no tool call: Handle edge cases
        ### Check if there's a refusal to answer:
        elif response.refusal:
            print("\nModel Refusal:", response.refusal)
            self.history.append(apply_assistant_format(response.refusal))
            # Go to next task
            self.task.update_current_task()
        
        ### Otherwise the model responded directly to the user without making a tool call
        ### (note: finish_reason lives on the Choice object, not on the message returned by get_response)
        else:
            # Record the model's reply before adding the user's follow-up
            self.history.append(apply_assistant_format(response.content))
            self.history.append(apply_user_format("You did not use the tool to answer the question. Please use the tool to answer the question."))
            print("\nModel response:", response.content)
            print("\nUser: You did not use the tool to answer the question. Please use the tool to answer the question.")


    def parse_answer(self, message: ChatCompletionMessage) -> Optional[float]:
        """
        Extract the numerical answer from the string output of the model.
        Returns None if the output contains no <answer> tag.
        """
        response = message.content
        if response and "<answer>" in response:
            startpoint = response.find("<answer>") + 8
            endpoint = response.find("</answer>")
            return float(response[startpoint:endpoint])
        return None
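
An alternative to slicing on tag positions is a regular expression, which makes it easier to tolerate whitespace inside the tags and to reject non-numeric content. This is only a sketch: `extract_answer` and `ANSWER_PATTERN` are hypothetical names, and the pattern deliberately accepts only plain decimal numbers (so an answer written as `1,636,500` would not match).

```python
import re
from typing import Optional

# Matches e.g. <answer>2591</answer>, <answer> -5 </answer>, <answer>0.66</answer>
ANSWER_PATTERN = re.compile(r"<answer>\s*(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*</answer>")


def extract_answer(text: Optional[str]) -> Optional[float]:
    """Return the number inside <answer>...</answer>, or None if there is no valid answer."""
    if not text:
        return None
    match = ANSWER_PATTERN.search(text)
    return float(match.group(1)) if match else None


print(extract_answer("The result is <answer>2591</answer>"))  # 2591.0
print(extract_answer("I could not work this out."))           # None
```

Returning `None` instead of raising lets the calling code decide whether a missing answer should count as a failed attempt or end the task.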


### Exercise - Execute the task via an agent_loop 
```c
Difficulty: 🔴⚪⚪⚪⚪
Importance: 🔵🔵🔵🔵⚪

You should spend up to 5-10 minutes on this exercise.
```

Try implementing the agent_loop below with and without tools, to see how much better the model does when we give it tools.

In [129]:
task = ArithmeticTask(1500,1091)
agent = ArithmeticAgent()

def agent_loop(num_loops : int = 10):
  
    for i in range(num_loops):
        if not task.check_solved():
            agent.run(task, tool_list, with_tool = True)
        else:
            print("\nAll tasks solved.")
            break

agent_loop()



Instruction: Calculate the result of the following expression: 1500 + 1091. Use tools and give your final answer in the format: <answer>NUMBER</answer>, where NUMBER is a numerical value

Model tool calls:
 [ChatCompletionMessageToolCall(id='call_q7dHT97vxHgjPiB4Ndk18Yx5', function=Function(arguments='{"expression":"1500 + 1091"}', name='calculate'), type='function')]
Expression received: 1500 + 1091

Model response: <answer>2591</answer>

User: Correct.

Instruction: Calculate the result of the following expression: 1500 - 1091. Use tools and give your final answer in the format: <answer>NUMBER</answer>, where NUMBER is a numerical value

Model tool calls:
 [ChatCompletionMessageToolCall(id='call_gW4JrJE5mfh22OQR2L0QSX5S', function=Function(arguments='{"expression":"1500 - 1091"}', name='calculate'), type='function')]
Expression received: 1500 - 1091

Model response: <answer>409</answer>

User: Correct.

Instruction: Calculate the result of the following expression: 1500 * 1091. Use 

We can print all the messages in the agent's `history` as follows:

In [None]:
for message in agent.history:
    # The history mixes plain dicts with ChatCompletionMessage objects
    content = message["content"] if isinstance(message, dict) else message.content
    print(content)

# 3️⃣ Building a More Complex Task: Wiki Game

Now that we know how to do function calling and how to design an LLM agent in general, we will build a more complicated task. This task won't be instantly solvable by LLMs with simple tool use and will require us to elicit better capabilities from models.

The task we will build and elicit behavior for will be the [Wikipedia Game](https://en.wikipedia.org/wiki/Wikipedia:Wiki_Game): Players use wiki-links to travel from one Wikipedia page to another and the first person who reaches the destination page wins the race. This is not directly related to any dangerous capabilities, and if GPT-(N+1) could do this task, but GPT-N couldn't, we wouldn't tell OpenAI to be particularly careful about the release of GPT-(N+1) as a result. However, it makes a useful test case for elicitation methods, since there are many strategies for deciding what path to take and we can create a scale of difficulty by choosing different articles to navigate to/from.


Description of Goal (MVP)
EXCALIDRAW! (describing wikipedia game.)

## Quick Intro to the Wikipedia API


Our agent will interact with Wikipedia by making tool calls to the [Wikipedia API](https://wikipedia.readthedocs.io/en/latest/quickstart.html), which is simple to use. We will only need to learn the following key functions for the game. 

1. `wikipedia.page` - Returns a Wikipedia page object, which contains various attributes and methods to access page content. (See [page docs](https://wikipedia-api.readthedocs.io/en/latest/API.html#wikipediapage) for these attributes.)
2. `wikipedia.page.title` - Returns the title of the page.
3. `wikipedia.page.content` - Returns the full text content of the page. This can be very long, so take snippets where you can to avoid using up the LLM's context window.
4. `wikipedia.page.summary` - Returns a summary of the page (i.e. all the text before the first section of the page).
5. `wikipedia.page.links` - Returns a list of all links on the page, as strings.

Keyword arguments for `wikipedia.page` (both default to `True`):
- `auto_suggest` - let Wikipedia find a valid page title for the query
- `redirect` - allow redirection without raising RedirectError

Refer to the [docs](https://wikipedia-api.readthedocs.io/en/latest/API.html#) for more information. 
Now run the following code to see how this works!

In [72]:
# Retrieve a Wikipedia page
page = wikipedia.page("Python (programming language)")

# Access basic page information
print("Title:", page.title)
print("URL", page.url)
print(f"\nSummary (word count {len( page.summary.split())}):", page.summary)
print(f"\nContent (word count {len( page.content.split())}):", page.content[:500], "......")
print(f"\nLinks (link count {len( page.links)}):", page.links[:50], "......")

Title: Python (programming language)
URL https://en.wikipedia.org/wiki/Python_(programming_language)

Summary (word count 135): Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.
Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library.
Guido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language and first released it in 1991 as Python 0.9.0. Python 2.0 was released in 2000. Python 3.0, released in 2008, was a major revision not completely backward-compatible with earlier versions. Python 2.7.18, released in 2020, was the last release of Python 2.
Python consistently ranks as one of the most popular programming langu

Now run the following cells. The first two should raise different errors, while the third succeeds:

In [79]:
page = wikipedia.page("Python")

In [None]:
page = wikipedia.page("Animalss", auto_suggest=False)

In [84]:
page = wikipedia.page("Animalss", auto_suggest=True)
page.title
# With auto_suggest=True (the default), Wikipedia suggests a valid title for the misspelt query

'Animal'

The first call above raises a `DisambiguationError` because the title "Python" can correspond to multiple pages, and the second raises a `PageError` because no page is titled "Animalss". We have implemented a simple function `get_page` for you, which gets the page object for a particular page title. It handles `DisambiguationError` (when a page title could refer to multiple pages, we choose the first option) and `PageError` (when a page is not found, we retry with `auto_suggest=True` so that Wikipedia can suggest a valid title).

In [90]:
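def get_page(title: str) -> WikipediaPage:
    try:
        # Look up the exact title first, following redirects
        return wikipedia.page(title, auto_suggest=False, redirect=True)
    except DisambiguationError as e:
        # Ambiguous title: fall back to the first listed option
        return wikipedia.page(e.options[0], auto_suggest=False, redirect=True)
    except PageError:
        # Page not found: let Wikipedia suggest a valid title
        return wikipedia.page(title, auto_suggest=True, redirect=True)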

def get_word_count(text : str) -> int:
    return len(text.split())

In [94]:
# Get the Wiki page on "List of countries and dependencies by population"
title = "List of countries and dependencies by population"

wikipedia_page = get_page(title) # Experiment with different values for auto_suggest and redirect when using the agent. See what happens
print("Word count:", get_word_count(wikipedia_page.content))
print(wikipedia_page.content)


Word count: 371
This is a list of countries and dependencies by population. It includes sovereign states, inhabited dependent territories and, in some cases, constituent countries of sovereign states, with inclusion within the list being primarily based on the ISO standard ISO 3166-1. For instance, the United Kingdom is considered a single entity, while the constituent countries of the Kingdom of the Netherlands are considered separately. In addition, this list includes certain states with limited recognition not found in ISO 3166-1. Also given in a percentage is each country's population compared with the world population, which the United Nations estimates at 8.13 billion as of 2024.


== Method ==

Figures used in this chart are based on the most up-to-date estimates or projections by the national census authority, where available, and are usually rounded off.
Where updated national data are not available, figures are based on the estimates or projections for 2024 by the Population 

<details><summary> Aside: Wikipedia API content can be weird </summary>

The wikipedia API often outputs content in unintuitive ways. For example, articles that are essentially just a big list become near useless, since the content omits the list (for example, see the wikipedia API content for <a href = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population">List of countries and dependencies by population</a>). This is fine for our purposes, as these list articles are often banned in wikipedia racing anyway. But this is why it's important to determine what content the wikipedia API accesses, and why you want to make sure you're testing a large diversity of wikipedia articles.

</details>

### Exercise - Get permitted links from a wikipedia page
```c
Difficulty: 🔴⚪⚪⚪⚪
Importance: 🔵🔵⚪⚪⚪

You should spend up to 5-10 mins on this exercise.
```

When you get the links from a page using `page.links`, this includes links that are not in the main page content (e.g. "create account" links, links in footnotes etc.), which are either irrelevant or not permitted by the rules of Wikirace. Write a simple function `get_permitted_links` that only returns links that are inside the main content. The resulting list of permitted links should be about a third as long as the list of links from the Wikipedia API (with more variance for shorter articles, as you would expect). This exercise is mainly meant to help you familiarize yourself with the Wikipedia API.

<!-- When writing this function, if you manage to get the links in a very effective way, then do that. But remember that Wikipedia is written by a large number of different contributors, often adhering to inconsistent stylings (especially for smaller articles). We just need to get something that **works well enough**. Put more time into doing this effectively if you want at the end, but as soon as something plausibly works, you should move on.

<img src="https://imgs.xkcd.com/comics/code_lifespan_2x.png" width="400px" style = "margin-left: auto; margin-right: auto;display:block"></img> -->

In [97]:
def get_permitted_links(current_page : WikipediaPage) -> list[str]:
    """
    Get "permitted" links (i.e. links that are in the content of the page) from a Wikipedia page.
    """
    all_links = current_page.links
    content = current_page.content
    permitted_links = [link for link in all_links if link in content]
    return permitted_links

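Finally, we've implemented a `get_content` function, which the agent will use to get the content of a Wikipedia page. This wraps all the text that corresponds to links in `<link></link>` tags (since otherwise the links are presented as plain strings, indistinguishable from normal text).

Note: this implementation has a known issue and currently returns some malformed output, as the example output below shows (e.g. `<link>C++</link>.de`). Link titles are inserted into the regular expressions unescaped, so titles containing regex metacharacters can match unintended text.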

<details><summary>Why not just use `page.links` to get a list of links directly?</summary>

We don't just present a list of the accessible links, as this is not very faithful to the wikipedia game. The agent does perform better if we just give it a list of links (as it generally has a rough idea of what links will be accessible from future pages), but the task of parsing the content of wikipedia pages and isolating the most important links is where the majority of the challenge of the wikipedia game comes from:)

</details>

In [98]:
def get_content(page : WikipediaPage) -> str:
    content = page.content
    permitted_links = get_permitted_links(page)
    for word in sorted(permitted_links, key=len, reverse = True):
        content = re.sub(" " + word + " ", " " + f"<link>{word}</link>" + " ", content, flags = re.I)
        content = re.sub(" " + word + ",", " " + f"<link>{word}</link>" + ",", content, flags=re.I)
        content = re.sub(" " + word + ".", " " + f"<link>{word}</link>" + ".", content, flags=re.I)
        content = re.sub("\(" + word + "\)", "(" + f"<link>{word}</link>" + ")", content, flags = re.I)
        content = re.sub(" " + word + "s", " " + f"<link>{word}</link>" + "s", content, flags = re.I)
    return content

wiki_page = get_page("Python (programming language)")
get_content(wiki_page)[:500]



'Python is a high-level, general-purpose programming language. Its design philosophy emphasizes <link>C++</link>.de readability with the use of significant indentation.\nPython is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), <link>Object-oriented</link> and functional programming. It is often described as a "batteries included" language due to its <link>C++</link>.mprehensive standard library.\nGuido van Rossum '
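The stray `<link>C++</link>.de` in the output above is consistent with link titles being interpolated into the regex unescaped: in the pattern `" C++."` the `+` characters are parsed as quantifiers and the `.` matches any character, so the pattern matches the start of " code". Below is a minimal sketch of one possible fix, assuming it is acceptable to escape each title with `re.escape` and to match the surrounding punctuation literally. The function name `wrap_links` and its plain-string signature are hypothetical, chosen so that the example runs without network access. It is also slightly more permissive than the original about brackets.

```python
import re

def wrap_links(content: str, links: list[str]) -> str:
    """Wrap occurrences of each link title in <link></link> tags (hypothetical helper)."""
    # Longest titles first, so longer titles are tagged before shorter ones they contain
    for word in sorted(links, key=len, reverse=True):
        escaped = re.escape(word)  # treat titles such as "C++" literally
        # Title preceded by a space or "(" and followed by a space, comma, full stop, ")" or plural "s"
        pattern = re.compile(r"(?<=[ (])" + escaped + r"(?=[ ,.)s])", flags=re.I)
        # A function replacement avoids backslashes in the title being read as escapes
        content = pattern.sub(lambda m, w=word: f"<link>{w}</link>", content)
    return content

text = "Its design emphasizes code readability. It borrows from C++, Java and (Haskell)."
print(wrap_links(text, ["C++", "Java", "Haskell"]))
```

With escaping in place, "code" is left untouched and only the literal titles are tagged.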

## Build the WikiGame

### Exercise - Build a class for the Wiki game
```c
Difficulty: 🔴🔴🔴🔴⚪
Importance: 🔵🔵🔵🔵⚪

You should spend up to 25-30 mins on this exercise.
```

Implement the following class that instantiates the wikipedia game. When the model uses tools it will be making calls to this class, so make sure that the functions return messages you're happy for the model to see as a tool response, such as error messages if the tool doesn't work.

Use code from the `get_permitted_links` and `get_content` functions above in this class.

In [259]:
class BaseWikiGame:
    def __init__(self, starting_page: str, goal_page: str, rules: Optional[List[Dict[str, str]]] = None):
        """
        Initialize the Wikipedia game object.

        Args:
            starting_page (str): The page the agent starts on.
            goal_page (str): The page the agent is trying to reach.
            rules (Optional[List[Dict[str, str]]]): Additional rules for the game.
                Supported rules:
                - "no country": Bans country articles.
                - "no backtrack": Bans backtracking.
        """
        self.page_history: List[str] = [starting_page]
        self.starting_page: WikipediaPage = self.get_page(starting_page)
        self.goal_page: WikipediaPage = self.get_page(goal_page)
        self.rules: Optional[List[Dict[str, str]]] = rules
        self.current_page: WikipediaPage = self.starting_page

    # ========================= Helper Functions (given) =========================
    # Get page and page summary
    @staticmethod
    def get_page(title : str) -> WikipediaPage:
        """
        Get a Wikipedia page object by the title.
        """
        try: 
            return wikipedia.page(title,auto_suggest=False,redirect=True)
        except DisambiguationError as e:
            return wikipedia.page(e.options[0], auto_suggest=False, redirect=True)
        except PageError as e:
            return wikipedia.page(title, auto_suggest=True, redirect=True)

    @staticmethod
    def get_page_summary(page : WikipediaPage) -> str:
        """
        Get summary of a wikipedia page, to the last full stop within the first 500 characters.
        """
        summary = page.content[:500]
        last_period_index = summary.rfind('.')
        return summary[:last_period_index + 1] if last_period_index != -1 else summary

    # Get current page content
    def get_content(self,**args : dict) -> str:
        '''
        Gives the content of the wikipedia page. Make sure that accessible links are wrapped with <link> </link> tags.
        '''
        content = self.current_page.content
        for word in sorted(self.get_permitted_links(), key=len, reverse = True):
            # Escape the title so regex metacharacters (e.g. in "C++") are matched literally
            escaped = re.escape(word)
            content = re.sub(" " + escaped + " ", " " + f"<link>{word}</link>" + " ", content, flags = re.I)
            content = re.sub(" " + escaped + ",", " " + f"<link>{word}</link>" + ",", content, flags = re.I)
            content = re.sub(" " + escaped + r"\.", " " + f"<link>{word}</link>" + ".", content, flags = re.I)
            content = re.sub(r"\(" + escaped + r"\)", "(" + f"<link>{word}</link>" + ")", content, flags = re.I)
            content = re.sub(" " + escaped + "s", " " + f"<link>{word}</link>" + "s", content, flags = re.I)
        return content

    # Get and check permitted links
    def get_permitted_links(self, title: Optional[str] = None) -> list[str]:
        """
        Returns a list of permitted links (i.e. links in the main page content) for the page with the given title, or for the current page if no title is given.
        """
        page = self.get_page(title) if title else self.current_page
        content = page.content
        return [link for link in page.links if link in content]

    def is_permitted_link(self, link : str) -> bool:
        """
        Returns True if the link is in the permitted links for the current page, False otherwise.
        """
        return link.lower() in (x.lower() for x in self.get_permitted_links())


    # ========================= Task State Management (to implement) =========================

    @property
    def start_instruction(self) -> dict:
        """
        Generate the start instructions for the game.
        """
        return {
            "role" : "system", 
            "content" : f"You are a wikipedia-racing AI. Your goal is to reach {self.goal_page.title} by accessing links from a series of wikipedia pages. Your current page is {self.current_page.title}."
            }

    @property
    def on_page_instruction(self) -> dict:
        """
        Generate instructions for the current page.
        """
        return {
            "role" : "user", 
            "content" : f"""You are currently on page: {self.current_page.title}. Make sure you start by reasoning about what steps you should take to get to the article on {self.goal_page.title}. When coming up with a strategy, make sure to pay attention to the path you have already taken, and if your current strategy doesn't seem to be working out, try something else. In case you're unsure, {self.goal_page.title} has the following summary: 
            
            [Begin Summary]
            {self.get_page_summary(self.goal_page)}
            [End Summary]"""
            }
    
    @property
    def next_step_instruction(self) -> dict:
        return {
            "role" : "user", 
            "content" : f"What's your next step to get to {self.goal_page.title}?"
            }

    def get_instructions(self, system: bool, on_page: bool, next_step: bool) -> list[dict]:
        """
        Generate instruction messages based on the current game state.
        """
        messages = []
        if system:
            messages.append(self.start_instruction)
        if on_page:
            messages.append(self.on_page_instruction)
        if next_step:
            messages.append(self.next_step_instruction)
        return messages


    def move_page(self, new_page: str) -> str:
        """
        Changes the current page of the game. To be used when we want to move from Page A to Page B
        """
        new_page_lower = new_page.lower()
        new_page_normalized = new_page.replace("_", " ")

        if self.is_permitted_link(new_page_lower) or self.is_permitted_link(new_page_normalized):
            self.current_page = self.get_page(new_page_normalized)
            self.page_history.append(self.current_page.title)
            return f"Moving page to {self.current_page.title}"
        else:
            return f"Couldn't move page to {new_page}. This is not a valid link."

    def check_win(self) -> bool:
        return self.current_page == self.goal_page




### Exercise - Build tools for Wiki game
```c
Difficulty: 🔴⚪⚪⚪⚪
Importance: 🔵🔵🔵⚪⚪

You should spend up to 5-10 mins on this exercise.
```

Fill in the following list of tools that the agent will need to use to accomplish this game. Name each tool after the corresponding method in the `BaseWikiGame` class above. This will help us later when we write the code for the agent, as it will be easier for us to execute the correct tool.

When formatting this tool list, refer back to the tools you wrote for the arithmetic game, or else the docs are [here](https://platform.openai.com/docs/guides/function-calling).

In [141]:
get_content_tool = {
    "type" : "function",
    "function" : {
        "name" : "get_content",
        "description" : "Get all the content for the wikipedia page you are currently on. Anything which corresponds to a link you can select will be wrapped in <link></link> tags.",
        "parameters" : {
            "type" : "object",
            "properties" : {},
            "required" : []
        }
    }
}

move_page_tool = {
    "type" : "function",
    "function" : {
        "name" : "move_page",
        "description" : "Changes your current page to a specified new page which is accessible via a link from the current page. You can only call this function once at a time, as it will take you to a different page.",
        "parameters": {
            "type": "object",
            "properties": {
                "new_page": {
                    "type": "string",
                    "description": "The title of the new page you want to move to. This should be formatted the way the title appears on wikipedia (e.g. to move to the wikipedia page for the United States of America, you should enter \"United States\"). Underscores are not necessary.",
                },
            },
            "required": ["new_page"]
        }
    }
}

wiki_game_tools = [get_content_tool, move_page_tool]
    

### Exercise - Build a Wikipedia racer agent
```c
Difficulty: 🔴🔴🔴🔴⚪
Importance: 🔵🔵🔵🔵🔵

You should spend up to 30-40 mins on this exercise.
```

Now that you have the `BaseWikiGame` class and tools set up, build out a `WikiAgent` that can access these tools and solve the Wikipedia game. Build the agent so that it can be thrown into an agent loop (similar to the one we had for the arithmetic game) without much additional scaffolding.

There are a few further considerations in this case that we didn't have for the arithmetic game. 

<details>
<summary>Context window considerations</summary>

Since the agent will need to read (potentially very long) Wikipedia articles to interact with the game, the length of the context window becomes relevant. GPT-4o and GPT-4o-mini both have context windows of 128k tokens (which corresponds to ~96k words). For reference, the Wikipedia page for the United States alone has around 10k words, and the agent will often need to visit more than 10 articles in one run of the game. Together with the agent's own output, this quickly adds up. We'll solve this for now by resetting the agent's messages every time it reaches a new Wikipedia page and providing an updated `user_message` (and possibly `system_message`), so that the agent can locate itself and then proceed with the game. Be careful, therefore, to include the current page and the goal page in the `user_message`. We'll address other methods for solving this issue later; you can probably already think of some.

</details>
<br>
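To make the context-window arithmetic concrete, the sketch below estimates how many long articles fit in the window, using the figures quoted in this section (128k tokens corresponding to roughly 96k words, i.e. about 0.75 words per token). The helper name `estimate_tokens` and the fixed ratio are illustrative assumptions; real token counts depend on the tokenizer and the text.

```python
CONTEXT_TOKENS = 128_000   # context window quoted for GPT-4o / GPT-4o-mini
WORDS_PER_TOKEN = 0.75     # rough ratio implied by 128k tokens ~ 96k words

def estimate_tokens(word_count: int) -> int:
    """Rough token estimate from a word count (hypothetical helper, not a tokenizer)."""
    return round(word_count / WORDS_PER_TOKEN)

article_words = 10_000  # roughly the length of the United States article
tokens_per_article = estimate_tokens(article_words)
articles_that_fit = CONTEXT_TOKENS // tokens_per_article
print(tokens_per_article, articles_that_fit)
```

Under these assumptions, fewer than ten long articles fit in the window before counting the agent's own output, which is why the history is reset on each new page.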


<details><summary>Providing information to the agent</summary>

There shouldn't be much on Wikipedia that the agent is completely unfamiliar with (AI companies *will* have scraped Wikipedia). However, the goal page may be easily confused with something else, or be an article that was added after the training cutoff, and models can't always accurately use things in their training data if they only come up once or twice. So you should use the game's `get_page_summary` function to provide details of the goal page to the agent in its initial message.

</details>

<details><summary> Getting output from the agent </summary>

In this case we'll have a lot more moving pieces than the `arithmeticGame` agent. There, we could just print output from the agent loop. Here, we strongly recommend printing output as it comes up in the agent class. If there's some chance you might not want to see this output, use a flag to determine whether to print content or not.

</details>

When making calls to the `BaseWikiGame` class, make use of Python's `getattr()` function ([explanation here](https://www.w3schools.com/python/ref_func_getattr.asp)).
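As a minimal sketch of this pattern, the toy class below stands in for the game (`ToyGame` and `call_tool` are hypothetical names and not part of the exercise code). The tool name and the JSON-encoded arguments returned by the model are used to look up and call the matching method.

```python
import json

class ToyGame:
    """Hypothetical stand-in for the game class, with methods named like the tools."""
    def __init__(self):
        self.current_page = "Start"

    def get_content(self) -> str:
        return f"Content of {self.current_page}"

    def move_page(self, new_page: str) -> str:
        self.current_page = new_page
        return f"Moving page to {new_page}"

def call_tool(task, name: str, arguments: str) -> str:
    """Look up the method called `name` on `task` and call it with JSON-encoded kwargs."""
    func = getattr(task, name, None)
    if func is None or not callable(func):
        # Return the error as text so the model can see it as a tool response
        return f"Error: unknown tool {name}"
    return func(**json.loads(arguments))

toy_game = ToyGame()
print(call_tool(toy_game, "move_page", '{"new_page": "Physics"}'))
print(call_tool(toy_game, "get_content", "{}"))
```

Because the tools are named after the methods, no `if`/`elif` chain over tool names is needed.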




In [258]:
class WikiAgent(SimpleAgent):
    def __init__(self, task: BaseWikiGame, model = "gpt-4o-mini", tools = wiki_game_tools, verbose = True):
        super().__init__(model=model, tools=tools, task=task)
        
        self.history = [] 
        self.full_history = [] # All messages that have been sent in the chat history. We have to erase each time a new page is reached for context window reasons.
        self.path_count = 0
        self.verbose = verbose

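    def set_verbose(self, verbose: bool):
        # Named set_verbose because a method called `verbose` would be shadowed by the self.verbose attribute
        self.verbose = verbose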
    
    
    def update_history(self, message):
        """
        Update self.history and self.full_history with the message or list of messages.
        """
        if isinstance(message, list):
            self.history.extend(message)
            self.full_history.extend(message)
        else:
            self.history.append(message)
            self.full_history.append(message)
        
    def reset_history(self):
        """
        Empty self.history of the agent.
        """
        self.history = []

    
    def handle_tool_calls(self, response: ChatCompletionMessage):
        """
        Handles tool_calls: 
            - Executes the tool calls using execute_tool_calls
            - Appends the original tool call & tool_responses to the history
            - If the agent has moved to a new page, resets the history
            - If not, get the "What's your next step?" instruction from the task and append it to history
        """

        # Append the original function calls to the conversation
        self.update_history(response)

        # Execute the tool calls
        tool_responses = self.execute_tool_calls(response)

        move_to_new_page = False
        
        # Handle tool responses
        for tool_call, tool_response in zip(response.tool_calls, tool_responses):
            self.update_history(apply_tool_call_format(tool_call, tool_response))
            
            if self.verbose: print(f"\nTOOL CALL: \nTool = {tool_call.function.name}, Args = {tool_call.function.arguments} \nTOOL RESPONSE:\n {tool_response[:300]}")

            if tool_response.startswith("Moving page"):
                move_to_new_page = True
        
        if move_to_new_page:
            # Reset the history
            self.reset_history()
            self.path_count += 1
            print(f"{'-' * 50} \n\nMOVED TO PAGE \n\nPATH HISTORY (N={self.path_count}): {' -> '.join(self.task.page_history)} \n\n{'-' * 50}")

            # Give starting instructions
            self.start()

        else:
            # Append "What's your next step?" instruction
            next_step_message = self.task.get_instructions(system=False, on_page=False, next_step=True)
            self.update_history(next_step_message)
            if self.verbose: print(f"\nUSER: \n{next_step_message[0]['content']}")

    def handle_refusal(self, response: ChatCompletionMessage):
        self.update_history(apply_assistant_format(response.refusal))
        if self.verbose: print(f"\nMODEL REFUSAL: {response.refusal}")

    def start(self):
        """
        Gives the starting instructions to the agent when starting the game.
        """
        instruction_message = self.task.get_instructions(system=True, on_page=True, next_step= False)
        self.update_history(instruction_message)
        if self.verbose: print(f"\nSYSTEM: \n{instruction_message[0]['content']} \n\nUSER: \n{instruction_message[1]['content']}")

    def run(self):
        # Get the response from the model
        response = self.get_response()

        ## If response contains normal text, append this as an assistant response
        if response.content is not None:
            self.update_history(apply_assistant_format(response.content))
            if self.verbose: print(f"\nMODEL RESPONSE: \n{response.content}")

        # Handle the response
        ## If tool calls, do the tool calls and return the response
        if response.tool_calls:
            self.handle_tool_calls(response)
            
        ## If no tool call: Handle edge cases
        ### Check if there's a refusal to answer:
        elif response.refusal:
            self.handle_refusal(response)


### Exercise - Run the task
```c
Difficulty: 🔴⚪⚪⚪⚪
Importance: 🔵🔵⚪⚪⚪

You should spend up to 5 mins on this exercise.
```

Just like we did for the arithmetic agent, you should write an agent loop for the wikipedia agent (in this case you won't need to print output, as we handled it in the agent class so this function should be a very simple loop).

In [256]:
def agent_loop(agent, game, num_loops = 10):

    agent.start()

    for i in range(num_loops):
        if game.check_win():
            print("Success!")
            return 
        agent.run()

game = BaseWikiGame("1511 Daléra", "Zalíbená")
agent = WikiAgent(task=game, model="gpt-4o-mini")

In [257]:
agent_loop(agent, game, 10)


SYSTEM: 
You are a wikipedia-racing AI. Your goal is to reach Zalíbená by accessing links from a series of wikipedia pages. Your current page is 1511 Daléra. 

USER: 
You are currently on page: 1511 Daléra. Make sure you start by reasoning about what steps you should take to get to the article on Zalíbená. When coming up with a strategy, make sure to pay attention to the path you have already taken, and if your current strategy doesn't seem to be working out, try something else. In case you're unsure, Zalíbená has the following summary: 
            
            [Begin Summary]
            Zalíbená is a village and administrative part of Podveky in Kutná Hora District in the Central Bohemian Region of the Czech Republic. It has about 20 inhabitants.
            [End Summary]

MODEL RESPONSE: 
To reach the article on Zalíbená from the current page 1511 Daléra, I need to find links that may lead me toward discussing villages, geographical locations, or administrative divisions in the Cz

KeyboardInterrupt: 

# 4️⃣ Elicitation

Conducting a capability evaluation is necessarily a very tricky task. This is because we're generally trying to show that the model does or does not have a certain capability (which we're worried about). If the model does have this capability, then we should be able to demonstrate this, and we're done. However, even if we fail to demonstrate a capability, the model might still *have* this capability, perhaps because:

- We prompted the model poorly.

- We stored the long-term history poorly.

- We didn't give the model sufficient tools to accomplish the task.

- There is some minor additional scaffolding that would drastically increase performance.

It is easy to forget due to its ubiquity today, but it took *about 3 years* after the release of GPT-2 (and more than 1.5 years after the release of GPT-3) for people to discover that [chain-of-thought reasoning improves model performance](https://arxiv.org/abs/2201.11903). A huge potential failure mode for the project of AI safety is that people discover similar breakthroughs that push model performance much higher with minimal additional training (and minimal additional alignment properties), so we want to ensure we're trying really hard to elicit the best capability we possibly can, until we feel we've managed to gain [*evidence of absence*](https://en.wikipedia.org/wiki/Evidence_of_absence), not just *absence of evidence*.

Broadly speaking, there are two categories of elicitation, narrow elicitation and general elicitation:

- Narrow elicitation methods are those which will improve model performance on a particular task, or small class of tasks, but won't necessarily impact model performance in a more general way. A good example of narrow elicitation for the wikipedia game might be giving the agent access to the content of arbitrary wikipedia articles. This will improve performance on this task significantly, but wouldn't generalize to a broad array of tasks.

- General elicitation methods are those which will improve model performance on a wide array of possible tasks. A good example of a general elicitation method is chain-of-thought. This tends to improve model performance on a wide array of tasks. These sorts of elicitation methods are the ones we're most interested in, as if researchers find an improvement to models that is roughly as easy and effective as chain-of-thought prompting, then we would see a very rapid increase in risk from AI.


## Prompting

You should already be aware that prompting can have a large impact on model performance. There are a large number of possible changes to prompts for this task. You should experiment first with more general elicitation methods such as getting the agent to think more deeply, and output plans in different ways. 

After this, you might try a wide array of narrow elicitation methods including:

- Telling the agent how many pages it's visited.

- Telling the agent if it's already visited the page it's on (and how many times).

- Scheduling different prompts and planning methods for the "zoom out" and "zoom in" sections of the game, since the general strategy for the wikipedia game looks like:

    Specific article (with few links) -> General article (with many links) -> Specific article (with few links)
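The first two ideas can be sketched as a small helper that turns the path so far into a status line for the prompt. The function name `visit_status` is hypothetical. It assumes a list of page titles such as the `page_history` kept by the game class, with the current page as its last entry.

```python
from collections import Counter

def visit_status(page_history: list[str]) -> str:
    """Summarise pages visited and any revisits of the current page (hypothetical helper)."""
    current = page_history[-1]
    times_here = Counter(page_history)[current]
    status = f"You have visited {len(page_history)} pages so far: {' -> '.join(page_history)}."
    if times_here > 1:
        # Flag loops explicitly so the agent can change strategy
        status += f" You have already been on {current} {times_here - 1} time(s) before."
    return status

print(visit_status(["Aristotle", "Plato", "Aristotle"]))
```

A string like this could be appended to the on-page instruction, so the information survives the history reset that happens on each page move.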


### Exercise - Engineer prompts
```c
Difficulty: 🔴🔴⚪⚪⚪
Importance: 🔵🔵🔵⚪⚪

You should spend up to 20-25 mins on this exercise.
```

Remember that your prompts will have to be robust to:

* Different tasks within the wikipedia game, 

* Different states within those tasks,

* Different failure-modes the agent could encounter.

Mess around with the prompting setup and see if you can significantly improve performance.

In [224]:
class WikiGame(BaseWikiGame):

    @property
    def start_instruction(self):
        return {
            "role" : "system", 
            "content" : f"You are a wikipedia-racing AI. Your goal is to reach {self.goal_page.title} by accessing links from wikipedia pages. Your current page is {self.current_page.title}."
            }
            
    @property
    def on_page_instruction(self):
        return {
            "role" : "user",
            "content" : f"""You are currently on page: {self.current_page.title}. Make sure you start by reasoning about what steps you should take to get to the article on {self.goal_page.title}. When coming up with a strategy, make sure to pay attention to the path you have already taken, and if your current strategy doesn't seem to be working out, try something else. In case you're unsure, {self.goal_page.title} has the following summary: 
            
            [Begin Summary]
            {self.get_page_summary(self.goal_page)}
            [End Summary]
            """
            }

In [225]:
game = WikiGame("Aristotle", "Othello")
agent = WikiAgent(game, model="gpt-4o-mini")
agent_loop(agent, game, 5)


SYSTEM: You are a wikipedia-racing AI. Your goal is to reach Othello by accessing links from wikipedia pages. Your current page is Aristotle. 

USER: You are currently on page: Aristotle. Make sure you start by reasoning about what steps you should take to get to the article on Othello. When coming up with a strategy, make sure to pay attention to the path you have already taken, and if your current strategy doesn't seem to be working out, try something else. In case you're unsure, Othello has the following summary: 
            
            [Begin Summary]
            Othello (; full title: The Tragedy of Othello, the Moor of Venice) is a tragedy written by William Shakespeare around 1603. Set in Venice and Cyprus, the play depicts the Moorish military commander Othello as he is manipulated by his ensign, Iago, into suspecting his wife Desdemona of infidelity. Othello is widely considered one of Shakespeare's greatest works and is usually classified among his major tragedies alongsid

### Exercise - Implement the ReAct framework
```c
Difficulty: 🔴🔴⚪⚪⚪
Importance: 🔵🔵🔵⚪⚪

You should spend up to 10-15 mins on this exercise.
```

Chain-of-thought prompting confers significant benefits to model performance, and you probably tried it when you messed around with prompting above. But when we're using LLMs as agents, we may want to provide a different structure to elicit reasoning. This is called the [**ReAct** framework](https://arxiv.org/abs/2210.03629); it consists of:

- Getting the model to generate **Re**asoning about its current situation, and what sort of actions it should consider taking.

- Then getting the model to perform an **Act**ion based on its outputted reasoning.

Remember that when you call the model without tools, it is not given any description of the tools, so we need to include the tool descriptions in the `system_message` ourselves (this leads to some redundancy when the model later takes an action with tools, but that's alright).
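
One way to render the tool descriptions as text for the system message is sketched below. It assumes the tools follow the OpenAI function-calling schema used elsewhere in this notebook (`tool["function"]["name"]` and `tool["function"]["description"]`); `describe_tools` and `example_tools` are illustrative names, and the example descriptions are shortened placeholders.

```python
def describe_tools(tools: list[dict]) -> str:
    """Render OpenAI-style tool schemas as one 'name: description' line per tool."""
    lines = [f'{tool["function"]["name"]}: {tool["function"]["description"]}' for tool in tools]
    return f"You have access to {len(tools)} tools:\n" + "\n".join(lines)


# Shortened placeholder schemas, for illustration only
example_tools = [
    {"type": "function", "function": {"name": "get_content", "description": "Get the content of the current page."}},
    {"type": "function", "function": {"name": "move_page", "description": "Move to a linked page."}},
]

print(describe_tools(example_tools))
```

Building the string in a helper also avoids putting `"\n".join(...)` inside an f-string expression, which is a syntax error before Python 3.12.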

In [226]:
class WikiGameReAct(WikiGame):
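    def __init__(self, starting_page: str, goal_page: str, rules: Optional[List[Dict[str, str]]] = None, tools: Optional[List[dict]] = None):
        super().__init__(starting_page, goal_page, rules)
        # Fall back to an empty list so that len() and iteration still work when no tools are passed
        self.tool_descriptions = tools if tools is not None else []
    
    def add_tool(self, tools: dict | List[dict]):
        # Accept either a single tool description or a list of them
        if isinstance(tools, dict):
            self.tool_descriptions.append(tools)
        elif isinstance(tools, list):
            self.tool_descriptions.extend(tools)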

    @property
    def start_instruction(self):
        # Provide a description of the tools in the system message. When generate is called with tools this is redundant, but when generate is called without tools, this is useful.
        # The list is built outside the f-string because backslashes inside f-string expressions are a syntax error before Python 3.12.
        tool_list = "\n".join(tool["function"]["name"] + ": " + tool["function"]["description"] for tool in self.tool_descriptions)
        return {
            "role" : "system",
            "content" : f"""You are a wikipedia-racing AI. Your goal is to reach {self.goal_page.title} by accessing links from wikipedia pages. Your current page is {self.current_page.title}. You have access to {len(self.tool_descriptions)} tools:\n{tool_list}"""
        } 
    
    @property
    def on_page_instruction(self):
        # You may or may not want to edit your standard user message
        return {
            "role" : "user",
            "content" : f"""You are currently on page: {self.current_page.title}. Make sure you start by reasoning about what steps you should take to get to the article on {self.goal_page.title}. When coming up with a strategy, make sure to pay attention to the path you have already taken, and if your current strategy doesn't seem to be working out, try something else. In case you're unsure, {self.goal_page.title} has the following summary: 
            
            [Begin Summary]
            {self.get_page_summary(self.goal_page)}
            [End Summary]
            """
        }
    
class WikiReActAgent(WikiAgent):

    def generate_reason(self) -> ChatCompletionMessage:
        # Get the model to reason about the current state of the game and add the response to the messages (you may not want to give it tools for this)
        self.history.append(apply_user_format(f"Think carefully about your current situation and what actions you want to take to get closer to {self.task.goal_page.title}."))
        response = self.get_response(use_tool=False)
        return response
        
    def generate_action(self) -> ChatCompletionMessage:
        # Get the model to generate an action based on the reasoning and add the response to the messages
        self.history.append(apply_user_format("What action do you want to take?"))
        response = self.get_response()
        return response
    
    def generate_reason_and_action(self):
        """
        Generate a reason, store this in history, then generate and return an action.
        """ 
        reason = self.generate_reason()
        self.update_history(apply_assistant_format(reason.content))
        print("\nModel response ('Reason'):", reason.content)

        action = self.generate_action()

        return action

    def run(self):
        """
        Run one loop of the agent.
        """
        response = self.generate_reason_and_action()

        if response.tool_calls:
            self.handle_tool_calls(response)
        elif response.refusal:
            self.handle_refusal(response)
        


You may have to rewrite your `agent_loop`.

In [200]:
def agent_loop_ReAct(game, agent, num_loops = 10):
    agent.start()
    for i in range(num_loops):
        if game.check_win():
            print("Success")
            return
        agent.run()

game = WikiGameReAct("Aristotle", "Othello", tools=wiki_game_tools)
agent = WikiReActAgent(game, model="gpt-4o-mini", tools = wiki_game_tools)

In [228]:
agent_loop_ReAct(game, agent)



SYSTEM: You are a wikipedia-racing AI. Your goal is to reach Othello by accessing links from wikipedia pages. Your current page is Mimesis. You have access to 2 tools:
get_content: Get all the content for the wikipedia page you are currently on. Anything which corresponds to a link you can select will be wrapped in <link></link> tags.
move_page: Changes your current page to a specified new page which is accessible via a link from the current page. You can only call this function once at a time, as it will take you to a different page. 

USER: You are currently on page: Mimesis. Make sure you start by reasoning about what steps you should take to get to the article on Othello. When coming up with a strategy, make sure to pay attention to the path you have already taken, and if your current strategy doesn't seem to be working out, try something else. In case you're unsure, Othello has the following summary: 
            
            [Begin Summary]
            Othello (; full title: The

## Scaffolding

In the literature on AI agents, scaffolding generally refers to everything that is not the language model itself. So prompting and tool use are both part of the scaffolding, as are frameworks like ReAct or Reflexion, or providing the history. Since we spend more time on prompting methods and tool use, these get their own categories here, so "scaffolding" becomes essentially a category for things that are more "structural" than simple tool use or prompting methods. But you shouldn't be surprised when people use scaffolding as a catch-all term in papers.

In this section we will improve the scaffolding by:

- Telling the agent its history.

- Giving the agent a simple "[reflexion tool](https://arxiv.org/pdf/2303.11366)".

- Telling the agent if it's already visited a page.


### Exercise - Implement an agent history property
```c
Difficulty: 🔴⚪⚪⚪⚪
Importance: 🔵🔵⚪⚪⚪

You should spend up to 5-10 mins on this exercise.
```

You may notice that the agent frequently gets stuck in loops. Since we're already storing a history of page titles in the game class, we should try providing this information to the agent and see if it improves the looping behavior. Implement this below:

In [218]:
class WikiGameHistory(WikiGameReAct):
    
    @property
    def on_page_instruction(self):
        return {
            "role" : "user",
            "content" : f"""You are currently on page: {self.current_page.title}. Make sure you start by reasoning about what steps you should take to get to the article on {self.goal_page.title}. When coming up with a strategy, make sure to pay attention to the path you have already taken, and if your current strategy doesn't seem to be working out, try something else. In case you're unsure, {self.goal_page.title} has the following summary: 
            
            [Begin Summary]
            {self.get_page_summary(self.goal_page)}
            [End Summary]
            
            The path you have taken so far is: {" -> ".join(self.page_history)}"""
        }

### Exercise - Implement a reflexion tool
```c
Difficulty: 🔴🔴🔴⚪⚪
Importance: 🔵🔵🔵⚪⚪

You should spend up to 10-15 mins on this exercise.
```

The [Reflexion paper](https://arxiv.org/abs/2303.11366) finds that LLMs perform better on tasks when they can perform "lookahead" and get feedback on their plans. We will imitate this by allowing the agent to suggest candidate paths, and informing it where these paths go wrong (if they do). You'll need to add this tool to the list of tools.

We don't want to provide the agent the links/content of every page when it does this lookahead, as then we'd just be reimplementing a smaller version of the game *inside the game*. Instead, we'll let the agent suggest paths without seeing any content or links, and then let it know if this path works. It's very likely that a suggested link will, at some point, not be accessible from one of the pages, but this should still help to guide the agent's plans.

In [121]:
class WikiGameTestPath(WikiGame):
    def __init__(self, starting_page : str, goal_page : str, rules : Optional[list] = None):
        super().__init__(starting_page, goal_page, rules)

    def test_path(self, path: str) -> str:
        """
        Test if a given path is valid.

        Args:
            path (str): A string representing a path, e.g., "Barack Obama -> Indonesia -> India"

        Returns:
            str: A message indicating whether the path is valid or where it fails.
        """
        # Drop empty segments so that an empty or whitespace-only string counts as an empty path
        path_nodes = [node.strip() for node in path.split("->") if node.strip()]

        if not path_nodes:
            return "ERROR: Empty path provided."

        if path_nodes[0] != self.current_page.title:
            return f"ERROR: The path should start with the current page: {self.current_page.title}"

        # Check each consecutive pair of pages in the path
        for i in range(len(path_nodes) - 1):
            current_node = path_nodes[i]
            next_node = path_nodes[i + 1]

            # Compare titles case-insensitively
            permitted_links = set(link.lower() for link in self.get_permitted_links(current_node))

            if next_node.lower() not in permitted_links:
                return f"This path works until {next_node}, which is not accessible from {current_node}"

        return "This path is valid."
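
To see the intended behaviour without calling Wikipedia, here is a standalone sketch of the same check run against a toy link graph. The graph `toy_links` is made up for illustration (these are not real Wikipedia link lists), and `check_path` is a hypothetical free-function version of the method above.

```python
def check_path(path: str, links: dict[str, set[str]], current_page: str) -> str:
    """Walk an 'A -> B -> C' path through a link graph and report the first broken hop."""
    nodes = [node.strip() for node in path.split("->") if node.strip()]
    if not nodes:
        return "ERROR: Empty path provided."
    if nodes[0].lower() != current_page.lower():
        return f"ERROR: The path should start with the current page: {current_page}"
    # Pair each page with its successor and check that the hop exists
    for current_node, next_node in zip(nodes, nodes[1:]):
        permitted = {link.lower() for link in links.get(current_node, set())}
        if next_node.lower() not in permitted:
            return f"This path works until {next_node}, which is not accessible from {current_node}"
    return "This path is valid."


# Toy graph, invented for illustration only
toy_links = {
    "Aristotle": {"Poetics", "Plato"},
    "Poetics": {"Tragedy"},
    "Tragedy": {"Othello"},
}

print(check_path("Aristotle -> Poetics -> Tragedy -> Othello", toy_links, "Aristotle"))
print(check_path("Aristotle -> Tragedy", toy_links, "Aristotle"))
```

The agent only learns where a path breaks, not which links are available instead, which keeps the tool from becoming a copy of the game itself.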

Now write a description of this tool and add it to the list of `wiki_game_tools`.

In [122]:
test_path_tool = {
    "type" : "function",
    "function" : {
        "name" : "test_path",
        "description" : "Accepts a test path string in the form \"current_page -> page1 -> page2 -> ... -> pageN\". If the path does not work, it returns where the path goes wrong; if the path does work, it returns \"This path is valid.\" Be careful: page titles are sensitive to plurals and rephrasings. This tool is especially useful for checking longer plans.",
        "parameters" : {
            "type" : "object",
            "properties": {
                "path" : {
                    "type" : "string",
                    "description" : "The path you want to test, formatted as \"current_page -> page1 -> page2 -> ... -> pageN\"."
                },
            },
            "required" : ["path"]
        }
    }
}

wiki_game_tools = [get_content_tool, move_page_tool, test_path_tool]

In [None]:
game = WikiGameTestPath("William Pitt the Younger", "Central Vietnam")
agent = WikiReActAgent(game, model="gpt-4o-mini", tools = wiki_game_tools)
agent_loop_ReAct(game, agent, 40)

## Tool use

We can also give the agent additional tools that may be useful for the Wikipedia game task, or for more general tooling methods.



However, if you give the agent too many tools (especially with poor descriptions), performance can often suffer. This happens most prominently when using more than 5-10 tools.

### Exercise - Implement a page summary tool
```c
Difficulty: 🔴🔴⚪⚪⚪
Importance: 🔵🔵⚪⚪⚪

You should spend up to 10-15 mins on this exercise.
```

Implement a tool that allows an agent to get a summary of an accessible page. This imitates wikipedia's native 'hover summary' tool.


In [125]:
get_page_summary_tool = {
    "type" : "function",
    "function" : {
        "name" : "get_page_summary",
        "description" : "Get the summary of a wikipedia page you are considering moving to, up to the last full stop within the first 500 characters. The page needs to be accessible via a link from the current page.",
        "parameters" : {
            "type" : "object",
            "properties" : {
                "page" : {
                    "type" : "string",
                    "description" : "The title of the wikipedia page you want to get the summary of."
                }
            },
            "required" : ["page"]
        }
    }
}

class WikipediaGamePageSummary(WikiGameTestPath):
    def __init__(self, starting_page : str, goal_page : str, rules : Optional[list] = None):
        super().__init__(starting_page, goal_page, rules)

    # Note: this overrides the get_page_summary method that the prompts above call on the goal page
    def get_page_summary(self, page : WikipediaPage) -> str:
        # Only summarise pages reachable from the current page
        # (assumes is_permitted_link is a method of the game class that takes a page title)
        if self.is_permitted_link(page.title):
            summary = page.content[0:500]
            # rfind returns -1 when there is no full stop; return the whole snippet in that case
            last_stop = summary.rfind(".")
            return summary[0 : last_stop + 1] if last_stop != -1 else summary
        else:
            return "This page is not accessible from the current page, so a summary cannot be returned."

wiki_game_tools.append(get_page_summary_tool)
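
The truncation rule ("to the last full stop within the first 500 characters") is easy to get subtly wrong when the snippet contains no full stop at all. Here is a standalone sketch of just that step; `truncate_to_sentence` is an illustrative name rather than part of the game class.

```python
def truncate_to_sentence(text: str, limit: int = 500) -> str:
    """Return text cut at the last full stop within the first `limit` characters."""
    snippet = text[:limit]
    last_stop = snippet.rfind(".")
    # rfind returns -1 when there is no full stop; fall back to the raw snippet
    return snippet if last_stop == -1 else snippet[: last_stop + 1]


print(truncate_to_sentence("One. Two. Three", limit=12))
```

Using `rfind` instead of `rindex` avoids a `ValueError` on pages whose first characters contain no full stop.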

### Exercise - Implement an arbitrary page summary/content tool
```c
Difficulty: 🔴⚪⚪⚪⚪
Importance: 🔵🔵⚪⚪⚪

You should spend up to 5-10 mins on this exercise.
```

Now implement a tool that allows the agent to suggest any wikipedia page and get its summary or content, whether or not that page is accessible from the current one. This may help the agent plan further ahead.


In [None]:
get_page_content_tool = {
    "type" : "function",
    "function" : {
        "name" : "get_page_content",
        "description" : "Get the content of any wikipedia page, given its title. The page does not need to be accessible from the current page.",
        "parameters" : {
            "type" : "object",
            "properties" : {
                "page" : {
                    "type" : "string",
                    "description" : "The title of the wikipedia page you want to get the content of."
                }
            },
            "required" : ["page"]
        }
    }
}

class WikipediaGamePageContent(WikipediaGamePageSummary):
    def __init__(self, starting_page : str, goal_page : str, rules : Optional[list] = None):
        super().__init__(starting_page, goal_page, rules)

    def get_page_content(self, arguments : dict) -> str:
        # The model supplies a page title, so look up the page before reading its content
        page = get_page(arguments["page"])
        return page.content

### Exercise - Implement a ctrl-F tool

Still need to do this. Probably will though. Not super urgent. Might add more elicitation stuff later if I think of any that seem cool
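
One possible starting point for such a tool is a plain text search that returns short snippets around each match. This is only a sketch under assumptions: `find_in_page` is a hypothetical name, it works on any string (for example a page's content), and wiring it into the game class and writing its tool description are left open.

```python
import re


def find_in_page(content: str, query: str, window: int = 40, max_hits: int = 5) -> list[str]:
    """Return up to max_hits snippets surrounding case-insensitive matches of query."""
    if not query:
        return []
    snippets = []
    # re.escape makes the query match literally, even if it contains regex metacharacters
    for match in re.finditer(re.escape(query), content, flags=re.IGNORECASE):
        left = max(0, match.start() - window)
        right = min(len(content), match.end() + window)
        snippets.append(content[left:right])
        if len(snippets) >= max_hits:
            break
    return snippets


text = "Aristotle taught Alexander. Later, alexander conquered Persia."
for snippet in find_in_page(text, "alexander", window=5):
    print(repr(snippet))
```

Capping the number of hits and the window size keeps the tool output short, which matters because every tool result is added to the agent's context.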

### Supervised Fine-Tuning

We're not going to conduct supervised fine-tuning here. But it's worth mentioning as an elicitation method, just because it can be so powerful. [ADD MORE INFO HERE LATER]

# 5️⃣ Bonus

### Exercise - Implement additional rules

Test agent performance on these tasks:
- Task 1

- Task 2

- Task 3

- Task 4

- Task 5

- Task 6

- Task 7

- Task 8

- Task 9

- Task 10

See what combination of tools appears to work best.

### Exercise - Rearrange so that each page is broken up by sections

In [70]:
x = get_page("Aristotle")
print(dir(x))
print(x.section("Metaphysics/Substance"))

['_WikipediaPage__continued_query', '_WikipediaPage__load', '_WikipediaPage__title_query_param', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'categories', 'content', 'coordinates', 'html', 'images', 'links', 'original_title', 'pageid', 'parent_id', 'references', 'revision_id', 'section', 'sections', 'summary', 'title', 'url']
None
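
The `None` above suggests that `section` needs an exact heading title, so it may be easier to split the page text yourself. A minimal sketch, assuming the page's `content` marks headings with lines like `== Title ==` and `=== Subtitle ===` (worth confirming by printing `x.content`); `split_into_sections` is an illustrative name and the sample text is invented.

```python
import re

# A heading line: the same run of "=" on both sides of the title
SECTION_HEADING = re.compile(r"^(=+)\s*(.+?)\s*\1\s*$", flags=re.MULTILINE)


def split_into_sections(content: str) -> dict[str, str]:
    """Map each heading title to the text beneath it; text before the first heading goes under 'Summary'."""
    sections = {}
    title, position = "Summary", 0
    for match in SECTION_HEADING.finditer(content):
        # Close the previous section at the start of this heading
        sections[title] = content[position:match.start()].strip()
        title, position = match.group(2), match.end()
    sections[title] = content[position:].strip()
    return sections


sample = "Intro text.\n\n\n== Life ==\nBorn in Stagira.\n\n\n=== Early years ===\nStudied under Plato.\n\n\n== Works ==\nMany."
for name, body in split_into_sections(sample).items():
    print(f"{name}: {body}")
```

Note that repeated heading titles would overwrite each other in this sketch, and subsection text is stored under the subsection's own title rather than nested under its parent.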
