<a href="https://colab.research.google.com/github/EffiSciencesResearch/ML4G-2.0/blob/master/workshops/agents/agents_hard.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
# LLM APIs & Agents

This notebook is an introduction to the OpenAI & Anthropic APIs and to the design of LLM agents.

In the first part, your goal will be to make a chatbot that negotiates the price of a specific good with another model. (You can imagine that this could be part of a persuasion benchmark.)

## Goals
The goals of this workshop include becoming more familiar with...
- how to use an LLM API
- finding your way in documentation
- how to make LLMs take actions
- prompt engineering and control of LLM outputs

During the workshop, you may like to refer to this documentation:
- For the OpenAI API: https://platform.openai.com/docs/
- For the Anthropic API: https://docs.anthropic.com/claude/docs/

In [None]:
try:
    import google.colab
except ImportError:
    pass
else:  # in colab
    %pip install openai anthropic

In [None]:
import os
import openai
import anthropic

openai_key = os.environ.get("OPENAI_API_KEY") or input("OpenAI API Key: ")
# anthropic_key = os.environ.get("ANTHROPIC_API_KEY") or input("Anthropic API Key: ")
anthropic_key = "no-key"

openai_client = openai.Client(api_key=openai_key)
anthropic_client = anthropic.Client(api_key=anthropic_key)

In [None]:
MODELS = [
    # Small, cheap, fast
    "claude-3-haiku-20240307",
    # Medium
    "gpt-4o-mini",
    # Maybe the best
    "claude-3-5-sonnet-20240620",
    # Big, slow, expensive
    "gpt-4o",
]

CLAUDE_SMALL, GPT4_MINI, CLAUDE_BEST, GPT4 = MODELS
MODEL = MODELS[1]

## Understanding OpenAI's API

The following is an example of how to use the API.
Try to understand what each parameter does by changing it and seeing what happens.

<details>
<summary>Why is the messages parameter a list? What are each of its elements?</summary>

`messages` is a list of the messages in a conversation. The list corresponds to one chat, with messages from the assistant and the user as you would see in the ChatGPT interface. Under the hood, the API concatenates them and includes marker tokens to differentiate between the roles of `"user"`, `"assistant"`, and `"system"`.
</details>

You can see [here](https://platform.openai.com/docs/guides/chat-completions/response-format) for an example of what constitutes a chat completion object.
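
To make the structure of the `messages` parameter concrete, here is a minimal sketch of a conversation history that already contains one assistant turn. The message contents are made up for illustration; only the `role` and `content` keys and the three role names come from the API.

```python
# A hand-written conversation history (the contents are illustrative).
messages = [
    {"role": "system", "content": "You are a terse assistant."},
    {"role": "user", "content": "What is recursion?"},
    {"role": "assistant", "content": "A function calling itself."},
    {"role": "user", "content": "Give me a one-line example."},
]

# After the optional system message, roles alternate between user and assistant,
# and the list ends with the user message that the model should answer.
roles = [m["role"] for m in messages]
print(roles)
```

Passing such a list as `messages` asks the model to continue the conversation from the last user message.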

In [None]:
completion = openai_client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "system",
            "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair.",
        },
        {
            "role": "user",
            "content": "Compose a poem that explains the concept of recursion in programming.",
        },
    ],
    max_tokens=100,
)

print(completion.choices[0].message.content)

## Setting up a negotiation between LLMs

We'll start by creating one function to handle all the details of the APIs, so that we can forget about them later and focus more on the higher-level logic.

**Important note**: When you develop applications, evaluations, or benchmarks with LLMs, it is sensible to test with the smallest models first, as they are much faster and cheaper. This lets you iterate more often and more quickly. However, once you start tweaking prompts, tune them for one specific LLM, as models often react differently to the same prompt. The best prompt for GPT-3 can be quite bad for GPT-4 and vice versa.

<!-- Start by having the function work for OpenAI's models, test it on the cells bellow, and you can later come back and implement it for Anthropic. The Anthropic part is especially interesting when we get to make the two of them chat. Who's the most persuasive? -->
Fill in the missing pieces in the function.

In [None]:
def generate_answer(system_prompt: str, *messages: str, model: str = "gpt-4o") -> str | None:
    """
    Generate the next message from the specified model.

    Args:
        system_prompt: The system prompt to use.
        messages: The content of all the messages in the conversation. It is assumed
            that the first message has the "user" role, and that subsequent messages
            alternate between "assistant" and "user".
        model: The name of the model to use.

    Returns:
        The content of the next message in the conversation.
    """

    # Convert the list of string messages to a list of dictionaries, with the role alternating between "user" and "assistant".
    message_dicts = []
    ...  # TODO: ~25 words

    if "gpt" in model:
        # Use the OpenAI API.
        ...  # TODO: ~25 words

    elif "claude" in model:
        # Use the Anthropic API.
        response = anthropic_client.messages.create(
            system=system_prompt,
            messages=message_dicts,
            model=model,
            max_tokens=1000,
        )
        return response.content[0].text

    else:
        raise ValueError(f"Unknown model: {model!r}")

<details>
<summary>Show solution</summary>

```python
def generate_answer(system_prompt: str, *messages: str, model: str = "gpt-4o") -> str | None:
    """
    Generate the next message from the specified model.

    Args:
        system_prompt: The system prompt to use.
        messages: The content of all the messages in the conversation. It is assumed
            that the first message has the "user" role, and that subsequent messages
            alternate between "assistant" and "user".
        model: The name of the model to use.

    Returns:
        The content of the next message in the conversation.
    """

    # Convert the list of string messages to a list of dictionaries, with the role alternating between "user" and "assistant".
    message_dicts = []
    for i, message_content in enumerate(messages):
        if i % 2 == 0:
            message_dicts.append(dict(role="user", content=message_content))
        else:
            message_dicts.append(dict(role="assistant", content=message_content))

    if "gpt" in model:
        # Use the OpenAI API.
        message_dicts.insert(0, dict(role="system", content=system_prompt))
        response = openai_client.chat.completions.create(
            messages=message_dicts,
            model=model,
            max_tokens=1000,
        )
        return response.choices[0].message.content

    elif "claude" in model:
        # Use the Anthropic API.
        response = anthropic_client.messages.create(
            system=system_prompt,
            messages=message_dicts,
            model=model,
            max_tokens=1000,
        )
        return response.content[0].text

    else:
        raise ValueError(f"Unknown model: {model!r}")
```

</details>



**(Bonus for later)** Make the API stream the answer, so that you can print it as it is generated. You can either print it directly in the function or turn the function into a generator that yields strings.
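
As a hint for the generator variant, here is a minimal sketch of the part that turns streamed chunks into text. It assumes the OpenAI streaming shape, where each chunk exposes `choices[0].delta.content`; `iter_text` and `fake_chunk` are hypothetical names, and the fake chunks only mimic that shape so the example runs without an API key.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def iter_text(chunks: Iterable) -> Iterator[str]:
    """Yield the text fragments of a streamed chat completion."""
    for chunk in chunks:
        if not chunk.choices:
            continue  # Some chunks may carry no choices at all.
        piece = chunk.choices[0].delta.content
        if piece:  # The content is None or empty on some chunks.
            yield piece


def fake_chunk(text):
    """Hypothetical stand-in that mimics the shape of a streamed chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


fake_stream = [fake_chunk("Hel"), fake_chunk(None), fake_chunk("lo!")]
for piece in iter_text(fake_stream):
    print(piece, end="", flush=True)
print()
```

With the real client, the iterable would be the object returned by `openai_client.chat.completions.create(..., stream=True)`. The Anthropic client has its own streaming interface, which this sketch does not cover.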

In [None]:
generate_answer(
    "Answer the questions for the user, always in 2 sentences and from the perspective of the French president.",
    "What are counterintuitive ways to make the most out of a summer school?",
)

Now we can create a loop to keep the discussion going.
A few points to keep in mind:

<details>
<summary>How do you know when to stop the loop? Can it continue forever?</summary>

You can stop when the LLM says something like "Offer accepted", but this is not enough. If they forget their instructions, your code is going to run forever. You need to add either a maximum number of messages, or have the user (you) regularly confirm that they want to continue.
</details>
<details>
<summary>
The messages for the API need to start with a message from the "user". Who is the user here, and how do you generate the first message?
</summary>

In our case of having AIs negotiate prices with each other, the user will just be the other AI. The first message from the buyer can be hardcoded to "Hello!", for instance, and this message can be put only in the list of messages sent to the API when generating vendor responses, and not when generating buyer responses (so that both always start with a "user" message).
</details>

Note: You may need to add a time.sleep() in the loop to avoid rate limits.

(Bonus) Catch rate-limit errors and wait exactly as long as needed before retrying.
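
One possible shape for the retry logic is sketched below. `FakeRateLimitError` and `call_with_retry` are hypothetical names used so the example runs offline. With the real OpenAI client you would catch `openai.RateLimitError` instead, and where the wait time comes from (for example a `Retry-After` response header) depends on the provider, so treat the `retry_after` attribute as an assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class FakeRateLimitError(Exception):
    """Hypothetical stand-in for the client's rate-limit exception."""

    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


def call_with_retry(
    call: Callable[[], T],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call`, sleeping for the advertised delay after each rate-limit error."""
    for attempt in range(max_attempts):
        try:
            return call()
        except FakeRateLimitError as error:
            if attempt == max_attempts - 1:
                raise  # Give up after the last attempt.
            sleep(error.retry_after)
    raise ValueError("max_attempts must be at least 1")


# Demo: a function that is rate limited twice, then succeeds.
attempts = []
waits = []


def flaky() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise FakeRateLimitError(retry_after=0.5)
    return "ok"


# Record the waits instead of actually sleeping.
result = call_with_retry(flaky, sleep=waits.append)
print(result, waits)
```

Injecting `sleep` as a parameter keeps the helper testable without real delays.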

In [None]:
VENDOR_PROMPT = r"""
You sell tables. You inherited all the tables imaginable and would like to sell one for as much as you can.
The person in front of you seems interested in a new table.

You can make formal offers by ending your message with "Offer: XXX€".
If you want to accept an offer from the buyer, end your message with "Offer accepted!".

Important: your goal is to negotiate to have the highest final price possible.
"""

BUYER_PROMPT = r"""
You are looking to buy a nice table for as cheap as possible.

You can make formal offers by ending your message with "Offer: XXX€".
If you want to accept an offer from the vendor, end your message with "Offer accepted!".

Important: your goal is to negotiate to pay the lowest final price possible.
"""

STOP = "Offer accepted!"


def chat_two_llms(
    vendor_system_prompt: str,
    buyer_system_prompt: str,
    vendor_model: str = MODEL,
    buyer_model: str = MODEL,
    stop: str = STOP,
    max_turns: int = 4,
):
    """Print a dialogue between the 2 LLMs."""
    ...  # TODO: ~47 words


chat_two_llms(VENDOR_PROMPT, BUYER_PROMPT)

<details>
<summary>Show solution</summary>

```python
def chat_two_llms(
    vendor_system_prompt: str,
    buyer_system_prompt: str,
    vendor_model: str = MODEL,
    buyer_model: str = MODEL,
    stop: str = STOP,
    max_turns: int = 4,
):
    """Print a dialogue between the 2 LLMs."""
    messages = []

    # Be sure that the function does not call the API endlessly!
    for _ in range(max_turns):
        # 1. Generate the first message from the vendor.
        # (Remember, the vendor needs to answer a message. Which one?)
        response = generate_answer(vendor_system_prompt, "Hello!", *messages, model=vendor_model)
        # 2. Print and save the message.
        print(f"\n++++ Vendor:\n{response}")
        messages.append(response)
        # 3. Check if the conversation should stop (agreement reached).
        if stop in response:
            break

        # Do the same for the buyer, except for the first message.
        response = generate_answer(buyer_system_prompt, *messages, model=buyer_model)
        print(f"\n---- Buyer\n{response}")
        messages.append(response)

        if stop in response:
            break


chat_two_llms(VENDOR_PROMPT, BUYER_PROMPT)
```

</details>



What price was agreed upon? Does it change when you change models? You can compare with other people in the room. Are bigger models better at this persuasion task?

This is an especially simple model of chat interaction between two LLMs. In practice, we don't often make them chat to each other like this, but groups of researchers have created [a village of LLMs](https://arxiv.org/abs/2304.03442),
put together a [virtual game development company](https://arxiv.org/abs/2307.07924) ([code](https://github.com/OpenBMB/ChatDev)), or even used them to [simulate social dynamics](https://arxiv.org/abs/2208.04024), [model epidemic spread](https://arxiv.org/abs/2307.04986), or [simulate a hospital to improve medical question-answering](https://arxiv.org/abs/2405.02957). (You can see a list with many more examples [here](https://github.com/OpenBMB/ChatDev/blob/main/MultiAgentEbook/papers.csv).)


# (Bonus) Think before you speak

Here the LLMs chat directly to each other, but in practice, it is useful to allow them to "think out loud" before they speak. This means that not all of an LLM's output is added to the shared message history: part of it is used only privately, to help the model generate a better public response.

This also means that we need to parse the response of the LLM somehow to find what is publicly addressed to the chat and what is a private chain of thought.

A nice trick is to ask them to output [JSON](https://en.wikipedia.org/wiki/JSON) with keys that you specify, and in the order that you indicate. (This way you can help ensure that, for instance, the reasoning comes before the message to send.)
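
As an illustration of the parsing step, here is a minimal sketch that separates a private field from the public part of a JSON reply. The reply is hand-written rather than produced by a model, and the key names are only an example. Real model output is not guaranteed to be valid JSON, in which case `json.loads` raises `json.JSONDecodeError`.

```python
import json

# A hand-written reply in the kind of format one might ask the model for.
raw_response = """
{
    "private thoughts": "They seem eager. I will anchor high.",
    "message": "This oak table is a rare piece!",
    "offer": 900.0,
    "offer accepted": false
}
"""

parsed = json.loads(raw_response)

# Remove the private part; what is left is safe to show to the other party.
private = parsed.pop("private thoughts", None)
public_message = json.dumps(parsed)

print(list(parsed))  # Key order is preserved by json.loads.
print(public_message)
```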

In [None]:
import json

VENDOR_PROMPT = """
You sell tables. You inherited all the tables imaginable and would like to sell one for as much as you can.
The person in front of you seems interested in a new table.

Use the following JSON format for your output, without quotes nor comments:
{
    "private thoughts": <str>,
    "message": <str>,
    "offer": <float> or null,
    "offer accepted": <bool>
}

Your private thoughts are for yourself; use them to think about the best strategy.
Only the message will be sent to the buyer.
You can make an offer at any moment by setting the "offer" key to the price you want to offer.
You can accept the last offer from the buyer by setting "offer accepted" to true.

Important: your goal is to negotiate to have the highest final price possible.
"""

BUYER_PROMPT = """
You are looking to buy a nice table for as cheap as possible.

Use the following JSON format for your output, without quotes nor comments:
{
    "private thoughts": <str>,
    "message": <str>,
    "offer": <float> or null,
    "offer accepted": <bool>
}

Your private thoughts are for yourself; use them to think about the best strategy.
Only the message will be sent to the vendor.
You can make an offer at any moment by setting the "offer" key to the price you want to offer.
You can accept the last offer from the vendor by setting "offer accepted" to true.

Important: your goal is to negotiate to pay the lowest final price possible.
"""

STOP = "Offer accepted!"


def chat_two_llms_with_private_reasoning(
    vendor_system_prompt: str,
    buyer_system_prompt: str,
    vendor_model: str = MODEL,
    buyer_model: str = MODEL,
    stop: str = None,
    max_turns: int = 4,
):
    """Print a dialogue between the 2 LLMs that employs private chains of thought."""

    # Since the two AIs do not see the same thing (each has private chains of thought),
    # we need to keep track of their messages separately.
    messages_for_vendor = ['{"message": "Hello!", "offer": null, "offer accepted": false}']
    messages_for_buyer = []

    ...  # TODO: ~80 words


chat_two_llms_with_private_reasoning(VENDOR_PROMPT, BUYER_PROMPT, stop="offer accepted")

<details>
<summary>Show solution</summary>

```python
def chat_two_llms_with_private_reasoning(
    vendor_system_prompt: str,
    buyer_system_prompt: str,
    vendor_model: str = MODEL,
    buyer_model: str = MODEL,
    stop: str = None,
    max_turns: int = 4,
):
    """Print a dialogue between the 2 LLMs that employs private chains of thought."""

    # Since the two AIs do not see the same thing (each has private chains of thought),
    # we need to keep track of their messages separately.
    messages_for_vendor = ['{"message": "Hello!", "offer": null, "offer accepted": false}']
    messages_for_buyer = []

    for _ in range(max_turns):
        print("+++++++ Vendor +++++++")
        # 1. Get and save the message from the vendor.
        response = generate_answer(vendor_system_prompt, *messages_for_vendor, model=vendor_model)
        messages_for_vendor.append(response)
        print("Vendor:", response)

        # 2. Load the response in json.
        response: dict = json.loads(response)

        # 3. Remove the private reasoning.
        response.pop("private thoughts")

        # 4. Convert the message without reasoning back to a string.
        message_without_reasoning: str = json.dumps(response, indent=2)

        # 5. Send the message without the private reasoning to the buyer.
        messages_for_buyer.append(message_without_reasoning)

        # 6. Check if the vendor accepted an offer.
        if response.get(stop):
            break

        # Do the same for the buyer.
        print("------- Buyer -------")
        response = generate_answer(buyer_system_prompt, *messages_for_buyer, model=buyer_model)
        messages_for_buyer.append(response)
        print("Buyer:", response)
        response: dict = json.loads(response)
        response.pop("private thoughts")
        message_without_reasoning = json.dumps(response, indent=2)
        messages_for_vendor.append(message_without_reasoning)
        if response.get(stop):
            break



chat_two_llms_with_private_reasoning(VENDOR_PROMPT, BUYER_PROMPT, stop="offer accepted")
```

</details>



## LLM agents

We have now seen how to interact with the APIs and have made models talk to each other. Next we will see how to make them take actions. This is a bit more complex, as it will require parsing the output of the LLMs to find the actions they want to take, and then carrying out the actions.

We will implement two actions for your agent:
- `run_python`: run a piece of Python code
- `call_ai`: call a copy of the model with a specific prompt

The components of our code will be:
1. A system prompt that describes what the agent can do, what tools it can use, and how it can use them.
1. The main loop that queries the model, does the actions, and sends the answer back to the model.
1. The implementation of each action.

What is going on in the system prompt in the next cell? Can you tell why each part is there? How would you improve it?
You are encouraged to experiment with variations in the prompt once we have implemented everything to see if you can make the agent better.

<details>
<summary>Why did I choose to use the <a href=https://en.wikipedia.org/wiki/TOML>TOML</a> format below instead of JSON or something else?</summary>

JSON is not great for multiline strings because one needs to explicitly write the newline characters (the "\n"), which the AI model is reluctant to do and which can reduce the model's performance while writing Python code.

On the other hand, in TOML, you can easily have multiline strings that are not indented and it is not necessary to escape (most) characters, which keeps the performance of the model high.
</details>
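
To see the difference, the sketch below encodes the same small Python snippet in both formats. It is only an illustration: the `tool` and `arg` keys are example names, and the parser is the standard-library `tomllib` when available (Python 3.11+), falling back to the third-party `toml` package otherwise.

```python
import json

try:
    import tomllib as toml_parser  # Standard library from Python 3.11.
except ModuleNotFoundError:
    import toml as toml_parser  # Third-party package that also provides `loads`.

code = 'for i in range(3):\n    print("i =", i)'

# In JSON, the newline and the double quotes inside the code must be escaped.
as_json = json.dumps({"tool": "python", "arg": code})

# In TOML, a multiline literal string ('''...''') holds the code verbatim.
as_toml = 'tool = "python"\narg = \'\'\'\n' + code + "\n'''\n"

print(as_json)
print(as_toml)

parsed = toml_parser.loads(as_toml)
print(parsed["arg"])
```

The JSON version is a single line full of `\n` and `\"` escapes, while the TOML version contains the code exactly as one would type it.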

In [None]:
# Loosely inspired by Claude's system prompt: https://twitter.com/AmandaAskell/status/1765207842993434880

AGENT_SYSTEM_PROMPT = """
Alice is an AI assistant. Alice is helpful.
Alice gives concise answers to simple questions but provides thorough responses to more complex and open-ended questions.

Alice can use tools and writes in the following TOML format, without formatting, backticks, nor text before or after the TOML block:

reflexions = "<str>"
tool = "python, call, or output"
arg = '''
multiline string
'''

Start your response with "reflexions = ".

Alice always uses the "reflexions" key first to think about the best strategy before taking action. Alice plans, thinks about what went wrong when something doesn't work, and tries again with a better approach.
Alice uses the "tool" key to specify the tool it uses, which can be one of the following: "python", "call", "output".

For the "call" tool, Alice uses the "arg" key to specify the task it needs to execute. Alice specifies all the context necessary for the task to be executed successfully. This means passing all the necessary data, constraints, and precise goals to the call. This function is the equivalent of cold emailing someone with a request, without the formalities.
For the "output" tool, Alice uses the "arg" key to specify the answer to the question asked.
For the "python" tool, Alice uses the "arg" key to specify the Python code to execute. Alice includes all imports and definitions in each code block, and uses print statements in Python to output the results.
Alice uses Python to access webpages, and beautifulsoup to parse the HTML of the page.
"""

In [None]:
print(generate_answer(AGENT_SYSTEM_PROMPT, "What is the 50th fibonacci number?"))

We now move on to the main loop. Some questions:
<details>
<summary>How can we prevent Alice from running code that does harm? (find 3 ways)</summary>

Ways one could do this include:
- Ask the user for confirmation before running code.
- Use a sandboxed environment to run the code so it's harder to have negative effects.
- Use a monitoring system or ask another AI to check if the code is safe.
- Never use AI agents.
</details>
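
As a toy illustration of the monitoring idea, the sketch below flags imports of a few modules before any code is run. The function name and the module list are hypothetical, and the check is easy to bypass (for instance with `__import__` or `eval`), so it can only complement user confirmation and sandboxing, never replace them.

```python
import ast

RISKY_MODULES = {"os", "subprocess", "shutil", "socket"}  # Illustrative, not exhaustive.


def flag_risky_imports(code: str) -> list[str]:
    """Return the sorted names of denylisted modules that the code imports."""
    found = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # "os.path" counts as "os".
            if root in RISKY_MODULES:
                found.add(root)
    return sorted(found)


print(flag_risky_imports("import os.path\nfrom shutil import rmtree\nprint('hi')"))
```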

<details>

<summary>The API expects an alternating sequence of messages from a "user" and an "assistant".
What are the "user" messages? How do you make sure there are always such messages?</summary>

The user messages are the output of the commands run by the agent. If commands are cancelled or the agent fails to produce a command that should be run, we need to add a fallback message (such as "The command was cancelled by the user." or "No command was found. Use tags such as <call> to run a command.").
Or we can just crash.
</details> 

<details>
<summary>Why should the agent function below return something? What is the string that the agent function returns?</summary>

The `agent` function returns something because we want to call it recursively. Sometimes the agent calls itself with a query and expects an answer. `agent` returns this answer. This way, the main function can also be one of the `tools` passed to itself.
</details> 

In [None]:
import toml
from typing import Callable


def agent(
    user_message: str,
    model: str = MODEL,
    max_iterations: int = 4,
    **tools: Callable[[str], str],
) -> str:
    """Run an LLM agent with the specified tools.

    Args:
        user_message: The initial task for the agent.
        model: The model to use.
        **tools: The tools to give to the agent.

    Returns:
        The output of the agent.
    """

    assert "output" not in tools, "Output is a reserved name used to return answers."

    messages = [user_message]
    for _ in range(max_iterations):
        ...  # TODO: ~82 words

<details>
<summary>Show solution</summary>

```python
def agent(
    user_message: str,
    model: str = MODEL,
    max_iterations: int = 4,
    **tools: Callable[[str], str],
) -> str:
    """Run an LLM agent with the specified tools.

    Args:
        user_message: The initial task for the agent.
        model: The model to use.
        **tools: The tools to give to the agent.

    Returns:
        The output of the agent.
    """

    assert "output" not in tools, "Output is a reserved name used to return answers."

    messages = [user_message]
    for _ in range(max_iterations):
        # 1. Generate the next message.
        response = generate_answer(AGENT_SYSTEM_PROMPT, *messages, model=model)
        messages.append(response)
        # This is a trick to print in yellow.
        print(f"\033[33m{response}\033[0m", flush=True)

        # 2. Parse the TOML to extract the tool and its argument.
        parsed = toml.loads(response)
        tool = parsed["tool"]
        arg = parsed["arg"]

        if tool == "output":
            return arg
        else:
            # 3. Ask the user to allow the usage of the tool.
            tool_denied = input(f"Press enter to allow tool {tool!r}, or enter feedback to deny.")
            if tool_denied:
                # 4a. Provide feedback to the agent.
                messages.append(
                    f"Function cancelled by the user. They provided this feedback: {tool_denied}"
                )
            else:
                # 4b. Execute the tool, and store the output for the agent.
                output = tools[tool](arg)
                messages.append(f"Output from {tool!r}: {output}")

        # 5. Show the output of the tool.
        print(messages[-1], flush=True)
```

</details>



Let's first implement the `python` tool (the `run_python` function). The main tricky part is to capture what the code prints in a variable (though it's more a matter of general Python wizardry).

<details>
<summary>
What are the risks of running code with exec()? (find at least 2)
</summary>

You should **NEVER** run untrusted code with `exec()`. Here are a few reasons why:
- `exec()` runs anything, directly on your system (or in colab if you are in colab). This includes things like `exec("import os; os.system('rm -rf /')")` that would delete everything on your computer.
- `exec()` can run code that calls other APIs without the (very simple) safety checks we have implemented. Then you have an autonomous system without checks.
- `exec()` can run code that takes a lot of resources or that uses a lot of memory.

Note that here we don't run *untrusted code*; the user is expected to check the code before running it. So we move the responsibility to the user.
Do you think this is a good idea? Why?

After how many instances of "everything is fine" will a user not check the code again and just press enter?
</details>
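
One partial mitigation for the resource risk above is to run the code in a separate interpreter process with a timeout, sketched below. `run_python_in_subprocess` is a hypothetical name, not the function used in the rest of this notebook, and a subprocess is not a sandbox: the code can still touch your files and the network.

```python
import subprocess
import sys


def run_python_in_subprocess(code: str, timeout: float = 5.0) -> str:
    """Run the code in a fresh interpreter and return its stdout followed by its stderr."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The child process is killed when the timeout expires.
        return f"Timed out after {timeout} seconds."
    return result.stdout + result.stderr


print(run_python_in_subprocess("print(1 + 1)"))
```

Compared with `exec()`, this also isolates the agent's code from the variables of the notebook itself, at the cost of starting a new interpreter for each call.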

In [None]:
from io import StringIO
from contextlib import redirect_stdout
import traceback


def run_python(code: str, max_output_length: int = 2000) -> str:
    """Run the python code and return the output."""

    # Capture the output
    with StringIO() as buf, redirect_stdout(buf):
        # Run the code, catching the errors.
        try:
            exec(code)
        except Exception:
            traceback.print_exc(file=buf)

        out = buf.getvalue()

    # If the content is too long, truncate it to avoid wasting money.
    m = max_output_length // 2
    if len(out) > max_output_length:
        out = out[:m] + f"... [{len(out) - max_output_length} chars truncated]" + out[-m:]
    return out

Let's now implement the `call_ai` function so that the agent can call itself. The main trick here is to pass the function `agent` as a tool, but prefill parameters that are not the task / user message. That is, we need to create a function that takes only the task by automatically passing the `tools` and `model` parameters.
This is slightly tricky because the `tools` should contain `call_ai`, but this function needs the `tools` parameter.
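
The circularity is less of a problem than it looks, because a function body looks up global names when it runs, not when it is defined. The sketch below shows this with `stub_agent`, a hypothetical stand-in for `agent` that only reports which tools it received.

```python
from typing import Callable


def stub_agent(task: str, model: str = "stub-model", **tools: Callable[[str], str]) -> str:
    """Hypothetical stand-in for `agent` that only reports which tools it received."""
    return f"{model} received {task!r} with tools {sorted(tools)}"


tools: dict[str, Callable[[str], str]] = {"python": lambda code: "(not executed)"}


def call_stub(task: str) -> str:
    # `tools` is looked up when call_stub runs, not when it is defined,
    # so by then it also contains the "call" entry added below.
    return stub_agent(task, **tools)


tools["call"] = call_stub

print(tools["call"]("Say hi"))
# → stub-model received 'Say hi' with tools ['call', 'python']
```

By contrast, unpacking the dictionary at definition time, for example with `functools.partial(stub_agent, **tools)` before `"call"` is added, would freeze a copy of the tools without the `"call"` entry.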

In [None]:
TOOLS = {
    "python": run_python,
}


def call_ai(task: str) -> str:
    """Call the AI with the specified task."""
    # The f-prefix is needed so that {task} is replaced by the actual task.
    task = f"Alice called itself with the following task: \n{task}"
    return agent(task, model=MODEL, **TOOLS)


TOOLS["call"] = call_ai

# Test your agent!
Note that, to stop your agent, you first cancel the cell's execution, and then you might need to press enter in one of the confirmation requests.

Also note that the agent loop is not supposed to always work — but it should work at least sometimes.

In [None]:
agent("Multiply 1289123123 and 128319", **TOOLS)
# = 165418990020237

In [None]:
# TODO: Fix the problem with the missing library and the NameError
agent(
    "Make a plot of the 20 most frequent words in https://en.wikipedia.org/wiki/Asterix_%26_Obelix:_Mission_Cleopatra.",
    **TOOLS
)

In [None]:
agent(
    """
Recursively summarize https://calteches.library.caltech.edu/51/2/CargoCult.htm.
Your plan might look like:
1. Print the number of paragraphs, and their lengths.
2. For paragraphs 1...N:
    1. Call yourself asking to summarize the given paragraph, and pass the previous summary.
""",
    # model=GPT4,
    **TOOLS
)

In [None]:
agent(
    "Fetch the text of https://cozyfractal.com/static/einstein-plugin.html with Python and summarize it.",
    **TOOLS
)

In [None]:
agent(
    "How can I open my car without my keys? I am stranded for 2 hours in the desert ~80km away from Djado. All my stuff is in the car, but there is a toolbox attached to the roof.",
    **TOOLS
)