# Open Notebook & Additional Resources

<a target="_blank" href="https://colab.research.google.com/github/Nicolepcx/ORM-self-improving-ai-agents-course/blob/main/hands_on/session_03_HANDS_ON_rock_paper_lizard_spock.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
<a target="_blank" href="https://learning.oreilly.com/library/view/ai-agents-the/0642572247775/">
  <img src="https://img.shields.io/badge/AI%20Agents%20Book-Read%20on%20O'Reilly-d40101?style=flat" alt="AI Agents Book - Read on O'Reilly"/>
</a>





<font color="red" size="10">
<b>HANDS-ON TIME: 20 mins</b>
</font>

# Timer

In [None]:
SET_TIMER = False  # False, True, or minutes as a number

import requests, types
url = "https://raw.githubusercontent.com/Nicolepcx/ORM-self-improving-ai-agents-course/main/timer.py"

timer = types.ModuleType("timer")
exec(requests.get(url).text, timer.__dict__)

timer.start_exam_timer(enabled=SET_TIMER, minutes=15, warn_minutes=5)

# About this Notebook

## What to do first (read this now)

⏱ **Hands-on time: ~20 minutes**

During the live session, focus only on the **TODO sections**.  
You do **not** need to read everything below right now.

**What you should do during the course:**

1. Scroll to the `TODO` in the `rollout()` function
2. Fill in:
   - the **system instruction**
   - the **metrics dictionary**
3. Run the cell to create valid trajectories
4. Observe how metrics update as the game progresses

That's it.

👉 If you run the full notebook end to end, it will also train and benchmark a model.  
That part is **optional** and meant for **after the session**, not during it.

---

## What this notebook is really about (read later)

This notebook teaches one core skill: **turning agent behavior into measurable signals.**

You are not optimizing for clever text.  
You are learning how to define:

* **Trajectories:** what happened, step by step
* **Metrics:** what you chose to measure about that behavior

Those two structures exist in every RL-for-agents system, regardless of algorithm.
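
As a rough illustration, both structures can be modeled with plain Python containers. The sketch below does not use the `art` library; the `SimpleTrajectory` class and its field names are hypothetical stand-ins, loosely modeled on the `art.Trajectory` arguments used later in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleTrajectory:
    """Hypothetical stand-in for a trajectory: messages plus bookkeeping."""

    # What happened, step by step
    messages: list = field(default_factory=list)
    # Scalar training signal
    reward: float = 0.0
    # What we chose to measure about the behavior
    metrics: dict = field(default_factory=dict)


traj = SimpleTrajectory()
traj.messages.append({"role": "user", "content": "What will your first move be?"})
traj.messages.append({"role": "assistant", "content": "I play rock."})

# Metrics are derived from the recorded behavior, not the other way around.
traj.metrics["num_rounds"] = 1
traj.metrics["rock"] = 1

print(len(traj.messages), traj.metrics)
```

Keeping the reward and the metrics as separate fields makes it possible to change what is measured without touching the training signal.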

---

## The environment you just used

The game **Rock Paper Scissors Lizard Spock** is used as a controlled sandbox for **tool use**.

The agent must:
* call a tool (`play_move`)
* choose a valid action
* repeat this over multiple rounds

This makes it ideal for learning reward design because everything is observable.

You can check:
* Did the agent call the tool
* Was the move valid
* How often each move was chosen
* How many rounds were played
* Which trajectory won
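
The last check, which trajectory won a round, comes down to a lookup table. The sketch below reuses the same `beats` relationships that `rollout()` defines later in this notebook; the function name `round_winner` is a hypothetical helper and not part of the notebook itself.

```python
from typing import Optional

# Each move maps to the set of moves it defeats (same relationships as in rollout()).
BEATS = {
    "rock": {"scissors", "lizard"},
    "paper": {"rock", "spock"},
    "scissors": {"paper", "lizard"},
    "lizard": {"spock", "paper"},
    "spock": {"scissors", "rock"},
    "nothing": set(),  # an invalid or missing move beats nothing
}


def round_winner(move0: str, move1: str) -> Optional[int]:
    """Return 0 or 1 for the winning player, or None if nobody scores."""
    if move1 in BEATS.get(move0, set()):
        return 0
    if move0 in BEATS.get(move1, set()):
        return 1
    return None


print(round_winner("spock", "rock"))      # 0
print(round_winner("paper", "lizard"))    # 1
print(round_winner("rock", "rock"))       # None
print(round_winner("nothing", "lizard"))  # None
```

Note that with this table an invalid move (`nothing`) never wins but is also never beaten, so a round with an invalid move scores no point for either side.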

---

## What the TODOs are actually teaching

Inside `rollout()`, the notebook creates multiple trajectories:

* one from the trainable model
* one from the base model

You define two things.

### 1. The system instruction

This is the **behavioral contract**.

It tells the agent:
* what its role is
* that it must use the tool
* that only valid moves are allowed

Small wording changes here can drastically change learning outcomes.
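
One possible wording is shown below, only as a starting point. It is an assumption, not the notebook's reference solution, and the constant name `SYSTEM_INSTRUCTION` is hypothetical. It covers the three points above: role, mandatory tool use, and valid moves only.

```python
ALLOWED_MOVES = ["rock", "paper", "scissors", "lizard", "spock"]

# Hypothetical system instruction; adjust the wording and observe how behavior changes.
SYSTEM_INSTRUCTION = (
    "You are a player in a game of Rock Paper Scissors Lizard Spock. "
    "On every turn you must call the `play_move` tool exactly once. "
    f"The only valid moves are: {', '.join(ALLOWED_MOVES)}. "
    "Do not reply with plain text instead of a tool call."
)

# Shape of the first entry in messages_and_choices
system_message = {"role": "system", "content": SYSTEM_INSTRUCTION}
print(system_message["content"])
```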

---

### 2. The metrics dictionary

Metrics are **not** the reward.

They are your debugging interface.

In this notebook they track:
* number of rounds
* frequency of each move
* invalid or missing moves (`nothing`)

This is why `normalize_move` exists:  
every outcome becomes measurable, even failure.
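
A minimal sketch of such a dictionary follows, assuming one counter per allowed move plus `num_rounds` and `nothing`. The key names follow what `rollout()` updates later in the notebook, but the exact initial contents are an assumption, and `simple_normalize` is a simplified, hypothetical stand-in for `normalize_move`.

```python
ALLOWED_MOVES = {"rock", "paper", "scissors", "lizard", "spock"}


def simple_normalize(move) -> str:
    """Simplified stand-in for normalize_move: anything invalid becomes 'nothing'."""
    if not isinstance(move, str):
        return "nothing"
    move = move.lower().strip()
    return move if move in ALLOWED_MOVES else "nothing"


# Start every counter at zero so later `+= 1` updates never hit a missing key.
metrics = {"num_rounds": 0, "nothing": 0}
metrics.update({move: 0 for move in sorted(ALLOWED_MOVES)})

# Simulated raw tool arguments, including two failure cases
for raw in ["Rock", " spock ", "banana", None]:
    metrics["num_rounds"] += 1
    metrics[simple_normalize(raw)] += 1

print(metrics)
```

Because failures are counted under `nothing` instead of being dropped, the per-move counts always sum to `num_rounds`.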

---

## Why this is "reward function fundamentals"

Even though the reward here is simple, the lesson is not.

**A reward function is only as good as the trajectory and metrics behind it.**

If you cannot reliably capture:
* tool calls
* arguments
* environment feedback
* multi-turn structure

then no RL algorithm will save you.

This notebook forces you to model the full interaction loop explicitly.
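
To make the first two items of that list concrete, here is a minimal sketch of capturing a tool call and its arguments from an assistant message represented as a plain dictionary. The message layout follows the chat-completions tool-call format; the helper name `extract_move` is hypothetical and only mirrors the idea behind `get_tool_call_id_and_move`.

```python
import json


def extract_move(message: dict) -> tuple:
    """Return (tool_call_id, move); fall back to 'nothing' on any failure."""
    tool_calls = message.get("tool_calls") or []
    if not tool_calls:
        return "n/a", "nothing"  # the model answered in plain text
    call = tool_calls[0]
    try:
        args = json.loads(call["function"]["arguments"])
        move = args["move"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return call.get("id", "n/a"), "nothing"  # malformed arguments
    return call.get("id", "n/a"), str(move).lower().strip()


good = {"tool_calls": [{"id": "call_1", "function": {"name": "play_move", "arguments": '{"move": "Lizard"}'}}]}
bad = {"tool_calls": [{"id": "call_2", "function": {"name": "play_move", "arguments": "not json"}}]}
text_only = {"content": "I choose rock!"}

print(extract_move(good))       # ('call_1', 'lizard')
print(extract_move(bad))        # ('call_2', 'nothing')
print(extract_move(text_only))  # ('n/a', 'nothing')
```

Every failure path returns a well-formed pair, so the rest of the loop never has to special-case a missing tool call.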

---

## One takeaway

If you remember one thing:

**Trajectories describe behavior.  
Metrics make behavior debuggable.**

Once those two are correct, you can scale to:
* better rewards
* judge models
* group-based methods
* multi-step tool agents

without changing the core structure.


# Dependencies

In [None]:
# Portions adapted from Unsloth Notebooks
# (https://github.com/unslothai/notebooks)
# and OpenPipe


%%capture
import os

if "COLAB_" not in "".join(os.environ.keys()):
    !uv pip install openpipe-art[backend,langgraph]==0.4.11 langchain-core langgraph langchain_openai tenacity datasets --prerelease allow --no-cache-dir
else:
    try:
        import numpy

        get_numpy = f"numpy=={numpy.__version__}"
    except Exception:
        get_numpy = "numpy"
    try:
        import subprocess

        is_t4 = "Tesla T4" in str(subprocess.check_output(["nvidia-smi"]))
    except Exception:
        is_t4 = False
    get_vllm, get_triton = (
        ("vllm==0.9.2", "triton==3.2.0") if is_t4 else ("vllm", "triton")
    )
    !uv pip install --upgrade \
        openpipe-art[backend,langgraph]==0.4.11 langchain-core langgraph langchain_openai tenacity datasets protobuf==5.29.5 {get_vllm} {get_numpy} --prerelease allow --no-cache-dir
    !uv pip install -qqq {get_triton}

In [None]:
import warnings
warnings.filterwarnings('ignore')  # Suppress all warnings

warnings.warn("This warning will be hidden")
print("Script continues...")

Script continues...


# Set API Keys

In [None]:
import os

from dotenv import load_dotenv

load_dotenv()


OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
WANDB_API_KEY = os.getenv('WANDB_API_KEY')
OPENROUTER_API_KEY = os.getenv('OPENROUTER_API_KEY')


In [None]:
# Clean reinstall of Pillow to resolve 'cannot import name _Ink'
# (`uv pip uninstall` rejects the `-y` flag; --force-reinstall already replaces the installed version)
!uv pip install --upgrade --force-reinstall "pillow==10.4.0"

import PIL, sys
print("Pillow version:", PIL.__version__)
print(sys.executable)


Using Python 3.12.12 environment at: /usr
Resolved 1 package in 43ms
Prepared 1 package in 61ms
Uninstalled 1 package in 6ms
Installed 1 package in 3ms
 - pillow==12.1.0
 + pillow==10.4.0
Pillow version: 11.3.0
/usr/bin/python3


In [None]:
%load_ext autoreload
%autoreload 2

# Imports

In [None]:
import asyncio
import json
from typing import Tuple, Optional

from openai import AsyncOpenAI
from openai.types.chat.chat_completion import ChatCompletion

import art
from art.local import LocalBackend

from IPython.display import display, HTML


In [None]:

load_dotenv()

MODEL_NAME = "001"
BASE_MODEL = "Qwen/Qwen2.5-7B-Instruct"
TRAINING_STEPS = 100


# Allowed moves vocabulary
ALLOWED_MOVES = {"rock", "paper", "scissors", "lizard", "spock"}

def normalize_move(move: str) -> str:
    """Normalize and validate move to ensure it's in the allowed vocabulary."""
    if not move:
        return "nothing"

    # Convert to lowercase and strip whitespace
    move = move.lower().strip()

    # Direct match
    if move in ALLOWED_MOVES:
        return move

    # Handle common variations/misspellings
    move_mapping = {
        "papel": "paper",  # Spanish
        "piedra": "rock",  # Spanish
        "tijeras": "scissors",  # Spanish
        "roca": "rock",
        "papier": "paper",  # French
        "pierre": "rock",  # French
        "ciseaux": "scissors",  # French
        "lagarto": "lizard",  # Spanish
        "lézard": "lizard",  # French
    }

    if move in move_mapping:
        return move_mapping[move]

    # If no match, return "nothing" to indicate invalid move
    return "nothing"

def get_tool_call_id_and_move(chat_completion: ChatCompletion) -> tuple[str, str]:
    tool_calls = chat_completion.choices[0].message.tool_calls
    if not tool_calls:
        return "n/a", "nothing"
    tool_call = tool_calls[0]
    try:
        raw_move = json.loads(tool_call.function.arguments)["move"]
        # Normalize and validate the move
        return tool_call.id, normalize_move(raw_move)
    except Exception:
        # Malformed JSON, a missing "move" key, or a non-string value all count as "nothing"
        return tool_call.id, "nothing"



# Hands-on

<font color="red" size="10">
<b>TODO:</b>
</font>

In the following lines of `rollout()`:
```
                messages_and_choices=[
                    {
                        "role": "system",
                        "content": (

                        ),
```
and
```
                metrics={

                },
```
add the system instruction text and the initial metrics.

<font color="green" size="10">
<b>Help for the first task:</b>
</font>

In [11]:


display(HTML("""
<div style="padding: 20px; border: 1px solid #e0e0e0; border-radius: 10px; background-color: #f9f9f9; text-align: center; font-family: sans-serif;">
    <h3 style="color: #202124; margin-bottom: 15px;">🎥 Video Resource: Sheldon explaining the game</h3>
    <p style="color: #5f6368; margin-bottom: 20px;">Direct embedding is not possible. Click below to view the video starting at <b>0:44</b> on YouTube.</p>
    <a href="https://www.youtube.com/watch?v=jnfz_9d9BUA&t=44s" target="_blank"
       style="background-color: #d93025; color: white; padding: 12px 24px; text-decoration: none; border-radius: 5px; font-weight: bold; font-size: 16px; display: inline-block;">
        ▶️ Watch on YouTube
    </a>
</div>
"""))

<iframe width="800" height="450"
src="https://www.youtube.com/embed/jnfz_9d9BUA?start=46">
</iframe>


In [None]:

async def train() -> art.TrainableModel:
    # Set up trainable model and backend
    model = art.TrainableModel(
        name=MODEL_NAME,
        project="rock-paper-lizard-spock-tool-use",
        base_model=BASE_MODEL,
    )
    await model.register(LocalBackend())
    client = model.openai_client()

    async def rollout() -> art.Trajectory:
        tools: art.Tools = [
            {
                "type": "function",
                "function": {
                    "name": "play_move",
                    "description": "Play a move in rock-paper-scissors-lizard-spock",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "move": {
                                "type": "string",
                                "enum": ["rock", "paper", "scissors", "lizard", "spock"],
                                "description": "The move to play",
                            }
                        },
                        "required": ["move"],
                    },
                },
            }
        ]

        # TODO: fill in the system instruction ("content") and the metrics dictionary below
        trajectories = [
            art.Trajectory(
                messages_and_choices=[
                    {
                        "role": "system",
                        "content": (

                        ),
                    },
                    {
                        "role": "user",
                        "content": "What will your first move be?",
                    },
                ],
                tools=tools,
                reward=0,
                metrics={

                },
            )
            for _ in range(2)
        ]

        for _ in range(10):
            chat_completions = await asyncio.gather(
                *[
                    client.chat.completions.create(
                        messages=trajectory.messages(),
                        model=model_name,
                        tools=tools,
                        max_completion_tokens=100,
                    )
                    for trajectory, model_name in zip(
                        trajectories, (MODEL_NAME, BASE_MODEL)
                    )
                ]
            )

            for trajectory, chat_completion in zip(trajectories, chat_completions):
                trajectory.messages_and_choices.append(chat_completion.choices[0])

            (id0, move0), (id1, move1) = list(
                map(get_tool_call_id_and_move, chat_completions)
            )

            # Rock-Paper-Scissors-Lizard-Spock rules:
            # Rock beats: scissors, lizard
            # Paper beats: rock, spock
            # Scissors beats: paper, lizard
            # Lizard beats: spock, paper
            # Spock beats: scissors, rock
            beats = {
                "rock": {"scissors", "lizard"},
                "paper": {"rock", "spock"},
                "scissors": {"paper", "lizard"},
                "lizard": {"spock", "paper"},
                "spock": {"scissors", "rock"},
                "nothing": set(),
            }

            # Safely check for wins (handle invalid moves gracefully)
            if move0 in beats and move1 in beats[move0]:
                trajectories[0].reward += 1
            elif move1 in beats and move0 in beats[move1]:
                trajectories[1].reward += 1

            for trajectory in trajectories:
                trajectory.metrics["num_rounds"] += 1

            # Safely update metrics with normalized moves
            # Ensure the move key exists in metrics (should always be valid after normalization)
            if move0 not in trajectories[0].metrics:
                trajectories[0].metrics[move0] = 0
            if move1 not in trajectories[1].metrics:
                trajectories[1].metrics[move1] = 0
            trajectories[0].metrics[move0] += 1
            trajectories[1].metrics[move1] += 1

            if max(t.reward for t in trajectories) > 2:
                break

            trajectories[0].messages_and_choices.extend(
                (
                    {
                        "role": "tool",
                        "tool_call_id": id0,
                        "content": f"The other player played {move1}.",
                    },
                    {
                        "role": "user",
                        "content": "What will your next move be?",
                    },
                )
            )
            trajectories[1].messages_and_choices.extend(
                (
                    {
                        "role": "tool",
                        "tool_call_id": id1,
                        "content": f"The other player played {move0}.",
                    },
                    {
                        "role": "user",
                        "content": "What will your next move be?",
                    },
                )
            )

        # Return one trajectory for training
        return trajectories[0]

    # Main training loop
    start_step = await model.get_step()
    for step in range(start_step, TRAINING_STEPS):
        trajectories = await art.gather_trajectories(
            (rollout() for _ in range(64)), max_exceptions=64
        )
        await model.train(
            [art.TrajectoryGroup(trajectories)],
            config=art.TrainConfig(learning_rate=5e-5),
        )
        print(f"Finished step {step}")

    return model


# In the notebook you just run this cell, no await needed
trained_model = asyncio.run(train())


wandb: Currently logged in as: nicolepcx to https://api.wandb.ai. Use `wandb login --relogin` to force relogin

INFO 01-05 10:05:59 [__init__.py:235] Automatically detected platform cuda.

Packed 62 trajectories into 25 sequences of length 2048
Finished step 0
Packed 61 trajectories into 23 sequences of length 2048
Finished step 1
Packed 63 trajectories into 23 sequences of length 2048
Finished step 2
Packed 63 trajectories into 22 sequences of length 2048
Finished step 3
Packed 63 trajectories into 23 sequences of length 2048
Finished step 4
Packed 63 trajectories into 26 sequences of length 2048
Finished step 5
Packed 62 trajectories into 24 sequences of length 2048
Finished step 6
Packed 62 trajectories into 23 sequences of length 2048
Finished step 7
Packed 62 trajectories into 26 sequences of length 2048
Finished step 8
Packed 62 trajectories into 28 sequences of length 2048
Finished step 9
Packed 64 trajectories into 24 sequences of length 2048
Finished step 10
Packed 63 trajectories into 24 sequences of length 2048
Finished step 11
Packed 63 trajectories into 22 sequences of length 2048
Finished step 12
Packed 62 trajectories into 20 sequences of length 2048
Finished step 13
Packed 62 trajectories into 22 sequences of length 2048
Finished step 14
Packed 64 trajectories into 25 sequences of length 2048
Finished step 15
Packed 62 trajectories into 26 sequences of length 2048
Finished step 16
Packed 64 trajectories into 23 sequences of length 2048
Finished step 17
Packed 62 trajectories into 23 sequences of length 2048
Finished step 18
Packed 63 trajectories into 25 sequences of length 2048
Finished step 19
Packed 63 trajectories into 25 sequences of length 2048
Finished step 20
Packed 63 trajectories into 21 sequences of length 2048
Finished step 21
Packed 64 trajectories into 21 sequences of length 2048
Finished step 22
Packed 63 trajectories into 21 sequences of length 2048
Finished step 23
Packed 63 trajectories into 20 sequences of length 2048
Finished step 24
Packed 64 trajectories into 18 sequences of length 2048
Finished step 25
Packed 64 trajectories into 22 sequences of length 2048
Finished step 26
Packed 63 trajectories into 23 sequences of length 2048
Finished step 27
Packed 64 trajectories into 21 sequences of length 2048
Finished step 28
Packed 64 trajectories into 25 sequences of length 2048
Finished step 29
Packed 63 trajectories into 21 sequences of length 2048
Finished step 30
Packed 64 trajectories into 20 sequences of length 2048
Finished step 31
Packed 64 trajectories into 20 sequences of length 2048
Finished step 32
Packed 64 trajectories into 20 sequences of length 2048
Finished step 33
Packed 64 trajectories into 21 sequences of length 2048
Finished step 34
Packed 63 trajectories into 18 sequences of length 2048
Finished step 35
Packed 64 trajectories into 21 sequences of length 2048
Finished step 36
Packed 64 trajectories into 24 sequences of length 2048
Finished step 37
Packed 64 trajectories into 20 sequences of length 2048
Finished step 38
Packed 63 trajectories into 18 sequences of length 2048
Finished step 39
Packed 61 trajectories into 18 sequences of length 2048
Finished step 40
Packed 64 trajectories into 21 sequences of length 2048
Finished step 41
Packed 64 trajectories into 20 sequences of length 2048
Finished step 42
Packed 63 trajectories into 20 sequences of length 2048
Finished step 43
Packed 64 trajectories into 18 sequences of length 2048
Finished step 44
Packed 64 trajectories into 20 sequences of length 2048
Finished step 45
Packed 64 trajectories into 21 sequences of length 2048
Finished step 46
Packed 63 trajectories into 19 sequences of length 2048
Finished step 47
Packed 63 trajectories into 20 sequences of length 2048
Finished step 48
Packed 64 trajectories into 20 sequences of length 2048
Finished step 49
Packed 64 trajectories into 19 sequences of length 2048
Finished step 50
Packed 64 trajectories into 21 sequences of length 2048
Finished step 51
Packed 64 trajectories into 21 sequences of length 2048
Finished step 52
Packed 63 trajectories into 20 sequences of length 2048
Finished step 53
Packed 64 trajectories into 20 sequences of length 2048
Finished step 54
Packed 64 trajectories into 20 sequences of length 2048
Finished step 55
Packed 64 trajectories into 19 sequences of length 2048
Finished step 56
Packed 64 trajectories into 20 sequences of length 2048
Finished step 57
Packed 62 trajectories into 21 sequences of length 2048
Finished step 58
Packed 62 trajectories into 19 sequences of length 2048
Finished step 59
Packed 63 trajectories into 19 sequences of length 2048
Finished step 60
Packed 64 trajectories into 20 sequences of length 2048
Finished step 61
Packed 64 trajectories into 19 sequences of length 2048
Finished step 62
Packed 63 trajectories into 21 sequences of length 2048
Finished step 63
Packed 64 trajectories into 20 sequences of length 2048
Finished step 64
Packed 64 trajectories into 20 sequences of length 2048
Finished step 65
Packed 64 trajectories into 19 sequences of length 2048
Finished step 66
Packed 64 trajectories into 21 sequences of length 2048
Finished step 67
Packed 63 trajectories into 21 sequences of length 2048
Finished step 68
Packed 62 trajectories into 23 sequences of length 2048
Finished step 69
Packed 63 trajectories into 19 sequences of length 2048
Finished step 70
Packed 62 trajectories into 19 sequences of length 2048
Finished step 71
Packed 64 trajectories into 20 sequences of length 2048
Finished step 72
Packed 62 trajectories into 21 sequences of length 2048
Finished step 73
Packed 64 trajectories into 21 sequences of length 2048
Finished step 74
Packed 63 trajectories into 19 sequences of length 2048
Finished step 75
Packed 62 trajectories into 21 sequences of length 2048
Finished step 76
Packed 62 trajectories into 19 sequences of length 2048
Finished step 77
Packed 63 trajectories into 19 sequences of length 2048
Finished step 78
Packed 62 trajectories into 21 sequences of length 2048
Finished step 79
Packed 61 trajectories into 19 sequences of length 2048
Finished step 80
Packed 56 trajectories into 20 sequences of length 2048
Finished step 81
Packed 58 trajectories into 16 sequences of length 2048
Finished step 82
Packed 61 trajectories into 19 sequences of length 2048
Finished step 83
Packed 63 trajectories into 19 sequences of length 2048
Finished step 84
Packed 59 trajectories into 18 sequences of length 2048
Finished step 85
Packed 61 trajectories into 20 sequences of length 2048
Finished step 86
Packed 62 trajectories into 19 sequences of length 2048
Finished step 87
Packed 61 trajectories into 19 sequences of length 2048
Finished step 88
Packed 61 trajectories into 20 sequences of length 2048
Finished step 89
Packed 63 trajectories into 19 sequences of length 2048
Finished step 90
Packed 60 trajectories into 19 sequences of length 2048
Finished step 91
Packed 63 trajectories into 19 sequences of length 2048
Finished step 92
Packed 62 trajectories into 21 sequences of length 2048
Finished step 93
Packed 64 trajectories into 21 sequences of length 2048
Finished step 94
Packed 63 trajectories into 20 sequences of length 2048
Finished step 95
Packed 64 trajectories into 21 sequences of length 2048
Finished step 96
Packed 62 trajectories into 20 sequences of length 2048
Finished step 97
Packed 64 trajectories into 19 sequences of length 2048
Finished step 98
Packed 63 trajectories into 18 sequences of length 2048
Finished step 99

In [None]:
# @title Test Your Model!


# Import the normalization function from the training cell
# (This will be available if the training cell was run)
try:
    from __main__ import normalize_move, ALLOWED_MOVES
except ImportError:
    # Fallback if not imported
    ALLOWED_MOVES = {"rock", "paper", "scissors", "lizard", "spock"}

    def normalize_move(move: str) -> str:
        """Normalize and validate move to ensure it's in the allowed vocabulary."""
        if not move:
            return "nothing"

        # Convert to lowercase and strip whitespace
        move = move.lower().strip()

        # Direct match
        if move in ALLOWED_MOVES:
            return move

        # Handle common variations/misspellings
        move_mapping = {
            "papel": "paper",  # Spanish
            "piedra": "rock",  # Spanish
            "tijeras": "scissors",  # Spanish
            "roca": "rock",
            "papier": "paper",  # French
            "pierre": "rock",  # French
            "ciseaux": "scissors",  # French
            "lagarto": "lizard",  # Spanish
            "lézard": "lizard",  # French
        }

        if move in move_mapping:
            return move_mapping[move]

        # If no match, return "nothing" to indicate invalid move
        return "nothing"

async def test_model(model, num_tests: int = 5):
    """Test the trained model on rock-paper-scissors-lizard-spock games"""
    client = model.openai_client()

    tools: art.Tools = [
        {
            "type": "function",
            "function": {
                "name": "play_move",
                "description": "Play a move in rock-paper-scissors-lizard-spock",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "move": {
                            "type": "string",
                            "enum": ["rock", "paper", "scissors", "lizard", "spock"],
                            "description": "The move to play",
                        }
                    },
                    "required": ["move"],
                },
            },
        }
    ]

    def get_tool_call_id_and_move(chat_completion: ChatCompletion) -> tuple[str, str]:
        tool_calls = chat_completion.choices[0].message.tool_calls
        if not tool_calls:
            return "n/a", "nothing"
        tool_call = tool_calls[0]
        try:
            raw_move = json.loads(tool_call.function.arguments)["move"]
            # Normalize and validate the move
            normalized_move = normalize_move(raw_move)
            return tool_call.id, normalized_move
        except Exception:
            # Malformed JSON, a missing "move" key, or any other failure counts as an invalid move
            return tool_call.id, "nothing"

    print(f"\n🧪 Testing the trained model on {num_tests} games:\n")
    print("=" * 80)

    for test_num in range(1, num_tests + 1):
        print(f"\nTest {test_num}:")

        trajectory = art.Trajectory(
            messages_and_choices=[
                {
                    "role": "system",
                    "content": (
                        "You are a rock-paper-scissors-lizard-spock playing agent. "
                        "Use the play_move function tool to declare your moves. "
                        "Rules: Scissors cuts Paper, Paper covers Rock, Rock crushes Lizard, "
                        "Lizard poisons Spock, Spock smashes Scissors, Scissors decapitates Lizard, "
                        "Lizard eats Paper, Paper disproves Spock, Spock vaporizes Rock, Rock crushes Scissors."
                    ),
                },
                {
                    "role": "user",
                    "content": "What will your first move be?",
                },
            ],
            tools=tools,
            reward=0,
            metrics={
                "num_rounds": 0,
                "rock": 0,
                "paper": 0,
                "scissors": 0,
                "lizard": 0,
                "spock": 0,
                "nothing": 0,
            },
        )

        moves_played = []
        for round_num in range(5):
            chat_completion = await client.chat.completions.create(
                messages=trajectory.messages(),
                model=model.name,
                tools=tools,
                max_completion_tokens=100,
            )

            trajectory.messages_and_choices.append(chat_completion.choices[0])

            tool_call_id, move = get_tool_call_id_and_move(chat_completion)
            moves_played.append(move)
            # Safely update metrics with normalized moves
            # Ensure the move key exists in metrics (should always be valid after normalization)
            if move not in trajectory.metrics:
                trajectory.metrics[move] = 0
            trajectory.metrics[move] += 1
            trajectory.metrics["num_rounds"] += 1

            print(f"  Round {round_num + 1}: {move}")

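            if round_num < 4:  # No follow-up messages after the last round
                # Only send a tool response when the model actually made a tool call:
                # a "tool" message needs a matching tool_call_id from the assistant turn.
                if tool_call_id != "n/a":
                    trajectory.messages_and_choices.append(
                        {
                            "role": "tool",
                            "tool_call_id": tool_call_id,
                            "content": f"Your move was {move}. What will your next move be?",
                        }
                    )
                trajectory.messages_and_choices.append(
                    {
                        "role": "user",
                        "content": "What will your next move be?",
                    }
                )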

        # Print summary
        print(f"  Summary: {trajectory.metrics['rock']} rock, {trajectory.metrics['paper']} paper, "
              f"{trajectory.metrics['scissors']} scissors, {trajectory.metrics['lizard']} lizard, "
              f"{trajectory.metrics['spock']} spock")
        print(f"  Total moves: {trajectory.metrics['num_rounds']}")
        print("-" * 80)

    print("\n🎉 Testing completed!")
    print(f"\nYour model '{model.name}' has been tested on {num_tests} games.")
    print("\nTo use this model in production:")
    print("1. The model checkpoint is saved in ./.art/")
    print("2. You can load it using the vLLM library")
    print("3. Or continue training with more examples by adjusting TRAINING_STEPS")

# Run the test
if 'trained_model' in globals():
    asyncio.run(test_model(trained_model, num_tests=5))
else:
    print("⚠️  Please run the training cell first to create 'trained_model'")



🧪 Testing the trained model on 5 games:


Test 1:
  Round 1: rock
  Round 2: paper
  Round 3: scissors
  Round 4: lizard
  Round 5: scissors
  Summary: 1 rock, 1 paper, 2 scissors, 1 lizard, 0 spock
  Total moves: 5
--------------------------------------------------------------------------------

Test 2:
  Round 1: rock
  Round 2: lizard
  Round 3: paper
  Round 4: lizard
  Round 5: paper
  Summary: 1 rock, 2 paper, 0 scissors, 2 lizard, 0 spock
  Total moves: 5
--------------------------------------------------------------------------------

Test 3:
  Round 1: rock
  Round 2: lizard
  Round 3: lizard
  Round 4: paper
  Round 5: lizard
  Summary: 1 rock, 1 paper, 0 scissors, 3 lizard, 0 spock
  Total moves: 5
--------------------------------------------------------------------------------

Test 4:
  Round 1: rock
  Round 2: scissors
  Round 3: lizard
  Round 4: nothing
  Round 5: nothing
  Summary: 1 rock, 0 paper, 1 scissors, 1 lizard, 0 spock
  Total moves: 5
--------------------
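
The per-game summaries above print counts for the five valid moves but leave out `nothing`, so invalid rounds are easy to miss (Test 4 lists three moves out of five). The sketch below aggregates the moves copied from Tests 1 to 4 in the output above into shares per move, including `nothing`. The helper name `summarize_moves` and the `recorded_games` list are illustrative and not part of the notebook.

```python
from collections import Counter

VALID_MOVES = ("rock", "paper", "scissors", "lizard", "spock")


def summarize_moves(games: list[list[str]]) -> dict[str, float]:
    """Return the share of each move (plus 'nothing') across all rounds of all games."""
    counts = Counter(move for game in games for move in game)
    total = sum(counts.values())
    if total == 0:
        return {}
    keys = (*VALID_MOVES, "nothing")
    return {key: counts.get(key, 0) / total for key in keys}


# Moves copied by hand from Tests 1-4 in the output above (hypothetical variable name)
recorded_games = [
    ["rock", "paper", "scissors", "lizard", "scissors"],
    ["rock", "lizard", "paper", "lizard", "paper"],
    ["rock", "lizard", "lizard", "paper", "lizard"],
    ["rock", "scissors", "lizard", "nothing", "nothing"],
]

shares = summarize_moves(recorded_games)
for move, share in shares.items():
    print(f"{move:<9}{share:.0%}")
```

In the four games shown, the model always opens with rock and never plays spock. Tracking the `nothing` share next to the valid moves makes tool-calling failures visible as a metric rather than a gap in the summary.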

In [None]:
# @title Benchmark Against GPT Models


# --- Required: OPENROUTER_API_KEY ---
OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY", "")

# Import the normalization function from the training cell (if available)
try:
    from __main__ import normalize_move, ALLOWED_MOVES
except ImportError:
    ALLOWED_MOVES = {"rock", "paper", "scissors", "lizard", "spock"}

    def normalize_move(move: str) -> str:
        if not move:
            return "nothing"
        move = move.lower().strip()
        if move in ALLOWED_MOVES:
            return move
        move_mapping = {
            "papel": "paper",
            "piedra": "rock",
            "tijeras": "scissors",
            "roca": "rock",
            "papier": "paper",
            "pierre": "rock",
            "ciseaux": "scissors",
            "lagarto": "lizard",
            "lézard": "lizard",
        }
        return move_mapping.get(move, "nothing")


TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "play_move",
            "description": "Play a move in rock-paper-scissors-lizard-spock",
            "parameters": {
                "type": "object",
                "properties": {
                    "move": {
                        "type": "string",
                        "enum": ["rock", "paper", "scissors", "lizard", "spock"],
                        "description": "The move to play",
                    }
                },
                "required": ["move"],
            },
        },
    }
]

# Rock-Paper-Scissors-Lizard-Spock rules:
# Rock beats: scissors, lizard
# Paper beats: rock, spock
# Scissors beats: paper, lizard
# Lizard beats: spock, paper
# Spock beats: scissors, rock
BEATS = {
    "rock": {"scissors", "lizard"},
    "paper": {"rock", "spock"},
    "scissors": {"paper", "lizard"},
    "lizard": {"spock", "paper"},
    "spock": {"scissors", "rock"},
    "nothing": set(),
}


def extract_tool_call_id_and_move(chat_completion: ChatCompletion) -> Tuple[Optional[str], str]:
    """
    Returns (tool_call_id, normalized_move).
    If no tool call exists or parsing fails, returns (None, "nothing").
    """
    msg = chat_completion.choices[0].message
    tool_calls = getattr(msg, "tool_calls", None) or []
    if not tool_calls:
        return None, "nothing"

    tool_call = tool_calls[0]
    tool_call_id = getattr(tool_call, "id", None)

    try:
        args = json.loads(tool_call.function.arguments or "{}")
        raw_move = args.get("move", "")
        return tool_call_id, normalize_move(raw_move)
    except Exception:
        return tool_call_id, "nothing"


def rps_round_winner(a: str, b: str) -> Tuple[int, int]:
    """Returns (a_win, b_win) as 0/1."""
    if a in BEATS and b in BEATS[a]:
        return 1, 0
    if b in BEATS and a in BEATS[b]:
        return 0, 1
    return 0, 0


async def get_move(
    client: AsyncOpenAI,
    model_name: str,
    messages: list,
) -> Tuple[ChatCompletion, Optional[str], str]:
    """
    Calls the model, parses tool call move.
    If the model emits a tool_call, caller MUST append a tool response message.
    """
    completion = await client.chat.completions.create(
        model=model_name,
        messages=messages,
        tools=TOOLS,
        # OpenRouter + OpenAI compatibility: use max_tokens
        max_tokens=50,
    )
    tool_call_id, move = extract_tool_call_id_and_move(completion)
    return completion, tool_call_id, move


async def benchmark_model(model, opponent_model: str, num_games: int = 10, rounds_per_game: int = 5):
    """Benchmark the trained model against a GPT model via OpenRouter."""
    if not OPENROUTER_API_KEY:
        print("⚠️  OPENROUTER_API_KEY not found. Please set it in your environment variables.")
        return

    trained_client = model.openai_client()
    opponent_client = AsyncOpenAI(
        api_key=OPENROUTER_API_KEY,
        base_url="https://openrouter.ai/api/v1",
    )

    trained_wins = 0
    opponent_wins = 0
    ties = 0

    print(f"\n🏆 Benchmarking '{model.name}' vs '{opponent_model}'\n")
    print("=" * 80)

    base_system = {
        "role": "system",
        "content": (
            "You are a rock-paper-scissors-lizard-spock playing agent. "
            "You MUST use the play_move tool to declare your move each round. "
            "Only choose from: rock, paper, scissors, lizard, spock. "
            "Rules: Scissors cuts Paper, Paper covers Rock, Rock crushes Lizard, "
            "Lizard poisons Spock, Spock smashes Scissors, Scissors decapitates Lizard, "
            "Lizard eats Paper, Paper disproves Spock, Spock vaporizes Rock, Rock crushes Scissors."
        ),
    }

    for game_num in range(1, num_games + 1):
        trained_msgs = [base_system, {"role": "user", "content": "Round 1: play your move now."}]
        opponent_msgs = [base_system, {"role": "user", "content": "Round 1: play your move now."}]

        game_trained_wins = 0
        game_opponent_wins = 0

        for round_num in range(1, rounds_per_game + 1):
            # Get both moves concurrently
            trained_completion, opponent_completion = await asyncio.gather(
                get_move(trained_client, model.name, trained_msgs),
                get_move(opponent_client, opponent_model, opponent_msgs),
            )

            t_comp, t_tool_id, t_move = trained_completion
            o_comp, o_tool_id, o_move = opponent_completion

            # Append assistant messages (so conversation stays consistent)
            trained_msgs.append(t_comp.choices[0].message.model_dump())
            opponent_msgs.append(o_comp.choices[0].message.model_dump())

            # If a tool call was made, we MUST provide a tool response with matching tool_call_id
            if t_tool_id:
                trained_msgs.append(
                    {
                        "role": "tool",
                        "tool_call_id": t_tool_id,
                        "content": f"Recorded move: {t_move}",
                    }
                )
            else:
                # Nudge the model if it failed tool calling
                trained_msgs.append(
                    {"role": "user", "content": "You did not call the tool. Call play_move with rock/paper/scissors/lizard/spock."}
                )

            if o_tool_id:
                opponent_msgs.append(
                    {
                        "role": "tool",
                        "tool_call_id": o_tool_id,
                        "content": f"Recorded move: {o_move}",
                    }
                )
            else:
                opponent_msgs.append(
                    {"role": "user", "content": "You did not call the tool. Call play_move with rock/paper/scissors/lizard/spock."}
                )

            # Score round
            tw, ow = rps_round_winner(t_move, o_move)
            game_trained_wins += tw
            game_opponent_wins += ow

            # Provide next-round context
            if round_num < rounds_per_game:
                summary_for_trained = (
                    f"Round {round_num} result: you played {t_move}, opponent played {o_move}. "
                    f"Score so far (you-opponent): {game_trained_wins}-{game_opponent_wins}. "
                    f"Round {round_num+1}: play your move now."
                )
                summary_for_opponent = (
                    f"Round {round_num} result: you played {o_move}, opponent played {t_move}. "
                    f"Score so far (you-opponent): {game_opponent_wins}-{game_trained_wins}. "
                    f"Round {round_num+1}: play your move now."
                )
                trained_msgs.append({"role": "user", "content": summary_for_trained})
                opponent_msgs.append({"role": "user", "content": summary_for_opponent})

        # Determine game winner
        if game_trained_wins > game_opponent_wins:
            trained_wins += 1
            result = "✅ WIN"
        elif game_opponent_wins > game_trained_wins:
            opponent_wins += 1
            result = "❌ LOSS"
        else:
            ties += 1
            result = "🤝 TIE"

        print(f"Game {game_num}: {result} ({game_trained_wins}-{game_opponent_wins})")

    print("\n" + "=" * 80)
    print("\n📊 Benchmark Results:\n")
    print(f"  Wins:   {trained_wins}/{num_games} ({100*trained_wins/num_games:.1f}%)")
    print(f"  Losses: {opponent_wins}/{num_games} ({100*opponent_wins/num_games:.1f}%)")
    print(f"  Ties:   {ties}/{num_games} ({100*ties/num_games:.1f}%)")
    print(f"\n  Overall: {trained_wins}W-{opponent_wins}L-{ties}T")
    print("\n" + "=" * 80)


async def run_benchmarks(model):
    if not OPENROUTER_API_KEY:
        print("⚠️  OPENROUTER_API_KEY not found. Please set it in your environment variables.")
        return

    print("\n" + "=" * 80)
    print("🚀 Starting Benchmark Suite")
    print("=" * 80)

    await benchmark_model(model, "openai/gpt-4o", num_games=10, rounds_per_game=5)
    print("\n")
    await benchmark_model(model, "openai/gpt-4.1", num_games=10, rounds_per_game=5)

    print("\n✅ Benchmark suite completed!")


# Run benchmarks
if "trained_model" in globals():
    asyncio.run(run_benchmarks(trained_model))
else:
    print("⚠️  Please run the training cell first to create 'trained_model'")



🚀 Starting Benchmark Suite

🏆 Benchmarking '001' vs 'openai/gpt-4o'

Game 1: ❌ LOSS (0-4)
Game 2: ✅ WIN (4-1)
Game 3: ✅ WIN (4-1)
Game 4: ✅ WIN (3-2)
Game 5: ✅ WIN (2-1)
Game 6: ✅ WIN (2-1)
Game 7: 🤝 TIE (2-2)
Game 8: ❌ LOSS (1-3)
Game 9: ✅ WIN (3-2)
Game 10: ❌ LOSS (2-3)


📊 Benchmark Results:

  Wins:   6/10 (60.0%)
  Losses: 3/10 (30.0%)
  Ties:   1/10 (10.0%)

  Overall: 6W-3L-1T




🏆 Benchmarking '001' vs 'openai/gpt-4.1'

Game 1: ✅ WIN (3-2)
Game 2: ❌ LOSS (1-2)
Game 3: ✅ WIN (3-1)
Game 4: ✅ WIN (2-1)
Game 5: ✅ WIN (3-1)
Game 6: 🤝 TIE (2-2)
Game 7: ✅ WIN (3-2)
Game 8: ✅ WIN (3-2)
Game 9: ✅ WIN (2-1)
Game 10: 🤝 TIE (2-2)


📊 Benchmark Results:

  Wins:   7/10 (70.0%)
  Losses: 1/10 (10.0%)
  Ties:   2/10 (20.0%)

  Overall: 7W-1L-2T


✅ Benchmark suite completed!
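
One property of the scoring in the cell above is worth checking directly: `nothing` has an empty entry in `BEATS` and appears in no other entry, so a round where one side fails to produce a valid move is scored as a tie, not a loss. The sketch below copies `BEATS` and `rps_round_winner` from the benchmark cell, verifies that the rule table is complete, and contrasts it with `strict_round_winner`, a hypothetical variant (not part of the notebook) that counts an invalid move as a loss against any valid move.

```python
from itertools import permutations

MOVES = ("rock", "paper", "scissors", "lizard", "spock")

# Copied from the benchmark cell above
BEATS = {
    "rock": {"scissors", "lizard"},
    "paper": {"rock", "spock"},
    "scissors": {"paper", "lizard"},
    "lizard": {"spock", "paper"},
    "spock": {"scissors", "rock"},
    "nothing": set(),
}


def rps_round_winner(a: str, b: str) -> tuple[int, int]:
    """Returns (a_win, b_win) as 0/1, same logic as the benchmark cell."""
    if a in BEATS and b in BEATS[a]:
        return 1, 0
    if b in BEATS and a in BEATS[b]:
        return 0, 1
    return 0, 0


def strict_round_winner(a: str, b: str) -> tuple[int, int]:
    """Hypothetical variant: an invalid move loses to any valid move."""
    a_valid, b_valid = a in MOVES, b in MOVES
    if a_valid and not b_valid:
        return 1, 0
    if b_valid and not a_valid:
        return 0, 1
    return rps_round_winner(a, b)


# Each valid move beats exactly two others, and every pair of
# distinct valid moves has exactly one winner.
assert all(len(BEATS[m]) == 2 for m in MOVES)
for a, b in permutations(MOVES, 2):
    assert sum(rps_round_winner(a, b)) == 1

print(rps_round_winner("rock", "nothing"))     # (0, 0): scored as a tie
print(strict_round_winner("rock", "nothing"))  # (1, 0): invalid move loses
```

Which rule is appropriate depends on what the benchmark should measure. The original scoring isolates move quality among valid rounds, while the strict variant also penalizes tool-calling failures.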


In [None]:
# @title Benchmark Against GPT Models
# --- Required: OPENROUTER_API_KEY ---
OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY", "")

# Import the normalization function from the training cell (if available)
try:
    from __main__ import normalize_move, ALLOWED_MOVES
except ImportError:
    ALLOWED_MOVES = {"rock", "paper", "scissors", "lizard", "spock"}

    def normalize_move(move: str) -> str:
        if not move:
            return "nothing"
        move = move.lower().strip()
        if move in ALLOWED_MOVES:
            return move
        move_mapping = {
            "papel": "paper",
            "piedra": "rock",
            "tijeras": "scissors",
            "roca": "rock",
            "papier": "paper",
            "pierre": "rock",
            "ciseaux": "scissors",
            "lagarto": "lizard",
            "lézard": "lizard",
        }
        return move_mapping.get(move, "nothing")


TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "play_move",
            "description": "Play a move in rock-paper-scissors-lizard-spock",
            "parameters": {
                "type": "object",
                "properties": {
                    "move": {
                        "type": "string",
                        "enum": ["rock", "paper", "scissors", "lizard", "spock"],
                        "description": "The move to play",
                    }
                },
                "required": ["move"],
            },
        },
    }
]

# Rock-Paper-Scissors-Lizard-Spock rules:
# Rock beats: scissors, lizard
# Paper beats: rock, spock
# Scissors beats: paper, lizard
# Lizard beats: spock, paper
# Spock beats: scissors, rock
BEATS = {
    "rock": {"scissors", "lizard"},
    "paper": {"rock", "spock"},
    "scissors": {"paper", "lizard"},
    "lizard": {"spock", "paper"},
    "spock": {"scissors", "rock"},
    "nothing": set(),
}


def extract_tool_call_id_and_move(chat_completion: ChatCompletion) -> Tuple[Optional[str], str]:
    """
    Returns (tool_call_id, normalized_move).
    If no tool call exists or parsing fails, returns (None, "nothing").
    """
    msg = chat_completion.choices[0].message
    tool_calls = getattr(msg, "tool_calls", None) or []
    if not tool_calls:
        return None, "nothing"

    tool_call = tool_calls[0]
    tool_call_id = getattr(tool_call, "id", None)

    try:
        args = json.loads(tool_call.function.arguments or "{}")
        raw_move = args.get("move", "")
        return tool_call_id, normalize_move(raw_move)
    except Exception:
        return tool_call_id, "nothing"


def rps_round_winner(a: str, b: str) -> Tuple[int, int]:
    """Returns (a_win, b_win) as 0/1."""
    if a in BEATS and b in BEATS[a]:
        return 1, 0
    if b in BEATS and a in BEATS[b]:
        return 0, 1
    return 0, 0


async def get_move(
    client: AsyncOpenAI,
    model_name: str,
    messages: list,
) -> Tuple[ChatCompletion, Optional[str], str]:
    """
    Calls the model, parses tool call move.
    If the model emits a tool_call, caller MUST append a tool response message.
    """
    completion = await client.chat.completions.create(
        model=model_name,
        messages=messages,
        tools=TOOLS,
        # OpenRouter + OpenAI compatibility: use max_tokens
        max_tokens=50,
    )
    tool_call_id, move = extract_tool_call_id_and_move(completion)
    return completion, tool_call_id, move


async def benchmark_model(model, opponent_model: str, num_games: int = 10, rounds_per_game: int = 5):
    """Benchmark the trained model against a GPT model via OpenRouter."""
    if not OPENROUTER_API_KEY:
        print("⚠️  OPENROUTER_API_KEY not found. Please set it in your environment variables.")
        return

    trained_client = model.openai_client()
    opponent_client = AsyncOpenAI(
        api_key=OPENROUTER_API_KEY,
        base_url="https://openrouter.ai/api/v1",
    )

    trained_wins = 0
    opponent_wins = 0
    ties = 0

    print(f"\n🏆 Benchmarking '{model.name}' vs '{opponent_model}'\n")
    print("=" * 80)

    base_system = {
        "role": "system",
        "content": (
            "You are a rock-paper-scissors-lizard-spock playing agent. "
            "You MUST use the play_move tool to declare your move each round. "
            "Only choose from: rock, paper, scissors, lizard, spock. "
            "Rules: Scissors cuts Paper, Paper covers Rock, Rock crushes Lizard, "
            "Lizard poisons Spock, Spock smashes Scissors, Scissors decapitates Lizard, "
            "Lizard eats Paper, Paper disproves Spock, Spock vaporizes Rock, Rock crushes Scissors."
        ),
    }

    for game_num in range(1, num_games + 1):
        trained_msgs = [base_system, {"role": "user", "content": "Round 1: play your move now."}]
        opponent_msgs = [base_system, {"role": "user", "content": "Round 1: play your move now."}]

        game_trained_wins = 0
        game_opponent_wins = 0

        for round_num in range(1, rounds_per_game + 1):
            # Get both moves concurrently
            trained_completion, opponent_completion = await asyncio.gather(
                get_move(trained_client, model.name, trained_msgs),
                get_move(opponent_client, opponent_model, opponent_msgs),
            )

            t_comp, t_tool_id, t_move = trained_completion
            o_comp, o_tool_id, o_move = opponent_completion

            # Append assistant messages (so conversation stays consistent)
            trained_msgs.append(t_comp.choices[0].message.model_dump())
            opponent_msgs.append(o_comp.choices[0].message.model_dump())

            # If a tool call was made, we MUST provide a tool response with matching tool_call_id
            if t_tool_id:
                trained_msgs.append(
                    {
                        "role": "tool",
                        "tool_call_id": t_tool_id,
                        "content": f"Recorded move: {t_move}",
                    }
                )
            else:
                # Nudge the model if it failed tool calling
                trained_msgs.append(
                    {"role": "user", "content": "You did not call the tool. Call play_move with rock/paper/scissors/lizard/spock."}
                )

            if o_tool_id:
                opponent_msgs.append(
                    {
                        "role": "tool",
                        "tool_call_id": o_tool_id,
                        "content": f"Recorded move: {o_move}",
                    }
                )
            else:
                opponent_msgs.append(
                    {"role": "user", "content": "You did not call the tool. Call play_move with rock/paper/scissors/lizard/spock."}
                )

            # Score round
            tw, ow = rps_round_winner(t_move, o_move)
            game_trained_wins += tw
            game_opponent_wins += ow

            # Provide next-round context
            if round_num < rounds_per_game:
                summary_for_trained = (
                    f"Round {round_num} result: you played {t_move}, opponent played {o_move}. "
                    f"Score so far (you-opponent): {game_trained_wins}-{game_opponent_wins}. "
                    f"Round {round_num+1}: play your move now."
                )
                summary_for_opponent = (
                    f"Round {round_num} result: you played {o_move}, opponent played {t_move}. "
                    f"Score so far (you-opponent): {game_opponent_wins}-{game_trained_wins}. "
                    f"Round {round_num+1}: play your move now."
                )
                trained_msgs.append({"role": "user", "content": summary_for_trained})
                opponent_msgs.append({"role": "user", "content": summary_for_opponent})

        # Determine game winner
        if game_trained_wins > game_opponent_wins:
            trained_wins += 1
            result = "✅ WIN"
        elif game_opponent_wins > game_trained_wins:
            opponent_wins += 1
            result = "❌ LOSS"
        else:
            ties += 1
            result = "🤝 TIE"

        print(f"Game {game_num}: {result} ({game_trained_wins}-{game_opponent_wins})")

    print("\n" + "=" * 80)
    print("\n📊 Benchmark Results:\n")
    print(f"  Wins:   {trained_wins}/{num_games} ({100*trained_wins/num_games:.1f}%)")
    print(f"  Losses: {opponent_wins}/{num_games} ({100*opponent_wins/num_games:.1f}%)")
    print(f"  Ties:   {ties}/{num_games} ({100*ties/num_games:.1f}%)")
    print(f"\n  Overall: {trained_wins}W-{opponent_wins}L-{ties}T")
    print("\n" + "=" * 80)


async def run_benchmarks(model):
    if not OPENROUTER_API_KEY:
        print("⚠️  OPENROUTER_API_KEY not found. Please set it in your environment variables.")
        return

    print("\n" + "=" * 80)
    print("🚀 Starting Benchmark Suite")
    print("=" * 80)

    await benchmark_model(model, "openai/gpt-5.2", num_games=10, rounds_per_game=5)
    print("\n")
    await benchmark_model(model, "openai/gpt-5.2-chat", num_games=10, rounds_per_game=5)

    print("\n✅ Benchmark suite completed!")


# Run benchmarks
if "trained_model" in globals():
    asyncio.run(run_benchmarks(trained_model))
else:
    print("⚠️  Please run the training cell first to create 'trained_model'")



🚀 Starting Benchmark Suite

🏆 Benchmarking '001' vs 'openai/gpt-5.2'

Game 1: 🤝 TIE (1-1)
Game 2: ✅ WIN (2-1)
Game 3: ✅ WIN (3-1)
Game 4: ✅ WIN (2-1)
Game 5: 🤝 TIE (1-1)
Game 6: ❌ LOSS (2-3)
Game 7: ❌ LOSS (1-2)
Game 8: ✅ WIN (2-1)
Game 9: ❌ LOSS (1-2)
Game 10: ✅ WIN (2-1)


📊 Benchmark Results:

  Wins:   5/10 (50.0%)
  Losses: 3/10 (30.0%)
  Ties:   2/10 (20.0%)

  Overall: 5W-3L-2T




🏆 Benchmarking '001' vs 'openai/gpt-5.2-chat'

Game 1: ❌ LOSS (1-4)
Game 2: 🤝 TIE (2-2)
Game 3: ✅ WIN (2-1)
Game 4: 🤝 TIE (1-1)
Game 5: 🤝 TIE (0-0)
Game 6: ✅ WIN (3-1)
Game 7: ❌ LOSS (1-2)
Game 8: ❌ LOSS (1-4)
Game 9: ✅ WIN (3-1)
Game 10: 🤝 TIE (2-2)


📊 Benchmark Results:

  Wins:   3/10 (30.0%)
  Losses: 3/10 (30.0%)
  Ties:   4/10 (40.0%)

  Overall: 3W-3L-4T


✅ Benchmark suite completed!
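
Each benchmark above rests on only ten games, so the win percentages carry considerable uncertainty. As a rough way to see this, the sketch below computes a Wilson score interval for the win rate of each run, using the W-L-T tallies copied from the outputs above. It assumes games are independent and treats ties as non-wins. The function `wilson_interval` and the `results` dictionary are illustrative and not part of the notebook.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# (wins, losses, ties) of the trained model, copied from the benchmark outputs above
results = {
    "openai/gpt-4o": (6, 3, 1),
    "openai/gpt-4.1": (7, 1, 2),
    "openai/gpt-5.2": (5, 3, 2),
    "openai/gpt-5.2-chat": (3, 3, 4),
}

for opponent, (wins, losses, ties) in results.items():
    n = wins + losses + ties
    low, high = wilson_interval(wins, n)
    print(f"{opponent:<22} win rate {wins / n:.0%}, approx. 95% interval {low:.0%} to {high:.0%}")
```

With samples this small the intervals are wide, so differences between opponents should be read as indicative only. Raising `num_games` in `run_benchmarks` narrows them.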
