# Open Notebook & Additional Resources

<a target="_blank" href="https://colab.research.google.com/github/Nicolepcx/ORM-self-improving-ai-agents-course/blob/main/hands_on/session_02_HANDS_ON_reward_function.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
<a target="_blank" href="https://learning.oreilly.com/library/view/ai-agents-the/0642572247775/">
  <img src="https://img.shields.io/badge/AI%20Agents%20Book-Read%20on%20O'Reilly-d40101?style=flat" alt="AI Agents Book ‚Äì Read on O'Reilly"/>
</a>





<font color="red" size="10">
<b>HANDS-ON TIME: 15 mins</b>
</font>

# Timer

In [4]:
SET_TIMER = False  # False, True, or minutes as a number

import requests, types
url = "https://raw.githubusercontent.com/Nicolepcx/ORM-self-improving-ai-agents-course/main/timer.py"

timer = types.ModuleType("timer")
exec(requests.get(url).text, timer.__dict__)

timer.start_exam_timer(enabled=SET_TIMER, minutes=15, warn_minutes=5)

# About this Notebook

## Train a Small Model with RL-Style Feedback (ART + RULER)

Welcome to this hands-on lab. You will train a small open model to perform a **custom task** reinforcement learning.

This notebook is essentially: **RL intuition applied to instruction following.**

<br>

This lab demonstrates the core building blocks behind **self-improving agents**.


## 1. What you are actually training

You are not training a model from scratch. You start from a strong **base model**:

* `BASE_MODEL = "Qwen/Qwen3-4B-Instruct-2507"` or similar.

Then you train a **LoRA adapter** on top, using [ART‚Äôs training loop](https://art.openpipe.ai/getting-started/about). That is why the model is lightweight to train and fast to iterate on.

### The Mapping

| Component | This Notebook | RL Framing |
| --- | --- | --- |
| **State** | The prompt content, including the system prompt and user input | Current observation |
| **Action** | The model‚Äôs next token at each decoding step | Action sequence |
| **Policy** | Transformer weights (base) + LoRA adapter (trainable) | Stochastic policy |
| **Trajectory** | Messages plus the assistant completion | Episode transcript |
| **Reward** | Judge score in `[0, 1]` | Scalar return |

---

## 2. The loop: generate, judge, learn

At each training step, the notebook does three things:

### A. Generate rollouts

For each training input, the model produces multiple candidate outputs:

* `rollouts_per_group = 2`

This is the core idea behind relative methods like GRPO or RULER-style learning: you do not need one perfect label, you need **comparisons** and **ranking signals**.

### B. Judge rollouts (RULER-style)

A separate judge model scores each candidate output:

* `RULER_MODEL = "openrouter/deepseek/deepseek-r1-0528"`

The judge is instructed to return strict JSON and provide a per-candidate score:

* `1.0` means the output matches the task format and intent
* `0.0` means it violates the format or ignores the task

This notebook uses `robust_score_group(...)` which is resilient to:
* code fences
* extra text around JSON
* partial or malformed responses

### C. Update the policy

ART then trains the LoRA adapter so that high-reward rollouts become more likely.

This is the same conceptual move as policy optimization in RL:
* good behavior becomes more probable
* bad behavior is discouraged

---

## 3. Task descriptions are your ‚Äúreward specification‚Äù

The most important control knob in this lab is the task description:

```python
TASK_DESCRIPTION = GRAMMARLY_TASK_DESCRIPTION
````


Note: Parts of the Notebook are adapted from the [ART examples](https://art.openpipe.ai/getting-started/notebooks)


This notebook is for the *Hands-on* for Session 2 for Develop Self-Improving AI Agents with Reinforcement Learning Live Event with O'Reilly Media by
[Nicole Koenigstein](https://www.linkedin.com/in/nicole-koenigstein/).

<font color="red" size="5">
<b>Attention for the Notebook to work </b>
</font>
<br>

you need an `OPENROUTER_API_KEY`! [Get your key here](https://openrouter.ai/)   

In [5]:
# @title Installation
# Portions adapted from Unsloth Notebooks (https://github.com/unslothai/notebooks)
# Copyright (c) Unsloth contributors.
# License: GNU LGPL v3.0.
# Modifications by OpenPipe

%%capture
import os

if "COLAB_" not in "".join(os.environ.keys()):
    !uv pip install openpipe-art[backend]==0.4.11 tenacity "mcp>=1.11.0" "gql<4" aiohttp --prerelease allow --no-cache-dir
else:
    try:
        import numpy

        get_numpy = f"numpy=={numpy.__version__}"
    except:
        get_numpy = "numpy"
    try:
        import subprocess

        is_t4 = "Tesla T4" in str(subprocess.check_output(["nvidia-smi"]))
    except:
        is_t4 = False
    get_vllm, get_triton = (
        ("vllm==0.9.2", "triton==3.2.0") if is_t4 else ("vllm", "triton")
    )
    !uv pip install --upgrade \
        openpipe-art[backend]==0.4.11 tenacity pillow==11.3.0 protobuf==5.29.5 {get_vllm} {get_numpy} --prerelease allow --no-cache-dir
    !uv pip install -qqq {get_triton}

# Set API Keys

In [6]:
import os
from dotenv import load_dotenv

load_dotenv()

WANDB_API_KEY = os.getenv("WANDB_API_KEY")
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")

# OpenRouter normal path
os.environ["OPENROUTER_API_KEY"] = OPENROUTER_API_KEY

# Anthropic compatibility path redirected to OpenRouter
os.environ["ANTHROPIC_BASE_URL"] = "https://openrouter.ai/api"
os.environ["ANTHROPIC_AUTH_TOKEN"] = OPENROUTER_API_KEY
os.environ["ANTHROPIC_API_KEY"] = ""  # must be explicitly empty


In [7]:
if not OPENROUTER_API_KEY:
    raise ValueError("OPENROUTER_API_KEY is required for data generation and RULER evaluation.")

# Optional W&B
if WANDB_API_KEY:
    os.environ["WANDB_API_KEY"] = WANDB_API_KEY
else:
    print("WANDB_API_KEY is not set. We'll skip logging metrics to Weights & Biases.")


In [8]:
#@title Clean reinstall of Pillow to resolve 'cannot import name _Ink'
!uv pip uninstall -y pillow pillow-core
!uv pip install --upgrade --force-reinstall "pillow==10.4.0"

import PIL, sys
print("Pillow version:", PIL.__version__)
print(sys.executable)


[1m[31merror:[0m unexpected argument '[33m-y[0m' found

  [32mtip:[0m to pass '[33m-y[0m' as a value, use '[32m-- -y[0m'

[1m[32mUsage:[0m [1m[36muv pip uninstall[0m [36m[OPTIONS][0m [36m<PACKAGE|--requirements <REQUIREMENTS>>[0m

For more information, try '[1m[36m--help[0m'.
[2mUsing Python 3.12.12 environment at: /usr[0m
[2K[2mResolved [1m1 package[0m [2min 43ms[0m[0m
[2K[2mPrepared [1m1 package[0m [2min 69ms[0m[0m
[2mUninstalled [1m1 package[0m [2min 4ms[0m[0m
[2K[2mInstalled [1m1 package[0m [2min 3ms[0m[0m
 [31m-[39m [1mpillow[0m[2m==11.3.0[0m
 [32m+[39m [1mpillow[0m[2m==10.4.0[0m
Pillow version: 11.3.0
/usr/bin/python3


# Settings

In [9]:
# Model configuration
MODEL_NAME = "jira-model-001"  # Name for your trained model
PROJECT_NAME = "auto-rl"  # Project name for tracking


# Training configuration
TRAINING_CONFIG = {
    "num_training_inputs": 25,  # Number of training inputs to generate
    "groups_per_step": 1,  # Inputs to process per training step
    "num_epochs": 3,  # Number of times through all data
    "rollouts_per_group": 2,  # Different responses per input (for RULER comparison)
    "learning_rate": 1e-5,  # Learning rate
    "max_training_steps": None,  # Maximum training steps (set to None for no limit)
}

NUM_TEST_INPUTS = 5  # Number of test inputs to generate
RULER_MODEL = "openrouter/deepseek/deepseek-r1-0528"  # Model for RULER evaluation
SYSTEM_PROMPT_GENERATION_MODEL = "openrouter/moonshotai/kimi-k2"
INPUT_GENERATION_MODEL = "openrouter/moonshotai/kimi-k2"


# GPU configuration (keep these as-is unless you have a reason to change them, since the setup already leverages almost all memor yfor a A100 with 40GB)
MAX_SEQ_LENGTH = 2048  # Maximum sequence length
GPU_MEMORY_UTILIZATION = 0.6  # GPU memory usage (0.0-1.0)


# Hands-on

<font color="red" size="10">
<b>TODO: </b>
</font>
<br>
<font color="black" size="5">
<b>Teach your model another skill, set <code>TASK_DESCRIPTION</code> to one of the descriptions from Sample Taks or create an own description.</b>
</font>



# Sample Tasks

In [10]:
GRAMMARLY_TASK_DESCRIPTION = """
Read the user's text and check if it has any grammar or spelling errors. If it does, then fix them by wrapping the
erroneous text in <original></original> tags and the corrected text in <corrected></corrected> tags.

For example, if the user's text is "I are going to the store to buy sum eggs", the output should be:

I <original>are</original><corrected>am</corrected> going to the store to buy <original>sum</original><corrected>some</corrected> eggs.
"""

PM_TO_CODER_TASK_DESCRIPTION = """
Convert the user's project manager style text into clear, actionable coding instructions.

Output format must be STRICT and follow exactly this schema, in this order:

TITLE: <one line>
GOAL: <one line>
CONTEXT: <one paragraph, optional if missing>
REQUIREMENTS:
- <bullet>
- <bullet>
ACCEPTANCE_CRITERIA:
- <bullet>
- <bullet>
EDGE_CASES:
- <bullet>
- <bullet>
QUESTIONS:
- <bullet questions that must be clarified, if any>

Rules:
- Preserve intent. Remove fluff, buzzwords, and vague phrases.
- If something is ambiguous, do not guess. Put it into QUESTIONS.
- If the user mentions a system, UI, API, database, auth, or performance, reflect that in REQUIREMENTS or EDGE_CASES.
- Keep it concise and engineering ready.
"""

EMOJIFY_TASK_DESCRIPTION = """
Convert any incoming story provided by the user into a corresponding sequence of emojis.
For example, if the user says, "I went to the store to buy some eggs but forgot my wallet",
you should convert it into something like:"üö∂‚Äç‚ôÇÔ∏è‚û°Ô∏èüè¨üõíü•ö‚Ä¶üò±üí≥‚ùå".
"""

CHANGELOG_TASK_DESCRIPTION = """
Convert the user's description into:
- a concise Git commit message (imperative mood)
- a short changelog entry for end users

Format:
COMMIT:
CHANGELOG:

Rules:
- Commit is max 72 characters.
- Changelog is 1 to 3 sentences, non technical.
"""

LINKEDIN_REWRITE_DIFF_TASK_DESCRIPTION = """
Rewrite the user's LinkedIn post to be more engaging.

Output BOTH versions:

ORIGINAL:
<original text>

REWRITTEN:
<rewritten text>

Rules:
- Max 120 words
- Clear point of view
- Neutral, confident tone
- Assume an informed audience
- Avoid buzzwords like "disrupt", "leverage", "game changer"
"""

CORPORATE_JARGON_TASK_DESCRIPTION = """
Convert any incoming text into a corresponding sequence of corporate jargon.
For example, if the user says, "I went to the store to buy some eggs but forgot my wallet",
you should convert it into something like:
"During a routine procurement initiative, I proceeded to the designated retail partner to acquire
essential inventory units (hen‚Äëderived ova). However, execution was impeded when I identified
a critical absence of my primary fiscal instrument, necessitating immediate reassessment of the
transaction workflow and postponement of asset acquisition.".
"""

In [11]:
# Describe your custom task (be specific!)
# CUSTOM_TASK_DESCRIPTION = """ """

TASK_DESCRIPTION = GRAMMARLY_TASK_DESCRIPTION

# Choose the base model to train
BASE_MODEL = "Qwen/Qwen3-4B-Instruct-2507"  # Options: "Qwen/Qwen2.5-3B-Instruct", "Qwen/Qwen2.5-7B-Instruct", etc.

# Imports

In [12]:
import os
import re
import json
import random
from typing import List, Optional

import torch
import weave
from dotenv import load_dotenv
from litellm import acompletion
from pydantic import BaseModel, Field

import art
from art.local import LocalBackend
from art.utils import iterate_dataset
from art.utils.litellm import convert_litellm_choice_to_openai
import torch
from unsloth import FastLanguageModel

from google.colab import userdata
from huggingface_hub import login, whoami, create_repo


  IMAGEMAGICK_BINARY = r"C:\Program Files\ImageMagick-6.8.8-Q16\magick.exe"
  lines_video = [l for l in lines if ' Video: ' in l and re.search('\d+x\d+', l)]
  rotation_lines = [l for l in lines if 'rotate          :' in l and re.search('\d+$', l)]
  match = re.search('\d+$', rotation_line)
  if event.key is 'enter':





# Robust JSON extraction

In [13]:
def extract_json_object(text: str) -> str:
    if text is None:
        raise ValueError("No text to parse")

    t = text.strip()

    # Strip fenced code blocks if present
    if t.startswith("```"):
        t = re.sub(r"^```[a-zA-Z0-9_-]*\s*", "", t)  # opening fence
        t = re.sub(r"\s*```$", "", t).strip()        # closing fence

    # Find first '{' or '['
    m = re.search(r"[\{\[]", t)
    if not m:
        raise ValueError(f"Could not find JSON start in: {t[:200]!r}")
    start = m.start()

    # Scan to matching closing brace/bracket while respecting strings
    stack: list[str] = []
    in_str = False
    esc = False

    for i in range(start, len(t)):
        ch = t[i]

        if in_str:
            if esc:
                esc = False
            elif ch == "\\":
                esc = True
            elif ch == '"':
                in_str = False
            continue

        if ch == '"':
            in_str = True
            continue

        if ch in "{[":
            stack.append(ch)
        elif ch in "}]":
            if not stack:
                continue
            opening = stack.pop()
            if (opening == "{" and ch != "}") or (opening == "[" and ch != "]"):
                raise ValueError("Mismatched JSON brackets")
            if not stack:
                return t[start : i + 1].strip()

    raise ValueError("Could not find end of JSON object")


# Training input generation (robust)

In [14]:
class TrainingInput(BaseModel):
    input: str = Field(description="The input text for the task")


class TrainingDataset(BaseModel):
    inputs: List[TrainingInput] = Field(description="List of training inputs")


async def generate_training_inputs(task_description: str, num_examples: int = 50) -> List[str]:
    """
    Generate diverse training inputs for the given task.
    Robust to models returning fewer items, wrong shape, code fences, or extra text.
    """
    system_prompt = f"""
You generate training inputs.

Task:
{task_description}

Return STRICT JSON only. No prose. No markdown. No code fences.

Schema:
{{
  "inputs": [
    {{"input": "string"}},
    ...
  ]
}}

Rules:
- Return exactly {num_examples} items in "inputs".
- Each "input" must be realistic and different.
- No duplicates.
""".strip()

    inputs: list[str] = []
    seen: set[str] = set()

    attempt = 0
    while attempt < 8 and len(inputs) < num_examples:
        attempt += 1
        remaining = num_examples - len(inputs)

        user_prompt = f"""
Generate {remaining} more items to complete the dataset.

Already have {len(inputs)} items.
Do not repeat any of these existing inputs:
{json.dumps(inputs, ensure_ascii=False, indent=2)}

Return STRICT JSON only with the same schema.
""".strip()

        print(f"Generating training inputs, attempt {attempt}, remaining {remaining}...")

        raw = ""
        try:
            response = await acompletion(
                model=INPUT_GENERATION_MODEL,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt},
                ],
                temperature=0.7,
            )

            raw = response.choices[0].message.content or ""
            if not raw.strip():
                raise ValueError("Empty model content.")

            clean = extract_json_object(raw)
            dataset = TrainingDataset.model_validate_json(clean)

            for ex in dataset.inputs:
                s = (ex.input or "").strip()
                if not s or s in seen:
                    continue
                seen.add(s)
                inputs.append(s)
                if len(inputs) >= num_examples:
                    break

        except Exception as e:
            print(f"Attempt {attempt} failed: {type(e).__name__}: {e}")
            print(f"Raw preview: {raw[:400]!r}")

    if len(inputs) < num_examples:
        raise ValueError(f"Failed to generate {num_examples} training inputs. Got {len(inputs)}.")

    return inputs

# Robust judge scoring (RULER-like) that never assumes perfect JSON

In [15]:
class JudgeItem(BaseModel):
    idx: int
    score: float
    rationale: str


class JudgeResponse(BaseModel):
    items: List[JudgeItem]


async def robust_score_group(
    group: art.TrajectoryGroup,
    judge_model: str,
    task_description: str,
    temperature: float = 0.0,
) -> art.TrajectoryGroup:
    """
    Robust scoring that assigns reward in [0, 1] per trajectory.
    Works even if the judge wraps JSON in markdown fences.
    """
    trajectories = list(group.trajectories)

    candidates = []
    for i, t in enumerate(trajectories):
        msgs = t.messages()
        assistant = msgs[-1]["content"] if msgs else ""
        candidates.append({"idx": i, "assistant_output": assistant})

    system = (
        "You are a strict evaluator.\n"
        "Return STRICT JSON only. No prose. No markdown. No code fences.\n"
        "Score each candidate output for how well it satisfies the task.\n"
        "Scores must be floats in [0, 1].\n"
    )

    user = (
        f"TASK:\n{task_description}\n\n"
        "CANDIDATES:\n"
        f"{json.dumps(candidates, ensure_ascii=False, indent=2)}\n\n"
        "Scoring rubric:\n"
        "- 1.0: Exactly matches required format and content is coherent and extracted from input\n"
        "- 0.5: Mostly matches format but missing details or minor format violations\n"
        "- 0.0: Ignores format, adds extra commentary, or does not perform task\n\n"
        'Return JSON with schema: {"items":[{"idx":0,"score":0.0,"rationale":"..."}]}\n'
    )

    resp = await acompletion(
        model=judge_model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        temperature=temperature,
    )

    raw = resp.choices[0].message.content or ""
    clean = extract_json_object(raw)
    judged = JudgeResponse.model_validate_json(clean)

    score_by_idx = {}
    for it in judged.items:
        # clamp
        score_by_idx[it.idx] = max(0.0, min(1.0, float(it.score)))

    for i, t in enumerate(trajectories):
        t.reward = score_by_idx.get(i, 0.0)

    return art.TrajectoryGroup(trajectories=trajectories)

# Generate dataset

In [16]:
training_inputs = await generate_training_inputs(
    TASK_DESCRIPTION, num_examples=TRAINING_CONFIG["num_training_inputs"]
)
print(f"\nGenerated {len(training_inputs)} training inputs!")
print("\nFirst 5 examples:")
for i, input_text in enumerate(training_inputs[:5]):
    print(f"\nExample {i + 1}: {input_text}")

Generating training inputs, attempt 1, remaining 25...

Generated 25 training inputs!

First 5 examples:

Example 1: She don't have no idea what <original>thier</original><corrected>their</corrected> talking about.

Example 2: We was planning to visit <original>paris</original><corrected>Paris</corrected> next spring.

Example 3: He <original>loose</original><corrected>lose</corrected> his keys everytime he goes out.

Example 4: The <original>childrens</original><corrected>children's</corrected> toys were scattered across the <original>flor</original><corrected>floor</corrected>.

Example 5: I have went to that restaurant <original>alot</original><corrected>a lot</corrected> of times.


In [17]:
# @title ‚úÖ Pause before training (read + confirm)

# This cell is here on purpose:
# It prevents "Run all" from immediately creating a backend, registering a model,
# and starting a potentially expensive workflow.

print("üõë STOP: Before you continue, make sure you want to train the model!\n")

# Set this to True ONLY when you're ready to proceed.
I_UNDERSTAND_AND_WANT_TO_CONTINUE = False

if not I_UNDERSTAND_AND_WANT_TO_CONTINUE:
    raise RuntimeError(
        "Paused intentionally. Set I_UNDERSTAND_AND_WANT_TO_CONTINUE = True, then re-run this cell."
    )

print("\n‚úÖ Continuing to model creation and backend registration...")


üõë STOP: Before you continue, make sure you want to train the model!



RuntimeError: Paused intentionally. Set I_UNDERSTAND_AND_WANT_TO_CONTINUE = True, then re-run this cell.

# Create model + backend

In [15]:
random.seed(42)

model = art.TrainableModel(
    name=MODEL_NAME,
    project=PROJECT_NAME,
    base_model=BASE_MODEL,
)

# GPU friendly overrides
if torch.cuda.get_device_properties(0).major < 8:
    model._internal_config = art.dev.InternalModelConfig(
        init_args=art.dev.InitArgs(max_seq_length=MAX_SEQ_LENGTH),
        engine_args=art.dev.EngineArgs(
            enforce_eager=True,
            gpu_memory_utilization=GPU_MEMORY_UTILIZATION,
        ),
    )

backend = (
    LocalBackend(in_process=True, path="./.art")
    if torch.cuda.get_device_properties(0).major < 8
    else LocalBackend()
)

await model.register(backend)

print("Model created!")
print("Base model:", BASE_MODEL)
print("Model name:", MODEL_NAME)
print("Project name:", PROJECT_NAME)


  | |_| | '_ \/ _` / _` |  _/ -_)

[34m[1mwandb[0m: Currently logged in as: [33mnicolepcx[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


  * regex for parameter names, must start with `re:`, e.g. `re:language\.layers\..+\.q_proj.weight`.



INFO 01-10 08:31:22 [__init__.py:235] Automatically detected platform cuda.
Model created!
Base model: Qwen/Qwen3-4B-Instruct-2507
Model name: jira-model-001
Project name: auto-rl


# Weave init (optional)

In [16]:
if os.getenv("WANDB_API_KEY", ""):
    weave.init(PROJECT_NAME, settings={"print_call_link": False})

# System prompt generation

In [17]:
async def generate_system_prompt(task_description: str) -> str:
    messages = [
        {
            "role": "system",
            "content": (
                "Generate a clear, concise system prompt for a model that will perform the following task. "
                "The prompt should be direct and instructional."
            ),
        },
        {
            "role": "user",
            "content": f"Task: {task_description}\n\nGenerate a system prompt for this task.",
        },
    ]

    response = await acompletion(
        model=SYSTEM_PROMPT_GENERATION_MODEL,
        messages=messages,
        temperature=0.3,
    )

    return (response.choices[0].message.content or "").strip()


SYSTEM_PROMPT = await generate_system_prompt(TASK_DESCRIPTION)
print(f"Generated system prompt:\n\n{SYSTEM_PROMPT}")


Generated system prompt:

System Prompt:
You are a precise grammar and spelling checker.  
1. Read the user‚Äôs entire text exactly as given.  
2. Identify every grammar or spelling mistake.  
3. For each mistake, wrap the exact erroneous word(s) in <original></original> and the corrected word(s) in <corrected></corrected>.  
4. Output the full text with these tags in place; do not add explanations or extra commentary.


# Rollout function

In [18]:
class TaskInput(BaseModel):
    step: int
    input_text: str


@weave.op
async def rollout(model: art.Model, task_input: TaskInput) -> art.Trajectory:
    traj = art.Trajectory(
        reward=0.0,
        messages_and_choices=[],
        metadata={"step": task_input.step, "input": task_input.input_text},
    )

    traj.messages_and_choices = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": task_input.input_text},
    ]

    litellm_model_name = f"hosted_vllm/{model.name}" if model.trainable else model.name

    response = await acompletion(
        model=litellm_model_name,
        base_url=model.inference_base_url,
        api_key=model.inference_api_key,
        temperature=0.7,
        messages=traj.messages(),
        caching=False,
    )

    traj.messages_and_choices.append(convert_litellm_choice_to_openai(response.choices[0]))
    return traj


print("\nRollout function defined!")



Rollout function defined!


# Judge demo

In [19]:
test_input = "After the last deploy the login page keeps refreshing forever on Chrome 120 on macOS. Works on Firefox. Started yesterday."
base_messages = [
    {"role": "system", "content": TASK_DESCRIPTION},
    {"role": "user", "content": test_input},
]

good_trajectory = art.Trajectory(
    messages_and_choices=[
        *base_messages,
        {
            "role": "assistant",
            "content": (
                "SUMMARY: Login page stuck in redirect loop on Chrome 120 after latest deploy\n"
                "PRIORITY: High\n"
                "STEPS TO REPRODUCE:\n"
                "1. Open Chrome 120 on macOS\n"
                "2. Navigate to /login\n"
                "3. Enter valid credentials and submit\n"
                "4. Observe repeated redirects/refresh and no successful login\n"
                "EXPECTED RESULT: User is authenticated and redirected to the dashboard\n"
                "ACTUAL RESULT: Page redirects/refreshes indefinitely and user cannot log in\n"
                "ENVIRONMENT: Chrome 120, macOS, regression after yesterday's deploy"
            ),
        },
    ],
    reward=0,
)

mediocre_trajectory = art.Trajectory(
    messages_and_choices=[
        *base_messages,
        {
            "role": "assistant",
            "content": (
                "SUMMARY: Login not working on Chrome\n"
                "PRIORITY: Medium\n"
                "STEPS TO REPRODUCE:\n"
                "1. Open Chrome\n"
                "2. Try to login\n"
                "EXPECTED RESULT: Login works\n"
                "ACTUAL RESULT: Login fails\n"
                "ENVIRONMENT: Chrome"
            ),
        },
    ],
    reward=0,
)

bad_trajectory = art.Trajectory(
    messages_and_choices=[*base_messages, {"role": "assistant", "content": "Try clearing cache."}],
    reward=0,
)

sample_group = art.TrajectoryGroup(trajectories=[good_trajectory, mediocre_trajectory, bad_trajectory])

judged_demo = await robust_score_group(
    sample_group,
    judge_model=RULER_MODEL,
    task_description=TASK_DESCRIPTION,
)

sorted_demo = sorted(judged_demo.trajectories, key=lambda t: t.reward, reverse=True)
for rank, traj in enumerate(sorted_demo, 1):
    msg = traj.messages()[-1]["content"]
    print(f"\nDemo Rank {rank}: Score {traj.reward:.3f}")
    print(f"  Response: {msg[:220]}{'...' if len(msg) > 220 else ''}")




Demo Rank 1: Score 0.000
  Response: SUMMARY: Login page stuck in redirect loop on Chrome 120 after latest deploy
PRIORITY: High
STEPS TO REPRODUCE:
1. Open Chrome 120 on macOS
2. Navigate to /login
3. Enter valid credentials and submit
4. Observe repeated ...

Demo Rank 2: Score 0.000
  Response: SUMMARY: Login not working on Chrome
PRIORITY: Medium
STEPS TO REPRODUCE:
1. Open Chrome
2. Try to login
EXPECTED RESULT: Login works
ACTUAL RESULT: Login fails
ENVIRONMENT: Chrome

Demo Rank 3: Score 0.000
  Response: Try clearing cache.


# Training loop

In [20]:
training_task_inputs = [TaskInput(step=0, input_text=inp) for inp in training_inputs]

training_iterator = iterate_dataset(
    training_task_inputs,
    groups_per_step=TRAINING_CONFIG["groups_per_step"],
    num_epochs=TRAINING_CONFIG["num_epochs"],
    initial_step=await model.get_step(),
)

print(f"\nStarting training with {len(training_task_inputs)} inputs...")
print(f"Training for {TRAINING_CONFIG['num_epochs']} epoch(s)")
print(f"Groups per step: {TRAINING_CONFIG['groups_per_step']}")
print(f"Rollouts per group: {TRAINING_CONFIG['rollouts_per_group']}")

for batch in training_iterator:
    print(f"\nTraining step {batch.step}, epoch {batch.epoch}, epoch step {batch.epoch_step}")
    print(f"Batch contains {len(batch.items)} inputs")

    groups = []
    for task_input in batch.items:
        task_input.step = batch.step
        groups.append(
            art.TrajectoryGroup(
                (rollout(model, task_input) for _ in range(TRAINING_CONFIG["rollouts_per_group"]))
            )
        )

    finished_groups = await art.gather_trajectory_groups(
        groups,
        pbar_desc="Generating responses",
        max_exceptions=TRAINING_CONFIG["rollouts_per_group"] * len(batch.items),
    )

    judged_groups = []
    for group in finished_groups:
        judged = None
        for _ in range(10):
            try:
                judged = await robust_score_group(
                    group,
                    judge_model=RULER_MODEL,
                    task_description=TASK_DESCRIPTION,
                )
                break
            except Exception as e:
                print(f"Error scoring group: {e}")

        if judged is None:
            raise RuntimeError("Scoring failed after retries; cannot continue training.")

        judged_groups.append(judged)

    await model.delete_checkpoints()
    await model.train(
        judged_groups,
        config=art.TrainConfig(learning_rate=TRAINING_CONFIG["learning_rate"]),
        _config={"logprob_calculation_chunk_size": 8},
    )

    print(f"Completed training step {batch.step}")

    if TRAINING_CONFIG["max_training_steps"] and batch.step >= TRAINING_CONFIG["max_training_steps"]:
        print(f"Reached maximum training steps ({TRAINING_CONFIG['max_training_steps']})")
        break

print("\n‚úÖ Training completed!")



Starting training with 25 inputs...
Training for 3 epoch(s)
Groups per step: 1
Rollouts per group: 2


Iterating dataset:   0%|          | 0/75 [00:00<?, ?batch/s]


Training step 0, epoch 0, epoch step 0
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]

"./.art/auto-rl/models/jira-model-001/history.jsonl" not found


Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).



tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 0 to 1 (no training occurred)
Completed training step 0

Training step 1, epoch 0, epoch step 1
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 1 to 2 (no training occurred)
Completed training step 1

Training step 2, epoch 0, epoch step 2
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 2 to 3 (no training occurred)
Completed training step 2

Training step 3, epoch 0, epoch step 3
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 3 to 4 (no training occurred)
Completed training step 3

Training step 4, epoch 0, epoch step 4
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 4 to 5 (no training occurred)
Completed training step 4

Training step 5, epoch 0, epoch step 5
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 5 to 6 (no training occurred)
Completed training step 5

Training step 6, epoch 0, epoch step 6
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 6 to 7 (no training occurred)
Completed training step 6

Training step 7, epoch 0, epoch step 7
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 7 to 8 (no training occurred)
Completed training step 7

Training step 8, epoch 0, epoch step 8
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 8 to 9 (no training occurred)
Completed training step 8

Training step 9, epoch 0, epoch step 9
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]

No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 9 to 10 (no training occurred)
Completed training step 9

Training step 10, epoch 0, epoch step 10
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 10 to 11 (no training occurred)
Completed training step 10

Training step 11, epoch 0, epoch step 11
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 11 to 12 (no training occurred)
Completed training step 11

Training step 12, epoch 0, epoch step 12
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 12 to 13 (no training occurred)
Completed training step 12

Training step 13, epoch 0, epoch step 13
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 13 to 14 (no training occurred)
Completed training step 13

Training step 14, epoch 0, epoch step 14
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 14 to 15 (no training occurred)
Completed training step 14

Training step 15, epoch 0, epoch step 15
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 15 to 16 (no training occurred)
Completed training step 15

Training step 16, epoch 0, epoch step 16
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 16 to 17 (no training occurred)
Completed training step 16

Training step 17, epoch 0, epoch step 17
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 17 to 18 (no training occurred)
Completed training step 17

Training step 18, epoch 0, epoch step 18
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 18 to 19 (no training occurred)
Completed training step 18

Training step 19, epoch 0, epoch step 19
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 19 to 20 (no training occurred)
Completed training step 19

Training step 20, epoch 0, epoch step 20
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 20 to 21 (no training occurred)
Completed training step 20

Training step 21, epoch 0, epoch step 21
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 21 to 22 (no training occurred)
Completed training step 21

Training step 22, epoch 0, epoch step 22
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 22 to 23 (no training occurred)
Completed training step 22

Training step 23, epoch 0, epoch step 23
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 23 to 24 (no training occurred)
Completed training step 23

Training step 24, epoch 0, epoch step 24
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 24 to 25 (no training occurred)
Completed training step 24

Training step 25, epoch 1, epoch step 0
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 25 to 26 (no training occurred)
Completed training step 25

Training step 26, epoch 1, epoch step 1
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 26 to 27 (no training occurred)
Completed training step 26

Training step 27, epoch 1, epoch step 2
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 27 to 28 (no training occurred)
Completed training step 27

Training step 28, epoch 1, epoch step 3
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 28 to 29 (no training occurred)
Completed training step 28

Training step 29, epoch 1, epoch step 4
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 29 to 30 (no training occurred)
Completed training step 29

Training step 30, epoch 1, epoch step 5
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 30 to 31 (no training occurred)
Completed training step 30

Training step 31, epoch 1, epoch step 6
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 31 to 32 (no training occurred)
Completed training step 31

Training step 32, epoch 1, epoch step 7
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 32 to 33 (no training occurred)
Completed training step 32

Training step 33, epoch 1, epoch step 8
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 33 to 34 (no training occurred)
Completed training step 33

Training step 34, epoch 1, epoch step 9
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 34 to 35 (no training occurred)
Completed training step 34

Training step 35, epoch 1, epoch step 10
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 35 to 36 (no training occurred)
Completed training step 35

Training step 36, epoch 1, epoch step 11
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 36 to 37 (no training occurred)
Completed training step 36

Training step 37, epoch 1, epoch step 12
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 37 to 38 (no training occurred)
Completed training step 37

Training step 38, epoch 1, epoch step 13
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 38 to 39 (no training occurred)
Completed training step 38

Training step 39, epoch 1, epoch step 14
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 39 to 40 (no training occurred)
Completed training step 39

Training step 40, epoch 1, epoch step 15
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 40 to 41 (no training occurred)
Completed training step 40

Training step 41, epoch 1, epoch step 16
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 41 to 42 (no training occurred)
Completed training step 41

Training step 42, epoch 1, epoch step 17
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 42 to 43 (no training occurred)
Completed training step 42

Training step 43, epoch 1, epoch step 18
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 43 to 44 (no training occurred)
Completed training step 43

Training step 44, epoch 1, epoch step 19
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 44 to 45 (no training occurred)
Completed training step 44

Training step 45, epoch 1, epoch step 20
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 45 to 46 (no training occurred)
Completed training step 45

Training step 46, epoch 1, epoch step 21
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 46 to 47 (no training occurred)
Completed training step 46

Training step 47, epoch 1, epoch step 22
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 47 to 48 (no training occurred)
Completed training step 47

Training step 48, epoch 1, epoch step 23
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 48 to 49 (no training occurred)
Completed training step 48

Training step 49, epoch 1, epoch step 24
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 49 to 50 (no training occurred)
Completed training step 49

Training step 50, epoch 2, epoch step 0
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 50 to 51 (no training occurred)
Completed training step 50

Training step 51, epoch 2, epoch step 1
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 51 to 52 (no training occurred)
Completed training step 51

Training step 52, epoch 2, epoch step 2
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 52 to 53 (no training occurred)
Completed training step 52

Training step 53, epoch 2, epoch step 3
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 53 to 54 (no training occurred)
Completed training step 53

Training step 54, epoch 2, epoch step 4
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 54 to 55 (no training occurred)
Completed training step 54

Training step 55, epoch 2, epoch step 5
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 55 to 56 (no training occurred)
Completed training step 55

Training step 56, epoch 2, epoch step 6
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 56 to 57 (no training occurred)
Completed training step 56

Training step 57, epoch 2, epoch step 7
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 57 to 58 (no training occurred)
Completed training step 57

Training step 58, epoch 2, epoch step 8
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 58 to 59 (no training occurred)
Completed training step 58

Training step 59, epoch 2, epoch step 9
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 59 to 60 (no training occurred)
Completed training step 59

Training step 60, epoch 2, epoch step 10
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 60 to 61 (no training occurred)
Completed training step 60

Training step 61, epoch 2, epoch step 11
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 61 to 62 (no training occurred)
Completed training step 61

Training step 62, epoch 2, epoch step 12
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 62 to 63 (no training occurred)
Completed training step 62

Training step 63, epoch 2, epoch step 13
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 63 to 64 (no training occurred)
Completed training step 63

Training step 64, epoch 2, epoch step 14
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 64 to 65 (no training occurred)
Completed training step 64

Training step 65, epoch 2, epoch step 15
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 65 to 66 (no training occurred)
Completed training step 65

Training step 66, epoch 2, epoch step 16
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 66 to 67 (no training occurred)
Completed training step 66

Training step 67, epoch 2, epoch step 17
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 67 to 68 (no training occurred)
Completed training step 67

Training step 68, epoch 2, epoch step 18
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 68 to 69 (no training occurred)
Completed training step 68

Training step 69, epoch 2, epoch step 19
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 69 to 70 (no training occurred)
Completed training step 69

Training step 70, epoch 2, epoch step 20
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 70 to 71 (no training occurred)
Completed training step 70

Training step 71, epoch 2, epoch step 21
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 71 to 72 (no training occurred)
Completed training step 71

Training step 72, epoch 2, epoch step 22
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 72 to 73 (no training occurred)
Completed training step 72

Training step 73, epoch 2, epoch step 23
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 73 to 74 (no training occurred)
Completed training step 73

Training step 74, epoch 2, epoch step 24
Batch contains 1 inputs


Generating responses:   0%|          | 0/2 [00:00<?, ?it/s]



No "val/reward" metric found in history
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.
Advanced step from 74 to 75 (no training occurred)
Completed training step 74

‚úÖ Training completed!


# Test Your Model

In [21]:


# Generate test inputs
print("Generating test inputs...")
test_inputs = await generate_training_inputs(
    TASK_DESCRIPTION, num_examples=NUM_TEST_INPUTS
)

print(f"\nüß™ Testing the trained model on {len(test_inputs)} new inputs:\n")
print("=" * 80)

for i, test_input in enumerate(test_inputs):
    print(f"\nTest {i + 1}:")
    print(f"Input: {test_input}")

    # Run the model
    test_task_input = TaskInput(step=999, input_text=test_input)
    result_trajectory = await rollout(model, test_task_input)

    # Extract the model's response
    messages = result_trajectory.messages()
    model_response = messages[-1]["content"] if messages else "No response"

    print(f"Model output: {model_response}")
    print("-" * 80)

print("\nüéâ Testing completed!")
print(f"\nYour model '{MODEL_NAME}' has been trained to: {TASK_DESCRIPTION}")
print("\nTo use this model in production:")
print("1. The model checkpoint is saved in ./.art/")
print("2. You can load it using the vLLM library")
print(
    "3. Or continue training with more examples by adjusting the configuration at the top"
)

Generating test inputs...
Generating training inputs, attempt 1, remaining 5...

üß™ Testing the trained model on 5 new inputs:


Test 1:
Input: The cat sleep on the couch all days without moveing.
Model output: The <original>cat</original> <original>sleep</original> on the couch all <original>days</original> without <original>moveing</original>.
--------------------------------------------------------------------------------

Test 2:
Input: She don't likes when peoples interrupt her during work.
Model output: <original>don't</original><corrected>doesn't</corrected> likes when <original>peoples</original><corrected>people's</corrected> interrupt her during work.
--------------------------------------------------------------------------------

Test 3:
Input: We was suppose to meet at the libary yesterday but I forgot my keys.
Model output: We <was> suppose to meet at the <libary> yesterday but I forgot my keys.
---------------------------------------------------------------------------

# Upload to Hugging Face ü§ó

In [22]:
# @title

# Adapted from Unsloth Notebooks (https://github.com/unslothai/notebooks), licensed under GNU LGPL v3.0.
# See THIRD-PARTY-NOTICES and licenses/LGPL-3.0.txt for details.

lora_model_path = (
    f".art/{model.project}/models/{model.name}/checkpoints/{await model.get_step():04d}"
)

peft_model, peft_tokenizer = FastLanguageModel.from_pretrained(
    model_name=lora_model_path,
    max_seq_length=16384,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)



Please restructure your imports with 'import unsloth' at the top of your file.
  from unsloth import FastLanguageModel



ü¶• Unsloth: Will patch your computer to enable 2x faster free finetuning.
ü¶• Unsloth Zoo will now patch everything to make training faster!


Flax classes are deprecated and will be removed in Diffusers v1.0.0. We recommend migrating to PyTorch classes or pinning your version of Diffusers.
Flax classes are deprecated and will be removed in Diffusers v1.0.0. We recommend migrating to PyTorch classes or pinning your version of Diffusers.
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


==((====))==  Unsloth 2025.8.6: Fast Qwen3 patching. Transformers: 4.53.2. vLLM: 0.10.0.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.3.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.31. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.8.6 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.


In [None]:
HF_ACCOUNT = "your_HF_account"
HF_TOKEN = userdata.get("HF_TOKEN")
assert HF_TOKEN and HF_TOKEN.startswith("hf_"), "HF_TOKEN missing from Colab Secrets"

login(token=HF_TOKEN, add_to_git_credential=False)
print(whoami(token=HF_TOKEN))

safe_name = model.name.replace("/", "-")
repo_id = f"{HF_ACCOUNT}/{safe_name}"
create_repo(repo_id, token=HF_TOKEN, exist_ok=True)

peft_model.push_to_hub_merged(repo_id, peft_tokenizer, token=HF_TOKEN)


# Next Steps

Congrats! üéâüöÄ You've trained your own custom model using just:

Here is a rephrased version that explicitly highlights **varying and refining task descriptions** as a first-class improvement lever, while keeping the tone instructional and clean.

* A task description
* Example inputs (no outputs required)
* RULER's automatic evaluation

To further improve performance, you can iterate along several dimensions:

1. **Multiple task descriptions**
   Introduce alternative or complementary task descriptions that emphasize different aspects of ‚Äúgood‚Äù behavior. This helps RULER generalize across interpretations of the task rather than overfitting to a single phrasing.

2. **More diverse inputs**
   Generate a broader and more varied set of input examples to cover edge cases and realistic usage patterns.

3. **Longer training**
   Increase the number of training steps to allow the policy to stabilize and converge.

4. **More comparisons**
   Increase `rollouts_per_group` to give RULER richer comparative signals when ranking candidate behaviors.

5. **Task refinement**
   Make task descriptions more precise and explicit about priorities, constraints, and trade-offs.

Remember: RULER learns what ‚Äúgood‚Äù means entirely from your task descriptions and relative comparisons‚Äîno labeled outputs are required.

For more info see the [ART documentation](https://art.openpipe.ai).
