To train this agent, click *Runtime* and press *Run all*. Make sure you've enabled a free Tesla T4 GPU!
<div class="align-center">
<a href="https://github.com/openpipe/art"><img src="https://github.com/openpipe/art/raw/main/assets/ART_pill.png" height="50"></a>
<a href="https://discord.com/invite/dnseNZuQ"><img src="https://github.com/openpipe/art/raw/main/assets/Discord_pill.png" height="50"></a>
<a href="https://openpipe.ai/blog/art-trainer-a-new-rl-trainer-for-agents"><img src="https://github.com/openpipe/art/raw/main/assets/Launch_pill.png" height="50"></a>

Questions? Join the Discord and ask away! For feature requests or to leave a star, visit our [Github](https://github.com/openpipe/art).
</div>

<a href="https://art.openpipe.ai/"><img src="https://github.com/openpipe/art/raw/main/assets/Header_separator.png" height="5"></a>

This notebook shows how to train a Qwen 2.5 3B model to play 2048. It demonstrates how to set up a multi-turn agent, how to train it, and how to evaluate it.

Metrics will be logged to Weights & Biases. Completions can optionally be logged to OpenPipe as well.

 
You will learn how to construct an [agentic environment](#Environment), how to define a [rollout](#Rollout), and how to run a [training loop](#Loop).

In [1]:
!pip install "numpy<2.0.0"


### WARNING:
If you are running in Google Colab and installing numpy does not say "Requirement already satisfied: numpy<2.0.0" then click "Runtime" and "Restart Session."

In [2]:
# make sure we're using numpy 1.*.*
import numpy as np

if (np.__version__).startswith("1."):
    print("Numpy version is 1.*.*, you're good to go!")
else:
    raise ValueError("Please restart your runtime using the above instructions!")

Numpy version is 1.*.*, you're good to go!


### Environment Variables

Later on in the notebook, we'll be creating a model that automatically logs metrics to Weights & Biases. In order to do so, you'll need to provide your Weights & Biases API key as an environment variable.

You can also optionally initialize an OpenPipe client to report completions to a [dashboard](https://app.openpipe.ai), so you can see what your model's completions look like and how they change over time. Logging to OpenPipe is free, but is not required for training!

In [3]:
import os


# Required
WANDB_API_KEY=""
if WANDB_API_KEY:
    os.environ["WANDB_API_KEY"] = WANDB_API_KEY

# Optional
OPENPIPE_API_KEY=""
if OPENPIPE_API_KEY:
    os.environ["OPENPIPE_API_KEY"] = OPENPIPE_API_KEY

# Make sure a WANDB_API_KEY is set
assert os.getenv("WANDB_API_KEY") is not None, "Please set a WANDB_API_KEY environment variable either here, in your .env file, or in your Colab environment variables"


### Installation

In [4]:
%%capture
!uv pip install openpipe-art openpipe --prerelease allow

### Agentic Environment
<a name="Environment"></a>

ART allows your agent to learn by interacting with its environment. In this example, we'll create an environment in which the agent can play 2048.

Feel free to read as much or as little of this section's code as you'd like. The important thing to understand is that we're defining the rules of this agent's environment. In many cases, this will already be defined by the task you're trying to solve, but if you need to define a custom environment, this is how you do it.

NOTE: To avoid OOM errors on a T4, we're reducing the winning value from 2048 to 256, which in turn reduces the minimum number of moves to win from 939 to 117.

In [19]:
from dotenv import load_dotenv
import random
from typing import TypedDict
from typing import Literal
import string
import xml.etree.ElementTree as ET

load_dotenv()

WINNING_VALUE = 256

# Class that keeps track of state for a single game of 2048
class TwentyFortyEightGame(TypedDict):
    id: str
    board: list[list[int | None]]

# Randomly populates a cell on the board with a 2 or 4
def populate_random_cell(game: TwentyFortyEightGame) -> None:
    all_clear_coordinates = [
        (i, j)
        for i in range(len(game["board"]))
        for j in range(len(game["board"][i]))
        if game["board"][i][j] is None
    ]
    random_clear_coordinates = random.choice(all_clear_coordinates)
    # 90% chance to populate a 2, 10% chance to populate a 4
    game["board"][random_clear_coordinates[0]][random_clear_coordinates[1]] = (
        2 if random.random() < 0.9 else 4
    )

# Generates a new game of 2048
def generate_game(board_length: int = 4) -> TwentyFortyEightGame:
    # random 6 character string
    id = "".join(random.choices(string.ascii_letters + string.digits, k=6))
    game = {
        "id": id,
        "board": [[None for _ in range(board_length)] for _ in range(board_length)],
    }

    # populate two random cells
    populate_random_cell(game)
    populate_random_cell(game)

    return game

# Renders the board in a human-readable format
def render_board(game: TwentyFortyEightGame) -> str:
    board = game["board"]
    # print something like this:
    # _    | 2    | _    | 4
    # 4    | 8    | 2    | 16
    # 16   | 32   | 64   | 128
    # _    | 2    | 2    | 4
    # where _ is an empty cell

    max_cell_width = max(
        [len(str(cell)) for row in board for cell in row if cell is not None]
    )

    board_str = ""
    for row in board:
        # pad the cells with spaces to make them the same width
        board_str += "|".join(
            [
                str(cell).rjust(max_cell_width)
                if cell is not None
                else "_".rjust(max_cell_width)
                for cell in row
            ]
        )
        board_str += "\n"
    return board_str


# condense, privileging matches at the start of the sequence
# sequences should be passed starting with cells that are the furthest in the direction in which the board is being condensed
def condense_sequence(sequence: list[int | None]) -> list[int | None]:
    condensed_sequence = []

    gapless_sequence = [cell for cell in sequence if cell is not None]

    i = 0
    while i < len(gapless_sequence):
        if (
            i + 1 < len(gapless_sequence)
            and gapless_sequence[i] == gapless_sequence[i + 1]
        ):
            condensed_sequence.append(gapless_sequence[i] * 2)
            i += 2
        else:
            condensed_sequence.append(gapless_sequence[i])
            i += 1

    # pad the sequence with None so it keeps its original length
    return condensed_sequence + [None] * (len(sequence) - len(condensed_sequence))

# Condenses the board in a given direction
def condense_board(
    game: TwentyFortyEightGame, direction: Literal["left", "right", "up", "down"]
) -> None:
    if direction == "left":
        for row in game["board"]:
            condensed_row = condense_sequence(row)
            for i in range(len(row)):
                row[i] = condensed_row[i]

    if direction == "right":
        for row in game["board"]:
            reversed_row = row[::-1]
            # reverse the row before and after condensing
            condensed_row = condense_sequence(reversed_row)[::-1]
            for i in range(len(row)):
                row[i] = condensed_row[i]

    if direction == "up":
        for col_index in range(len(game["board"][0])):
            column = [row[col_index] for row in game["board"]]

            condensed_column = condense_sequence(column)
            for row_index in range(len(column)):
                game["board"][row_index][col_index] = condensed_column[row_index]

    if direction == "down":
        for col_index in range(len(game["board"][0])):
            column = [row[col_index] for row in game["board"]]
            reversed_column = column[::-1]
            condensed_column = condense_sequence(reversed_column)[::-1]
            for row_index in range(len(column)):
                game["board"][row_index][col_index] = condensed_column[row_index]


# Applies an agent move to the game board
def apply_agent_move(game: TwentyFortyEightGame, move_xml: str) -> None:
    direction = None
    # parse the move
    try:
        root = ET.fromstring(move_xml)
        direction = root.text
    except Exception as e:
        raise ValueError("Invalid xml")

    if direction not in ["left", "right", "up", "down"]:
        raise ValueError("Invalid direction")

    condense_board(game, direction)

    populate_random_cell(game)

# Returns the maximum cell value on the board
def max_cell_value(game: TwentyFortyEightGame) -> int:
    return max([cell for row in game["board"] for cell in row if cell is not None])

# Returns True if the game is finished
def check_game_finished(game: TwentyFortyEightGame) -> bool:
    if max_cell_value(game) >= WINNING_VALUE:
        return True

    # check if any cell is empty
    if any(cell is None for row in game["board"] for cell in row):
        return False

    return True

# Returns the sum of all the cell values on the board
def total_board_value(game: TwentyFortyEightGame) -> int:
    return sum([cell for row in game["board"] for cell in row if cell is not None])
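
The merge rule is the part of the environment that is easiest to get wrong, so here is a minimal standalone sketch of it. `merge_line` is a hypothetical name used only for this illustration. It mirrors the behavior of `condense_sequence` above: gaps are dropped, each equal adjacent pair merges once, and the result is padded back to the original length.

```python
from typing import Optional


def merge_line(cells: list[Optional[int]]) -> list[Optional[int]]:
    """Slide tiles toward index 0, merging each equal adjacent pair once."""
    tiles = [c for c in cells if c is not None]  # drop the gaps
    merged: list[Optional[int]] = []
    i = 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)  # a pair merges only once per move
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    # pad with empty cells so the line keeps its original length
    return merged + [None] * (len(cells) - len(merged))


print(merge_line([2, 2, 2, None]))  # [4, 2, None, None]
print(merge_line([4, 4, 4, 4]))     # [8, 8, None, None]
# Moving right: reverse the row, merge, then reverse it back.
print(merge_line([2, None, 2, 4][::-1])[::-1])  # [None, None, 4, 4]
```

Note that `[4, 4, 4, 4]` becomes two 8s rather than a single 16: tiles created by a merge do not merge again in the same move.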


### Creating a Model

Now that we've defined the rules of our environment, we can create a model that will learn to play 2048. We'll use a Qwen 2.5 3B model for this example. The `name` parameter will be associated with a wandb run, and the `base_model` parameter is the model that we'll be training a LoRA on top of.

In [6]:
import art
from dotenv import load_dotenv
import os
from openpipe.client import OpenPipe
import random


load_dotenv()

random.seed(42)

# Uncomment this line and set your WANDB_API_KEY environment variable
# or add it to your .env file or Colab environment variables
# os.environ["WANDB_API_KEY"] = "YOUR_WANDB_API_KEY"
assert (
    os.getenv("WANDB_API_KEY") is not None
    and os.getenv("WANDB_API_KEY") != "YOUR_WANDB_API_KEY"
), "You need to set your WANDB_API_KEY environment variable either here, in your .env file, or in your Colab environment variables"

# Initialize the server
api = art.LocalAPI(
    # Normally we don't want to run the server in-process, but for the output
    # to show up properly on Google Colab we'll enable this.
    in_process=True
)

# Declare the model
model = await api.get_or_create_model(
    name="001",
    project="2048-multi-turn",
    base_model="Qwen/Qwen2.5-3B-Instruct",
    # To run on a T4, we need to override some config defaults.
    _config=art.config.ModelConfig(
        init_args=art.config.InitArgs(
            max_seq_length=4096,
            enforce_eager=True,
            enable_sleep_mode=False,
        ),
        engine_args=art.config.EngineArgs(
            gpu_memory_utilization=0.57,
            num_scheduler_steps=1,
        ),
    ),
)

# Optional logging client
op_client = OpenPipe(
    api_key=os.getenv("OPENPIPE_API_KEY"),
)

### Defining a Rollout
<a name="Rollout"></a>

A rollout is a single episode of an agent performing its task. It generates one or more trajectories, which are lists of messages and choices.

In this example, the rollout function generates a game of 2048, and the agent plays it until the game is finished. It then returns a trajectory which contains all the `system` and `user` messages presented to the agent, as well as all the `choices` that the agent made.

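When the game is finished, a `reward` for the agent's performance is assigned to the trajectory: 2 if the agent reached the winning value, otherwise a log-scaled score based on the total value of all cells on the board. An invalid move ends the game immediately with a reward of -1.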

This rollout function will be called many times in parallel during each iteration of the training loop.
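
As a rough illustration of how the end-of-game reward is shaped, the sketch below isolates the arithmetic used in the rollout cell. `shaped_reward` is a hypothetical helper name, not part of ART, and the default `winning_value` simply mirrors `WINNING_VALUE` from the environment.

```python
import math


def shaped_reward(board_value: int, max_value: int, winning_value: int = 256) -> float:
    """Reward for a finished game (invalid moves are handled separately with -1)."""
    if max_value >= winning_value:
        return 2.0  # a win pays double the best possible losing score
    # Log-scale the total board value: 0 at a board value of 2,
    # 1 at a board value of winning_value * 16.
    return (math.log(board_value, 2) - 1) / (math.log(winning_value * 16, 2) - 1)


print(shaped_reward(2, 2))                 # 0.0
print(round(shaped_reward(32, 16), 3))     # 0.364
print(round(shaped_reward(300, 128), 3))   # 0.657
print(shaped_reward(600, 256))             # 2.0
```

Because the scale is logarithmic, doubling the board value always adds the same amount of reward (1/11 with the default winning value), so early progress is rewarded as much as late progress.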

In [7]:
import art
from art.utils.get_trajectory_messages import get_trajectory_messages
import openai
import time
import math
import requests


@art.retry(exceptions=(openai.LengthFinishReasonError, requests.ReadTimeout))
async def rollout(
    client: openai.AsyncOpenAI, iteration: int, is_validation: bool
) -> art.Trajectory:

    game = generate_game()

    move_number = 0

    trajectory = art.Trajectory(
        messages_and_choices=[
            {
                "role": "system",
                "content": "You are an excellent 2048 player. Always choose the move most likely to lead to combine cells to eventually reach the number 2048. Optional moves are 'left', 'right', 'up', 'down'. Return your move as an XML object with a single property 'move', like so: <move>left</move>",
            }
        ],
        reward=0,
        metrics={"test": 5},
    )

    while True:

        trajectory.messages_and_choices.append(
            {"role": "user", "content": render_board(game)}
        )

        requested_at = int(time.time() * 1000)
        messages = get_trajectory_messages(trajectory)

        async def get_completion():
            return await client.chat.completions.create(
                max_completion_tokens=128,
                messages=messages,
                model=model.name,
            )

        try:
            chat_completion = await get_completion()
            last_completion = chat_completion
        except openai.LengthFinishReasonError as e:
            raise e
        except Exception as e:
            print("caught exception generating chat completion", e)
            raise e

        try:
            if op_client.api_key:
                op_client.report(
                    requested_at=requested_at,
                    received_at=int(time.time() * 1000),
                    req_payload={
                        "model": model.name,
                        "messages": messages,
                        "metadata": {
                            "game_id": game["id"],
                            "notebook-id": "2048",
                            "iteration": str(iteration),
                            "validation": str(is_validation),
                            "move_number": str(move_number),
                        },
                    },
                    resp_payload=chat_completion,
                    status_code=200,
                )
        except Exception as e:
            print(f"Error reporting to OpenPipe: {e}")

        choice = chat_completion.choices[0]
        content = choice.message.content
        assert isinstance(content, str)
        trajectory.messages_and_choices.append(choice)

        try:
            apply_agent_move(game, content)
            move_number += 1
        except ValueError:
            trajectory.reward = -1
            break

        if check_game_finished(game):
            max_value = max_cell_value(game)
            board_value = total_board_value(game)
            trajectory.metrics["max_value"] = max_value
            trajectory.metrics["board_value"] = board_value

            if max_value < WINNING_VALUE:
                # scale board value logarithmically: 0 for a board value of 2, 1 for WINNING_VALUE * 16
                trajectory.reward = (math.log(board_value, 2) - 1) / (
                    math.log(WINNING_VALUE * 16, 2) - 1
                )
            else:
                # double reward if the agent wins
                trajectory.reward = 2
            break

    try:
        if op_client.api_key:
            op_client.update_log_metadata(
                filters=[
                    {
                        "field": "completionId",
                        "equals": last_completion.id,
                    }
                ],
                metadata={
                    "reward": str(trajectory.reward),
                    "reward_assigned": "true",
                },
            )
    except Exception as e:
        print(f"Error updating log metadata: {e}")

    return trajectory
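
The rollout ends with a reward of -1 whenever the model's reply cannot be turned into a move. The sketch below shows, in isolation, which replies pass and which fail under the same parsing approach as `apply_agent_move`. `parse_move` and `VALID_MOVES` are hypothetical names introduced only for this example.

```python
import xml.etree.ElementTree as ET

VALID_MOVES = {"left", "right", "up", "down"}


def parse_move(reply: str) -> str:
    """Return the direction in a reply like <move>left</move>, or raise ValueError."""
    try:
        direction = ET.fromstring(reply).text
    except ET.ParseError as exc:
        raise ValueError("Invalid xml") from exc
    if direction not in VALID_MOVES:
        raise ValueError("Invalid direction")
    return direction


for reply in ["<move>up</move>", "up", "I pick <move>up</move>", "<move>diagonal</move>"]:
    try:
        print(f"{reply!r} -> {parse_move(reply)}")
    except ValueError as err:
        print(f"{reply!r} -> rejected ({err})")
```

Any text outside the single XML element makes the reply invalid, so a model that explains its move before answering is penalized. The element name itself is not checked here, which matches the notebook's parser: only the text inside the root element matters.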

<a name="Loop"></a>
### Training Loop

The training loop is where the magic happens. For each of the 10 iterations defined below (fewer if the model resumes from a saved step), the rollout function will be called 18 times in parallel. This means that 18 games will be played at once. Each game will produce a trajectory, which will be used to update the model.

The `gather` step will wait for all of the trajectories to be generated, then it will delete all but the most recent checkpoint and train the model on the new trajectories.

Inference will be blocked until the training is complete.
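
The training log for this loop can report "Skipping tuning as there is no suitable data" when every trajectory in a group earns the same reward. The sketch below illustrates why, assuming a simple group-relative baseline (each reward minus the group mean). This is an assumption made for illustration, not necessarily the exact computation ART performs, and both function names are hypothetical.

```python
from statistics import mean


def group_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each trajectory relative to its group's mean reward (assumed baseline)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def has_training_signal(rewards: list[float], tol: float = 1e-9) -> bool:
    """True if at least one trajectory did better or worse than the group average."""
    return any(abs(a) > tol for a in group_advantages(rewards))


print(group_advantages([1.0, 2.0, 3.0]))       # [-1.0, 0.0, 1.0]
print(has_training_signal([-1.0] * 18))        # False: every game ended on an invalid move
print(has_training_signal([0.5, 0.6, -1.0]))   # True
```

Under this assumption, a group in which all 18 games fail the same way carries no information about which behavior to reinforce, which is consistent with the step being skipped.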



In [12]:
openai_client = await model.openai_client()
for i in range(await model.get_step(), 10):
    train_groups = await art.gather_trajectory_groups(
        (
            art.TrajectoryGroup(
                rollout(openai_client, i, is_validation=False) for _ in range(18)
            )
            for _ in range(1)
        ),
        pbar_desc="gather",
        max_exceptions=18,
    )
    await model.delete_checkpoints()
    await model.train(train_groups, config=art.TrainConfig(learning_rate=3e-5))

gather:   0%|          | 0/18 [00:00<?, ?it/s]

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33marctic_fly[0m ([33mbased-op[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


No "val/reward" metric found in history
Deleted checkpoint ./.art/2048-multi-turn/models/001/0002
Packed 18 trajectories into 5 sequences of length 4096


train:   0%|          | 0/5 [00:00<?, ?it/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 10,000,000 | Num Epochs = 3 | Total steps = 30,000,000
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 1 x 1) = 2
 "-____-"     Trainable parameters = 14,966,784/3,000,000,000 (0.50% trained)


Unsloth: Will smartly offload gradients to save VRAM!




gather:   0%|          | 0/18 [00:00<?, ?it/s]



No "val/reward" metric found in history
Deleted checkpoint ./.art/2048-multi-turn/models/001/0003
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.


gather:   0%|          | 0/18 [00:00<?, ?it/s]



No "val/reward" metric found in history
Packed 18 trajectories into 5 sequences of length 4096


train:   0%|          | 0/5 [00:00<?, ?it/s]

gather:   0%|          | 0/18 [00:00<?, ?it/s]



No "val/reward" metric found in history
Deleted checkpoint ./.art/2048-multi-turn/models/001/0004
Packed 18 trajectories into 6 sequences of length 4096


train:   0%|          | 0/6 [00:00<?, ?it/s]

gather:   0%|          | 0/18 [00:00<?, ?it/s]



No "val/reward" metric found in history
Deleted checkpoint ./.art/2048-multi-turn/models/001/0005
Packed 18 trajectories into 6 sequences of length 4096


train:   0%|          | 0/6 [00:00<?, ?it/s]

gather:   0%|          | 0/18 [00:00<?, ?it/s]



No "val/reward" metric found in history
Deleted checkpoint ./.art/2048-multi-turn/models/001/0006
Skipping tuning as there is no suitable data. This can happen when all the trajectories in the same group have the same reward and thus no advantage to train on.


gather:   0%|          | 0/18 [00:00<?, ?it/s]

No "val/reward" metric found in history
Packed 18 trajectories into 6 sequences of length 4096


train:   0%|          | 0/6 [00:00<?, ?it/s]



### Using the Model

Just like that, you've trained an agent to play 2048! Now it's time to use your model outside of ART, in the wild! The easiest way to do that is to load it from disk, where it was saved after each training iteration, and either run inference on it locally or upload it to a central hub like HuggingFace.
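
Checkpoints in this notebook appear on disk under `.art/<project>/models/<name>/<step>`, with the step zero-padded to four digits (for example `.art/2048-multi-turn/models/001/0003`). The sketch below builds such a path and finds the highest-numbered step directory. Both helper names are hypothetical, and the layout is inferred from the paths shown in this notebook rather than from a documented guarantee.

```python
import tempfile
from pathlib import Path
from typing import Optional


def checkpoint_dir(root: Path, project: str, name: str, step: int) -> Path:
    """Path of one checkpoint, assuming the layout seen in this notebook."""
    return root / project / "models" / name / f"{step:04d}"


def latest_checkpoint(root: Path, project: str, name: str) -> Optional[Path]:
    """Highest-numbered step directory for a model, or None if there is none."""
    model_dir = root / project / "models" / name
    if not model_dir.is_dir():
        return None
    steps = [p for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit()]
    return max(steps, key=lambda p: int(p.name), default=None)


print(checkpoint_dir(Path(".art"), "2048-multi-turn", "001", 3).as_posix())
# .art/2048-multi-turn/models/001/0003

# Demo against a throwaway directory instead of a real .art folder.
with tempfile.TemporaryDirectory() as tmp:
    for step in (2, 10, 3):
        checkpoint_dir(Path(tmp), "2048-multi-turn", "001", step).mkdir(parents=True)
    print(latest_checkpoint(Path(tmp), "2048-multi-turn", "001").name)  # 0010
```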

Check out the code below for a small demo of the model you just trained playing 2048!

In [None]:
import torch
from unsloth import FastLanguageModel


# example: .art/2048-multi-turn/models/001/0003
lora_model_path = f".art/{model.project}/models/{model.name}/{await model.get_step():04d}"

print(f"loading model from {lora_model_path}")

peft_model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = lora_model_path,
    max_seq_length = 16384,
    dtype = torch.float16,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(peft_model)

game = generate_game()
move_number = 0

messages = [
    {"role": "system", "content": "You are an excellent 2048 player. Always choose the move most likely to lead to combine cells to eventually reach the number 2048. Optional moves are 'left', 'right', 'up', 'down'. Return your move as an XML object with a single property 'move', like so: <move>left</move>"},
]

while not check_game_finished(game):
    rendered_board = render_board(game)
    messages.append({"role": "user", "content": rendered_board})


    inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")

    content = ""

    def get_completion() -> str:
        with torch.no_grad():
            outputs = peft_model.generate(
                input_ids=inputs,
                max_new_tokens=100,
                do_sample=True,
                temperature=0.7,
                top_p=0.9
            )
            return tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)

    try:
        content = get_completion()
    except Exception as e:
        print("caught exception generating chat completion", e)
        raise e

    messages.append({"role": "assistant", "content": content})
    

    try:
        apply_agent_move(game, content)
        move_number += 1
    except ValueError:
        raise ValueError(f"Invalid move on move {move_number}: {content}")

    # print the board every 10 moves
    if (move_number % 10 == 0):
        print(f"\nmove {move_number}")
        print(f"board:\n{rendered_board}")
        print(f"agent move: {content}")
        print(f"updated board:\n{render_board(game)}")

        


max_value = max_cell_value(game)
board_value = total_board_value(game)

print(f"game finished in {move_number} moves")

if max_value >= WINNING_VALUE:
    print("game won! 💪")
else:
    print("game lost! 😢")

print(f"final board:\n\n{render_board(game)}")

print(f"max value: {max_value}")
print(f"board value: {board_value}")




==((====))==  Unsloth 2025.3.19: Fast Qwen2 patching. Transformers: 4.51.1. vLLM: 0.7.3.
   \\   /|    NVIDIA H100 PCIe. Num GPUs = 1. Max memory: 79.097 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 9.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

move 10
board:
2|_|2|8
2|_|_|8
_|_|_|4
_|_|_|_

agent move: <move>left</move>
updated board:
4|8|_|_
2|8|_|2
4|_|_|_
_|_|_|_


move 20
board:
 2| 2|32| 2
 _| _| _| 4
 _| _| _| 4
 2| _| _| _

agent move: <move>down</move>
updated board:
 2| _| _| _
 _| _| _| _
 _| _| _| 2
 4| 2|32| 8


move 30
board:
 _| _| _| _
 _| 2| 8| 2
 8| 2|32| 2
 2| 2| 8| 2

agent move: <move>up</move>
updated board:
 8| 4| 8| 4
 2| 2|32| 2
 _| _| 8| _
 _| _| _| 2


move 40
board:
 4| 8| 8| _
16|32|16| _
 4| 2| 2| _
 2| _| _| _

agent move: <mo

