# Homework 2: DPO and PPO

In this homework we take a closer look at two very popular methods for aligning language models. In the first part you will implement DPO from scratch. In the second part we will use the TRL library to train with PPO.

Trained models can and should be uploaded to [🤗 HuggingFace](https://huggingface.co/). Register there, follow [deep vk](https://huggingface.co/deepvk), and create an API token.

Follow the notebook cells and fill in the missing ones. At the end of the notebook you will find starred tasks that earn the maximum score!


## Imports and helper functions

In [1]:
# Install the required additional libraries

%pip install --quiet datasets trl


In [2]:
%pip install --quiet sentencepiece


In [3]:
# Required imports (for both parts)
import inspect
import random
from functools import partial

import numpy as np

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence


import wandb
from datasets import load_dataset
from huggingface_hub import HfApi, interpreter_login

from tqdm.auto import tqdm
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    PreTrainedTokenizerBase
)
from trl import PPOConfig, PPOTrainer, RewardConfig, RewardTrainer




In [4]:
interpreter_login()




In [8]:
# Prepare the repository for the future model and tokenizer
username = HfApi().whoami()["name"]
REPO_NAME = f"{username}/llm-course-hw2"  # Or any name you like

print(f"Homework repository: '{REPO_NAME}'")


Homework repository: 'fridalex/llm-course-hw2'


In [9]:
def set_seed(seed=42):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)


# This function marks every place that needs to be filled in
# These can be whole functions or individual parts inside them
# You can always use introspection to find where this function is used :)
def todo():
    stack = inspect.stack()
    caller_frame = stack[1]
    function_name = caller_frame.function
    line_number = caller_frame.lineno
    raise NotImplementedError(f"TODO at {function_name}, line {line_number}")


def disable_dropout_in_model(model):
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.p = 0


# Part 1: DPO

A very simple method that caused a sensation when it appeared because it compared so favorably with PPO. Unlike PPO, which requires separately training a Reward Model and a Value Model and takes considerable implementation effort, DPO needs no explicit reward model, only a dataset of human preferences of the form: prompt, response chosen by a human, response rejected by a human. The simplicity is also visible in the loss, which is essentially the whole method:
$$
L_\text{DPO}(\pi_{\theta}; \pi_\text{ref}) = -E_{(x, y_w, y_l)\sim D}\left[\log \sigma \left(
\beta \log \frac{\pi_{\theta}(y_w\mid x)}{\pi_\text{ref}(y_w\mid x)} \thinspace
{- \beta \log \frac{\pi_{\theta}(y_l\mid x)}{\pi_\text{ref}(y_l\mid x)}}\right)\right]
$$

where:

- $\pi_{\theta}$ is the LLM we want to align
- $\pi_\text{ref}$ is the reference model used for regularization, usually just the initial checkpoint
- $D$ is the preference dataset
- $x$ is a prompt from the dataset $D$
- $y_w$ is the response to prompt $x$ chosen by a human (or by whoever labeled the preferences, which may be a large LLM)
- $y_l$ is the response to prompt $x$ rejected by a human (or by whoever labeled the preferences, which may be a large LLM)
- $\beta$ is a hyperparameter that controls how far we may move away from the reference model

While implementing, we recommend carefully reading the original paper: [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://arxiv.org/abs/2305.18290).

For fine-tuning we will use the [HuggingFaceTB/SmolLM-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM-135M-Instruct) model: it is small (it fits on Colab) yet capable enough for the effect of alignment to be visible. Moreover, this model has already gone through SFT, so unlike the base model (without Instruct) it understands the chat format (chat-template in transformers, covered below) and is "aware" of being a language assistant.

P.S. If you have access to compute such as an A100 or better, you can try fine-tuning a larger model from the same [family](https://huggingface.co/blog/smollm). Be careful to pick one with the Instruct suffix.

In [7]:
MODEL_ID = "HuggingFaceTB/SmolLM-135M-Instruct"
DATASET_ID = "HumanLLMs/Human-Like-DPO-Dataset"

## Data preparation [2 points]

First we need to prepare the data. As the preference dataset we will use [HumanLLMs/Human-Like-DPO-Dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset), which makes the model noticeably more emotional, increases its use of emoji, and generally loosens strict adherence to the "As a conversational AI, I ..." pattern.

Preparing the dataset takes a few simple steps:
1. Convert the data to the chat-template format
2. Apply the chat template with `tokenizer.apply_chat_template`
3. Tokenize the result, truncating the prompt and the responses to the required length if necessary.

Read the [chat-templates documentation](https://huggingface.co/docs/transformers/chat_templating) carefully. For convenience, the data is first converted to a higher-level format like this:
```python
messages = [
    {"role": "system", "content": "You are a helpful assistant focused on technical topics."},
    {"role": "user", "content": "Can you explain what a chat template is?"},
    {"role": "assistant", "content": "A chat template structures conversations between users and AI models..."}
]
```
That is, the conversation can include different roles, such as a system prompt, and the dialogue between the assistant and the human is structured as a whole. The model usually learns this format during SFT. This representation abstracts away the details (the concrete tokens) of how different models encode the format. To turn it into the actual text input in a model-specific format, use `tokenizer.apply_chat_template`.
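To make the step from the role-based list to model-specific text concrete, here is a minimal hand-written sketch of a ChatML-style renderer. It assumes the `<|im_start|>role\n...<|im_end|>\n` layout that appears in this notebook's outputs. `render_chatml` is a hypothetical helper for illustration only; real code should rely on `tokenizer.apply_chat_template`, which reads the template shipped with the model.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Hypothetical helper: render role/content messages in a ChatML-like layout."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open the assistant turn so the model continues with its reply
        text += "<|im_start|>assistant\n"
    return text


messages = [{"role": "user", "content": "Can you explain what a chat template is?"}]
prompt_text = render_chatml(messages, add_generation_prompt=True)
print(prompt_text)
```

The generation prompt belongs to the prompt text, not to the response, which is why the response strings later in the notebook start directly with the assistant's words.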

In [8]:
# needed for data preparation
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token


In [14]:
dataset = load_dataset(DATASET_ID, split="train")
dataset[0]


{'prompt': 'Oh, I just saw the best meme - have you seen it?',
 'chosen': "😂 Ah, no I haven't! I'm dying to know, what's the meme about? Is it a funny cat or a ridiculous situation? Spill the beans! 🤣",
 'rejected': "I'm an artificial intelligence language model, I don't have personal experiences or opinions. However, I can provide you with information on highly-rated and critically acclaimed films, as well as recommendations based on specific genres or themes. Would you like me to suggest some notable movies or discuss a particular genre of interest?"}

Convert the dataset to the chat format, where the prompt has the `user` role and the responses have the `assistant` role, then apply the chat template:

In [20]:
import json

def apply_chat_template(example: dict[str, str], tokenizer: PreTrainedTokenizerBase) -> dict[str, str]:
    """
    Transforms a dataset example into a formatted chat template using the provided tokenizer.

    Args:
        example (Dict[str, str]): A dictionary containing the following keys:
            - "prompt": The initial user prompt.
            - "chosen": The assistant's chosen response.
            - "rejected": The assistant's rejected response.
        tokenizer (PreTrainedTokenizerBase): An object that provides the `apply_chat_template` method
            for formatting the conversation.

    Returns:
        Dict[str, str]: A dictionary with the following keys:
            - "prompt": The formatted prompt string including the generation prompt.
            - "chosen": The formatted assistant's chosen response (with the prompt prefix removed).
            - "rejected": The formatted assistant's rejected response (with the prompt prefix removed).
    """
    # Wrap the raw prompt as a single user turn
    prompt_messages = [{"role": "user", "content": example["prompt"]}]
    # add_generation_prompt=True appends the "<|im_start|>assistant\n" prefix to the prompt
    prompt = tokenizer.apply_chat_template(prompt_messages, tokenize=False, add_generation_prompt=True)

    result = {"prompt": prompt}
    for key in ("chosen", "rejected"):
        messages = prompt_messages + [{"role": "assistant", "content": example[key]}]
        full_text = tokenizer.apply_chat_template(messages, tokenize=False)
        # Strip the prompt prefix so only the response (ending with "<|im_end|>\n") remains
        result[key] = full_text[len(prompt):]
    return result


In [11]:
apply_chat_template(dataset[0], tokenizer)


{'prompt': '<|im_start|>user\nOh, I just saw the best meme - have you seen it?<|im_end|>\n<|im_start|>assistant\n',
 'chosen': "😂 Ah, no I haven't! I'm dying to know, what's the meme about? Is it a funny cat or a ridiculous situation? Spill the beans! 🤣<|im_end|>\n",
 'rejected': "I'm an artificial intelligence language model, I don't have personal experiences or opinions. However, I can provide you with information on highly-rated and critically acclaimed films, as well as recommendations based on specific genres or themes. Would you like me to suggest some notable movies or discuss a particular genre of interest?<|im_end|>\n"}

In [12]:
dataset = dataset.map(apply_chat_template, fn_kwargs={"tokenizer": tokenizer})
dataset[0]


{'prompt': '<|im_start|>user\nOh, I just saw the best meme - have you seen it?<|im_end|>\n<|im_start|>assistant\n',
 'chosen': "😂 Ah, no I haven't! I'm dying to know, what's the meme about? Is it a funny cat or a ridiculous situation? Spill the beans! 🤣<|im_end|>\n",
 'rejected': "I'm an artificial intelligence language model, I don't have personal experiences or opinions. However, I can provide you with information on highly-rated and critically acclaimed films, as well as recommendations based on specific genres or themes. Would you like me to suggest some notable movies or discuss a particular genre of interest?<|im_end|>\n"}

After these two steps the data should look like this (**note the position of <|im_start|>assistant\n**, this is important!):
```
{
    'prompt': "<|im_start|>user\nOh, I just saw the best meme - have you seen it?<|im_end|>\n<|im_start|>assistant\n",
    'chosen': "😂 Ah, no I haven't! I'm dying to know, what's the meme about? Is it a funny cat or a ridiculous situation? Spill the beans! 🤣<|im_end|>\n",
    'rejected': "I'm an artificial intelligence language model, I don't have personal experiences or opinions. However, I can provide you with information on highly-rated and critically acclaimed films, as well as recommendations based on specific genres or themes. Would you like me to suggest some notable movies or discuss a particular genre of interest?<|im_end|>\n"
}
```

Tokenize the dataset with the tokenizer, truncating the length if necessary. Only token IDs should remain in the dataset:
```
Dataset({
    features: ['prompt_input_ids', 'chosen_input_ids', 'rejected_input_ids'],
    num_rows: 10884
})
```

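Truncate the prompt from the left, not from the end. Think about why this is better. **Write your answer**.

We truncate the prompt from the left because, with a fixed context window, the last tokens of the request matter more than the first ones: they contain the most recent user turn and the `<|im_start|>assistant\n` prefix that must directly precede the response. So we drop the beginning of the prompt and keep its end.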
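The difference between the two truncation directions can be sketched on plain lists. `truncate_pair` is a hypothetical helper, and the token IDs are toy values borrowed from the outputs in this notebook (the prompt ends with `1, 520, 9531, 198`, the IDs of `<|im_start|>assistant\n`).

```python
def truncate_pair(prompt_ids, completion_ids, max_prompt_length, max_completion_length=None):
    """Hypothetical helper: keep the END of the prompt and the START of the completion."""
    # A negative slice keeps the last `max_prompt_length` tokens
    prompt_ids = prompt_ids[-max_prompt_length:]
    if max_completion_length is not None:
        # A positive slice keeps the first `max_completion_length` tokens
        completion_ids = completion_ids[:max_completion_length]
    return prompt_ids, completion_ids


prompt_ids = [1, 4093, 198, 16912, 28, 2, 198, 1, 520, 9531, 198]
completion_ids = [10813, 242, 220, 2, 198]

kept_prompt, kept_completion = truncate_pair(prompt_ids, completion_ids, 4, 3)
print(kept_prompt)       # the assistant prefix survives
print(kept_completion)   # the response keeps its beginning
```

Truncating the prompt from the right instead would cut off the assistant prefix, so the response would no longer follow a properly opened assistant turn.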

In [18]:
def tokenize_row(
    example: dict[str, str],
    tokenizer: PreTrainedTokenizerBase,
    max_prompt_length: int = 512,
    max_completion_length: int | None = None,
) -> dict[str, list[int]]:
    """
    Tokenizes a single row of a dataset example for use in language model training or evaluation.

    This function processes an example containing textual fields for a prompt, a chosen response,
    and a rejected response. It tokenizes each text field using the provided tokenizer. If specified,
    it truncates the tokenized prompt to the last `max_prompt_length` tokens and the tokenized responses
    (chosen and rejected) to the first `max_completion_length` tokens.

    Args:
        example (dict[str, str]): A dictionary with the following keys:
            - "prompt": The initial prompt text.
            - "chosen": The assistant's chosen response.
            - "rejected": The assistant's rejected response.
        tokenizer (PreTrainedTokenizerBase): A tokenizer that converts text into token IDs. It must return a dictionary
            with the key "input_ids" when called.
        max_prompt_length (Optional[int], optional): Maximum number of tokens to retain for the prompt.
            The function keeps the last `max_prompt_length` tokens. Defaults to 512.
        max_completion_length (Optional[int], optional): Maximum number of tokens to retain for the completion
            responses (chosen and rejected). The function keeps the first `max_completion_length` tokens.
            If None, no truncation is applied. Defaults to None.

    Returns:
        dict[str, list[int]]: A dictionary containing:
            - "prompt_input_ids": The token IDs for the prompt, possibly truncated.
            - "chosen_input_ids": The token IDs for the chosen response, possibly truncated.
            - "rejected_input_ids": The token IDs for the rejected response, possibly truncated.
    """
    
    # The chat template already inserted the special tokens, so do not add them again
    prompt_ids = tokenizer(example["prompt"], add_special_tokens=False)["input_ids"]
    chosen_ids = tokenizer(example["chosen"], add_special_tokens=False)["input_ids"]
    rejected_ids = tokenizer(example["rejected"], add_special_tokens=False)["input_ids"]

    # Keep the last tokens of the prompt: they hold the latest turn and the assistant prefix
    if max_prompt_length is not None:
        prompt_ids = prompt_ids[-max_prompt_length:]

    # Keep the first tokens of each response
    if max_completion_length is not None:
        chosen_ids = chosen_ids[:max_completion_length]
        rejected_ids = rejected_ids[:max_completion_length]

    return {
        "prompt_input_ids": prompt_ids,
        "chosen_input_ids": chosen_ids,
        "rejected_input_ids": rejected_ids,
    }


In [14]:
dataset = dataset.map(
    tokenize_row,
    fn_kwargs={
        "tokenizer": tokenizer,
        "max_prompt_length": 256,
        "max_completion_length": None,
    },
    remove_columns=["prompt", "chosen", "rejected"],
)

dataset[0]


{'prompt_input_ids': [1,
  4093,
  198,
  16912,
  28,
  339,
  915,
  3680,
  260,
  1450,
  1169,
  85,
  731,
  457,
  346,
  2269,
  357,
  47,
  2,
  198,
  1,
  520,
  9531,
  198],
 'chosen_input_ids': [10813,
  242,
  220,
  12947,
  28,
  787,
  339,
  8540,
  982,
  17,
  339,
  5248,
  11888,
  288,
  699,
  28,
  732,
  506,
  260,
  1169,
  85,
  563,
  47,
  1431,
  357,
  253,
  17025,
  2644,
  355,
  253,
  31404,
  3223,
  47,
  1691,
  388,
  260,
  9973,
  17,
  15107,
  114,
  113,
  2,
  198],
 'rejected_input_ids': [57,
  5248,
  354,
  6416,
  5290,
  1789,
  1743,
  28,
  339,
  1326,
  982,
  457,
  2143,
  2647,
  355,
  8428,
  30,
  1423,
  28,
  339,
  416,
  1538,
  346,
  351,
  1096,
  335,
  3452,
  29,
  3119,
  284,
  9603,
  32246,
  9411,
  28,
  347,
  876,
  347,
  7400,
  1552,
  335,
  1678,
  14009,
  355,
  5535,
  30,
  13651,
  346,
  702,
  549,
  288,
  1820,
  634,
  7703,
  10026,
  355,
  1692,
  253,
  1542,
  10265,
  282,
  1384,
  

Now we need to prepare the DataLoader. For this, write a custom `collate_fn` that does the following:
1. Accepts a list of examples with the keys `prompt_input_ids`, `chosen_input_ids`, `rejected_input_ids`.
2. Pads each key to the maximum length in the batch. As a result, `prompt_input_ids` and `chosen_input_ids` may have different lengths, which is fine. What matters is that lengths are consistent within each key.
3. For each key, creates a padding mask of the same shape, with 0 for padding tokens and 1 for sequence tokens.

For padding, additionally implement a `pad` function. Use `tokenizer.pad_token_id` as the padding token and 0 for the mask. **Again, think about which side is best for padding `prompt_input_ids`.**
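The padding and masking steps above can be sketched on plain Python lists. `pad_ids` is a hypothetical helper, and the IDs are toy values. The pad ID is deliberately equal to the EOS ID, as happens when `pad_token` is set to `eos_token`, to show why the mask is safer to build from sequence lengths than from a comparison with the pad ID.

```python
def pad_ids(sequences, padding_value, padding_side="right"):
    """Hypothetical helper: pad lists of token IDs and build matching 0/1 masks."""
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        filler = [padding_value] * (max_len - len(seq))
        ones, zeros = [1] * len(seq), [0] * len(filler)
        if padding_side == "left":
            padded.append(filler + seq)
            masks.append(zeros + ones)
        else:
            padded.append(seq + filler)
            masks.append(ones + zeros)
    return padded, masks


PAD = 2  # toy pad ID that coincides with the EOS ID
prompts = [[1, 5, 6, 2], [1, 7, 2]]

ids, mask = pad_ids(prompts, PAD, padding_side="left")
# A mask derived from `!= PAD` would wrongly hide the real EOS token at the end
naive_mask = [[int(token != PAD) for token in row] for row in ids]
print(ids, mask, naive_mask)
```

Left padding keeps the end of every prompt in the last column, so after concatenation each response starts right after its prompt with no padding in between.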

In [42]:
def pad(tensors: list[torch.Tensor], padding_value: int = 0, padding_side: str = "right") -> torch.Tensor:
    """
    Pads a list of tensors to the same size along their leading dimension.

    Args:
        tensors (list[torch.Tensor]): A list of tensors to be padded.
            All tensors in the list should be of the same type and device.
        padding_value (int, default=0): The value used to pad the tensors.
        padding_side (str, default="right"): Specifies which side of the tensor to apply padding: either 'left' or 'right'.

    Returns:
        torch.Tensor: A tensor containing all the padded tensors, [N; max_length]
            where N is the number of tensors and `max_length` is the shape of the largest tensor.
    """
    max_len = max(t.shape[0] for t in tensors)
    padded = []
    for t in tensors:
        # Create the filler with the same dtype and device as the tensor being padded
        filler = torch.full((max_len - t.shape[0],), padding_value, dtype=t.dtype, device=t.device)
        padded.append(torch.cat([filler, t]) if padding_side == "left" else torch.cat([t, filler]))

    return torch.stack(padded)


def pad_collate_fn(batch: list[dict[str, torch.Tensor]], pad_token_id: int) -> dict[str, torch.Tensor]:
    """
    Collates and pads a batch of tokenized examples for model input.

    This function takes a batch of examples where each example is a dictionary containing
    token IDs for the prompt, the chosen response, and the rejected response. For each field,
    it extracts the list of token IDs, creates a corresponding attention mask (with ones for each token),
    and then pads the sequences using a `pad` function. The prompt sequences and their attention masks
    are padded on the left, while the chosen and rejected sequences are padded on the right (default).

    Args:
        batch (list[dict[str, torch.Tensor]]): A list of dictionaries, where each dictionary has the keys:
            - "prompt_input_ids": Tensor of token IDs for the prompt.
            - "chosen_input_ids": Tensor of token IDs for the chosen response.
            - "rejected_input_ids": Tensor of token IDs for the rejected response.
        pad_token_id (int): Padding value for token IDs.

    Returns:
        dict[str, torch.Tensor]: A dictionary containing the following keys with padded tensors:
            - "prompt_input_ids": Padded token IDs for the prompt (padded on the left).
            - "prompt_attn_mask": Padded attention mask for the prompt (padded on the left, with 1s for actual tokens).
            - "chosen_input_ids": Padded token IDs for the chosen response.
            - "chosen_attn_mask": Padded attention mask for the chosen response.
            - "rejected_input_ids": Padded token IDs for the rejected response.
            - "rejected_attn_mask": Padded attention mask for the rejected response.
    """
    result = {}
    # Prompts are padded on the left so they stay adjacent to the completion after concatenation
    for key, side in (("prompt", "left"), ("chosen", "right"), ("rejected", "right")):
        input_ids = [example[f"{key}_input_ids"] for example in batch]
        result[f"{key}_input_ids"] = pad(input_ids, pad_token_id, side)
        # Build the mask from sequence lengths rather than `!= pad_token_id`:
        # pad_token is set to eos_token, so comparing IDs would also mask real EOS tokens
        result[f"{key}_attn_mask"] = pad([torch.ones_like(ids) for ids in input_ids], 0, side).int()

    return result

dataloader = DataLoader(
    dataset.with_format("torch"),
    batch_size=2,
    shuffle=True,
    collate_fn=partial(pad_collate_fn, pad_token_id=tokenizer.pad_token_id)
)


In [16]:
next(iter(dataloader))


{'prompt_input_ids': tensor([[   1, 4093,  198, 5519,  346, 2042, 3485,  253,  725, 5753,  355, 1789,
            47,    2,  198,    1,  520, 9531,  198],
         [   1, 4093,  198, 1780,  506,  469, 4932, 1114,  288,  919,  418, 1478,
            47,    2,  198,    1,  520, 9531,  198]]),
 'prompt_attn_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
        dtype=torch.int32),
 'chosen_input_ids': tensor([[22096,    17,   339,  3543,   761,   957,  3506,  2419,   282,  1380,
            725,  1495,    30,   339,  3543,  1811,   719, 19049,   411,  4575,
             28,   284,   339,  3543, 44263, 14101,   281,   253,  1443,   690,
            260,   929,    30,   339,  5655,   288,   835,  5071,   281,  4708,
             28,   564,  1303,   506,   915,  1643,   357,  3683,   982,  3869,
           6420, 40303,   220,    30,   339,  1441,    28,   339,   416,  1361,
           2915,   634,

## DPO Loss [5 points]

Let's start by implementing the loss function itself. It is fairly simple: follow the formula literally and everything will work out.

$$
L_\text{DPO}(\pi_{\theta}; \pi_\text{ref}) = -E_{(x, y_w, y_l)\sim D}\left[\log \sigma \left(
\beta \log \frac{\pi_{\theta}(y_w\mid x)}{\pi_\text{ref}(y_w\mid x)} \thinspace
{- \beta \log \frac{\pi_{\theta}(y_l\mid x)}{\pi_\text{ref}(y_l\mid x)}}\right)\right]
$$

where:

- $\pi_{\theta}$ is the LLM we want to align
- $\pi_\text{ref}$ is the reference model used for regularization, usually just the initial checkpoint
- $D$ is the preference dataset
- $x$ is a prompt from the dataset $D$
- $y_w$ is the response to prompt $x$ chosen by a human (or by whoever labeled the preferences, which may be a large LLM)
- $y_l$ is the response to prompt $x$ rejected by a human (or by whoever labeled the preferences, which may be a large LLM)
- $\beta$ is a hyperparameter that controls how far we may move away from the reference model
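A small worked example may help when reading the formula. Models give us sequence log-probabilities, so each term $\log \frac{\pi_{\theta}(y\mid x)}{\pi_\text{ref}(y\mid x)}$ becomes a difference of two log-probabilities. The sketch below evaluates the loss for a single preference pair in plain Python; `dpo_loss_single` and the numbers are illustrative assumptions, not values from the notebook.

```python
import math


def dpo_loss_single(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Illustrative single-pair DPO loss computed from sequence log-probabilities."""
    # log(pi_theta / pi_ref) is a difference when working with log-probabilities
    chosen_reward = beta * (logp_w - ref_logp_w)
    rejected_reward = beta * (logp_l - ref_logp_l)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) rewritten as log(1 + exp(-m))
    loss = math.log1p(math.exp(-margin))
    return loss, margin


# At initialization the policy equals the reference: margin 0, loss log(2)
loss_at_start, margin_at_start = dpo_loss_single(-12.0, -15.0, -12.0, -15.0)

# The policy raised the chosen log-prob and lowered the rejected one
loss_improved, margin_improved = dpo_loss_single(-10.0, -18.0, -12.0, -15.0)

print(loss_at_start, loss_improved, margin_improved)
```

A positive margin means the implicit reward of the chosen response exceeds that of the rejected one, which is exactly what the reward accuracy metric counts.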

In [17]:
def dpo_loss(
    chosen_logps: torch.Tensor,
    rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """
    Computes the Direct Preference Optimization (DPO) loss and associated reward metrics.

    Args:
        chosen_logps (Tensor): A tensor of shape (batch_size,) containing the log-probabilities of the chosen responses.
        rejected_logps (Tensor): A tensor of shape (batch_size,) containing the log-probabilities of the rejected responses.
        ref_chosen_logps (Tensor): A tensor of shape (batch_size,) containing the reference log-probabilities for chosen responses.
        ref_rejected_logps (Tensor): A tensor of shape (batch_size,) containing the reference log-probabilities for rejected responses.
        beta (float, optional): A scaling factor applied to the differences in log-probabilities. Defaults to 0.1.

    Returns:
        tuple[Tensor, Tensor, Tensor]:
            - loss (Tensor): The computed DPO loss as a scalar tensor.
            - reward_accuracies (Tensor): The fraction of examples where the chosen reward exceeds the rejected reward.
            - reward_margins (Tensor): The average difference between the chosen and rejected rewards.
    """
    # The inputs are already log-probabilities, so each log-ratio in the formula is a difference
    chosen_rewards = beta * (chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (rejected_logps - ref_rejected_logps)

    # -log(sigmoid(x)) computed with the numerically stable logsigmoid
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Metrics are for logging only, so detach them from the autograd graph
    reward_accuracies = (chosen_rewards > rejected_rewards).float().mean().detach()
    reward_margins = (chosen_rewards - rejected_rewards).mean().detach()
    return loss, reward_accuracies, reward_margins


In [18]:
# a = (torch.arange(100, 110) / 10).clone().detach().requires_grad_(False)
# b = (torch.arange(13, 23) / 7).clone().detach().requires_grad_(False)
# c = (torch.arange(10, 20) / 7).clone().detach().requires_grad_(False)
# d = (torch.arange(20, 30) / 7).clone().detach().requires_grad_(False)


In [19]:
# dpo_loss(a, b, c, d)

For convenience, we also define a separate function that computes log-probs from logits. You need to pick out the logits of the tokens that actually occur in the sequence. Do not forget to mask out the prompt log-probs before aggregation. The mask is already provided.

Hint: think carefully about how log-probs line up with the actual token indices, otherwise you risk an off-by-one error.
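The off-by-one issue from the hint can be seen in a tiny plain-Python sketch. It assumes the usual causal language model convention that the logits produced at position `t` score the token at position `t + 1`. The helper names and toy numbers are hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def sequence_log_prob(logits, input_ids, loss_mask):
    """Sum log-probs of masked-in tokens; logits[t] predicts input_ids[t + 1]."""
    total = 0.0
    for t in range(len(input_ids) - 1):
        target = input_ids[t + 1]
        if loss_mask[t + 1]:
            total += log_softmax(logits[t])[target]
    return total


input_ids = [0, 2, 1]   # two prompt tokens followed by one response token
loss_mask = [0, 0, 1]   # only the response token contributes
logits = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [5.0, 5.0, 5.0]]

logp = sequence_log_prob(logits, input_ids, loss_mask)
print(logp)
```

Only the logits at position 1 are used here, because they predict the response token at position 2. The logits at the last position predict a token beyond the sequence and are never needed.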

In [14]:
def get_log_prob(logits: torch.Tensor, labels: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """
    Computes the log probability for each sequence in a batch.

    Args:
        logits (Tensor): A tensor of shape [batch_size, seq_len, vocab_size]
            representing the model's output logits.
        labels (Tensor): A tensor of shape [batch_size, seq_len] containing the target token indices.
        mask (Tensor): A tensor of shape [batch_size, seq_len] indicating which tokens to include
            in the log probability (e.g., 1 for valid tokens and 0 for padding or prompt).

    Returns:
        Tensor: A tensor of shape [batch_size,] containing the log probability for each sequence.
    """
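    # Normalize logits into log-probabilities over the vocabulary
    log_probs = F.log_softmax(logits, dim=-1)
    # Pick the log-probability of each target token; labels must already be aligned
    # with logits (position t of logits scores the token at position t of labels)
    token_log_probs = torch.gather(log_probs, dim=2, index=labels.unsqueeze(2)).squeeze(2)
    # Zero out prompt and padding positions before summing over the sequence
    return (token_log_probs * mask.to(token_log_probs.dtype)).sum(dim=-1)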


## DPO training [5 points]

Just in case, we initialize the model, tokenizer, and dataset from scratch.
For simplicity we stick to a plain loop, without configs, classes, and so on.
You can rewrite it however you like, as long as it stays correct.

Everything we need is already in place; all that remains is to put it together.
To do this, compute log-probs for the prompt + chosen and prompt + rejected sequences.
Remember to assemble the loss mask correctly.
Finally, truncate the final model inputs to `MAX_SEQ_LEN` (from the correct side!).
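One way to picture the loss mask is on a single example with plain lists. The sketch assumes a left-padded prompt and a right-padded completion, as produced by the collate function. `build_model_inputs` is a hypothetical helper, and truncation to `MAX_SEQ_LEN` is deliberately left out.

```python
def build_model_inputs(prompt_ids, prompt_mask, completion_ids, completion_mask):
    """Hypothetical helper: concatenate one prompt/completion pair given as lists."""
    input_ids = prompt_ids + completion_ids
    attention_mask = prompt_mask + completion_mask
    # The loss covers only real completion tokens: zeros over the whole prompt,
    # and the completion's own mask so that its padding is excluded too
    loss_mask = [0] * len(prompt_ids) + completion_mask
    return input_ids, attention_mask, loss_mask


prompt_ids, prompt_mask = [2, 1, 7, 198], [0, 1, 1, 1]           # left-padded prompt
completion_ids, completion_mask = [30, 31, 2, 2], [1, 1, 1, 0]   # right-padded completion

input_ids, attention_mask, loss_mask = build_model_inputs(
    prompt_ids, prompt_mask, completion_ids, completion_mask
)

# Next-token prediction: logits at position t score token t + 1,
# so the labels and the loss mask are both shifted by one
labels = input_ids[1:]
shifted_loss_mask = loss_mask[1:]
scored_tokens = [tok for tok, keep in zip(labels, shifted_loss_mask) if keep]
print(scored_tokens)
```

The attention mask and the loss mask differ on purpose: the model attends to the prompt, but only the response tokens are scored.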

Training takes about an hour on a Colab T4 GPU and about 2 minutes on an H100. On Colab it is better to use float16 and AMP.
Do not forget about gradient scaling. It is not required for bf16.
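Why scaling matters for float16 can be shown with the standard library alone. The sketch rounds values to IEEE half precision with `struct` and uses a fixed toy scale of 2**16. `torch.amp.GradScaler` applies the same idea, but adjusts the scale dynamically during training.

```python
import struct


def to_float16(value):
    """Round a Python float to IEEE half precision and back (stdlib only)."""
    return struct.unpack("e", struct.pack("e", value))[0]


grad = 1e-8       # a tiny gradient value
scale = 65536.0   # toy loss scale (2**16)

lost = to_float16(grad)             # underflows to zero in float16
scaled = to_float16(grad * scale)   # the scaled value is representable
recovered = scaled / scale          # unscale in higher precision

print(lost, recovered)
```

bfloat16 keeps the same exponent range as float32, so such small values do not underflow there, which is why scaling is optional for bf16.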

**NB**: for training it is better to use Kaggle Notebooks, since they do not shut down when you stop interacting with the notebook for a long time. You can leave them for an hour without worrying that they will crash.

In [21]:
BATCH_SIZE = 8  # in colab make it smaller, or implement grad accumulation
NUM_EPOCHS = 1
LR = 5e-5
MAX_SEQ_LEN = 1024  # this also can be adjusted
MAX_PROMPT_LEN = 256 # this also can be adjusted
MAX_COMPLETION_LEN = None
BETA = 0.1

# optional, if you want to log metrics to W&B
ENABLE_WANDB = True

if torch.cuda.is_available():
    DEVICE = "cuda"
elif torch.backends.mps.is_available():
    DEVICE = "mps"
else:
    DEVICE = "cpu"
print(f"Using '{DEVICE}' device")


Using 'cuda' device


In [22]:
set_seed(42)

if ENABLE_WANDB:
    wandb.init(project="hw2-rlhf", group="dpo")

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    attn_implementation="sdpa",
    # only if you have A/H100 GPU
    # torch_dtype=torch.bfloat16,
    device_map=DEVICE,
)
model.train()
disable_dropout_in_model(model)

ref_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    attn_implementation="sdpa",
    # only if you have A/H100 GPU
    # torch_dtype=torch.bfloat16,
    device_map=DEVICE,
)
ref_model.eval()
disable_dropout_in_model(ref_model)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

dataset = load_dataset(DATASET_ID, split="train")
dataset = dataset.map(apply_chat_template, fn_kwargs={"tokenizer": tokenizer})
dataset = dataset.map(
    tokenize_row,
    fn_kwargs={
        "tokenizer": tokenizer,
        "max_prompt_length": MAX_PROMPT_LEN,
        "max_completion_length": MAX_COMPLETION_LEN,
    },
    remove_columns=["prompt", "chosen", "rejected"],
)
dataloader = DataLoader(
    dataset.with_format("torch"),
    batch_size=BATCH_SIZE,
    shuffle=True,
    pin_memory=False,
    collate_fn=partial(pad_collate_fn, pad_token_id=tokenizer.pad_token_id),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=LR)




In [23]:
# ?? model.forward

In [24]:
# chosen_input = torch.cat([batch['prompt_input_ids'], batch['chosen_input_ids']], dim=-1).to('cuda')
# chosen_mask = torch.cat([batch['prompt_attn_mask'], batch['chosen_attn_mask']], dim=-1).to('cuda')

# rejected_input = torch.cat([batch['prompt_input_ids'], batch['rejected_input_ids']], dim=-1).to('cuda')
# rejected_mask = torch.cat([batch['prompt_attn_mask'], batch['rejected_attn_mask']], dim=-1).to('cuda')

In [25]:
# chosen_probs = model.forward(chosen_input[:, :-1], chosen_mask[:, :-1])
# ref_chosen_probs = ref_model.forward(chosen_input[:, :-1], chosen_mask[:, :-1])

# rejected_probs = model.forward(rejected_input[:, :-1], rejected_mask[:, :-1])
# ref_rejected_probs = ref_model.forward(rejected_input[:, :-1], rejected_mask[:, :-1])

In [26]:
# chosen_input[:, :-1].shape

In [None]:
scaler = torch.amp.GradScaler()
dtype = torch.float16

for epoch in range(NUM_EPOCHS):
    losses, accs, margins = [], [], []

    pbar = tqdm(dataloader, desc="Epoch", leave=False)
    for batch in pbar:
        with torch.amp.autocast(device_type="cuda", dtype=dtype):
            batch = {k: v.to(DEVICE) for k, v in batch.items()}
        
            # 1. Concatenate the prompt and completion inputs for chosen & rejected
            chosen_input = torch.cat([batch['prompt_input_ids'], batch['chosen_input_ids']], dim=-1)
            chosen_mask = torch.cat([batch['prompt_attn_mask'], batch['chosen_attn_mask']], dim=-1)
        
            rejected_input = torch.cat([batch['prompt_input_ids'], batch['rejected_input_ids']], dim=-1)
            rejected_mask = torch.cat([batch['prompt_attn_mask'], batch['rejected_attn_mask']], dim=-1)

            # 2. Calculate logits for current and reference models for chosen and rejected samples
            chosen_probs = model.forward(chosen_input[:, :-1], chosen_mask[:, :-1])['logits']
            rejected_probs = model.forward(rejected_input[:, :-1], rejected_mask[:, :-1])['logits']

            # The reference model is never updated, so skip building its autograd graph to save memory
            with torch.no_grad():
                ref_chosen_probs = ref_model.forward(chosen_input[:, :-1], chosen_mask[:, :-1])['logits']
                ref_rejected_probs = ref_model.forward(rejected_input[:, :-1], rejected_mask[:, :-1])['logits']

            # print(chosen_probs.sum(), 'logits chosen')
            # print(rejected_probs.sum(), 'logits chosen')
            # 3. Calculate log probs for all models (no concat as in TRL for simplicity and to save memory with smaller batch size)
            chosen_probs_log = get_log_prob(torch.softmax(chosen_probs, dim=-1), chosen_input[:, 1:], chosen_mask[:, 1:])
            ref_chosen_probs_log = get_log_prob(torch.softmax(ref_chosen_probs, dim=-1), chosen_input[:, 1:], chosen_mask[:, 1:])

            rejected_input_log = get_log_prob(torch.softmax(rejected_probs, dim=-1), rejected_input[:, 1:], rejected_mask[:, 1:])
            ref_rejected_probs_log = get_log_prob(torch.softmax(ref_rejected_probs, dim=-1), rejected_input[:, 1:], rejected_mask[:, 1:])
        # print(chosen_probs_log.sum(), 'logits chosen')
        # print(rejected_input_log.sum(), 'logits chosen')
        # 4. Calculate loss
        loss, accuracy, margin = dpo_loss(chosen_probs_log, rejected_input_log, ref_chosen_probs_log, ref_rejected_probs_log)
        # print(loss, accuracy, margin)
        # 5. Make optimizer step
        optimizer.zero_grad()
        scaler.scale(loss).backward()
        
        # loss.backward()
        # optimizer.step()
        scaler.step(optimizer)
        scaler.update()

        losses.append(loss.item())
        accs.append(accuracy.item())
        margins.append(margin.item())
        pbar.set_postfix({"Reward margins": np.mean(margins), "Reward acc": np.mean(accs)})

    if ENABLE_WANDB:
        wandb.log(
            {
                "loss": loss.item(),
                "train-reward-margins": margin.item(),
                "train-reward-accuracy": accuracy.item(),
                "epoch": epoch,
            }
        )

    pbar.close()
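
The loop above tracks reward margins and reward accuracy. To make these two metrics concrete, here is a minimal plain-Python sketch of how they follow from summed completion log-probabilities. It assumes the standard DPO implicit reward, `beta * (log pi_theta - log pi_ref)`; the name `dpo_metrics`, the value `beta=0.1`, and the toy numbers are illustrative and are not taken from the notebook's own `dpo_loss`.

```python
import math


def dpo_metrics(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Return (mean loss, reward accuracy, mean reward margin) for a batch.

    Every argument is a list with one summed completion log-probability per example.
    """
    losses, hits, margins = [], [], []
    for pc, pr, rc, rr in zip(pi_chosen, pi_rejected, ref_chosen, ref_rejected):
        # Implicit rewards: how much more likely the policy makes a completion
        # than the reference model does, scaled by beta.
        reward_chosen = beta * (pc - rc)
        reward_rejected = beta * (pr - rr)
        margin = reward_chosen - reward_rejected
        # -log(sigmoid(margin)), written in a numerically stable form.
        losses.append(max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))
        hits.append(1.0 if margin > 0 else 0.0)
        margins.append(margin)
    n = len(losses)
    return sum(losses) / n, sum(hits) / n, sum(margins) / n


# Toy batch (made-up numbers): the first pair is ranked correctly, the second is not.
loss, accuracy, margin = dpo_metrics(
    pi_chosen=[-10.0, -12.0],
    pi_rejected=[-15.0, -11.0],
    ref_chosen=[-11.0, -12.0],
    ref_rejected=[-14.0, -12.0],
)
print(f"loss={loss:.4f} accuracy={accuracy:.2f} margin={margin:.3f}")
```

A growing margin means the policy prefers the chosen completion over the rejected one more strongly than the reference model does; accuracy is the share of pairs where that margin is positive.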


In [None]:
# My GPU ran out of memory at the very end of the first epoch. By then the model had trained well and I uploaded it to HF; only the training logs are missing here.

During training, the reward margins and accuracy should have increased. Let's check what changed after training:

In [28]:
messages = [{"role": "user", "content": "What's your morning routine like?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(DEVICE)

generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=256, do_sample=True)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

init_generated_ids = ref_model.generate(model_inputs.input_ids, max_new_tokens=256, do_sample=True)
init_response = tokenizer.batch_decode(init_generated_ids, skip_special_tokens=True)[0]

print("======== BEFORE TUNING ========")
print(init_response)
print()

print("======== AFTER TUNING ========")
print(response)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


user
What's your morning routine like?
assistant
As I wake up, I start feeling a sense of relief and excitement. I'm already getting those 8 hours of sleep, and I'm grateful for the amazing morning I've had. As I sit up, I glance around the room, taking in the familiar sights and sounds. It's a familiar ritual, but I don't know how to say goodbye.

As I sit down in my bed, I notice the familiar scent of the air. The soft chirping of crickets seems to fill the space, and the usual cozy atmosphere washes over me. I take a sip of my morning coffee, feeling its aroma wafting up our breath. It's a comfort, like nothing else.

As I get dressed, I notice the clock on the wall and floor is 3:45. I've got a head start on the day, but I'm not feeling up for the rest of it yet. I take a quick step forward, my hands behind my back, and begin to move about. It's hard to walk, but I'm trying my best to take it in.

As I start to get dressed, I notice the sun rises high in the sky. It's the first thi

In [30]:
# Upload everything to the Hub

model.push_to_hub(f"{REPO_NAME}-dpo", private=True)
tokenizer.push_to_hub(f"{REPO_NAME}-dpo", private=True)

model.safetensors: 100%|██████████| 538M/538M [00:22<00:00, 23.5MB/s]   


CommitInfo(commit_url='https://huggingface.co/fridalex/llm-course-hw2-dpo/commit/36ee36f7d537ac45ecef2a5ecbb21e45c440b8e9', commit_message='Upload tokenizer', commit_description='', oid='36ee36f7d537ac45ecef2a5ecbb21e45c440b8e9', pr_url=None, repo_url=RepoUrl('https://huggingface.co/fridalex/llm-course-hw2-dpo', endpoint='https://huggingface.co', repo_type='model', repo_id='fridalex/llm-course-hw2-dpo'), pr_revision=None, pr_num=None)

# Part 2: PPO and TRL

The second part is much simpler and is meant to introduce the most popular alignment library from Hugging Face, [TRL](https://huggingface.co/docs/trl/v0.15.0/index). You will use TRL to train PPO, which first requires training a Reward Model.

**A lyrical digression**: PPO has a paradoxical reputation. On the one hand, in RL it is considered nearly the only algorithm that is (still) usable in practice, one that works almost out of the box on any task. Its main bottleneck is data: the faster the simulator, the more likely PPO is to solve your task. There are many examples; Dota 2 and Minecraft were solved this way. On the other hand, the algorithm has a very bad reputation when it comes to implementing it from scratch, because there are many small but important details that, if done wrong, lead to subtle yet very strange behavior. Debugging this is very hard: just look at [this list](https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/) and [a similar one for RLHF](https://huggingface.co/blog/the_n_implementation_details_of_rlhf_with_ppo), and the tricks often do not carry over between domains. Moreover, precisely because of this, if you search for from-scratch PPO implementations, most of what you find will likely contain bugs.

So writing PPO yourself without close familiarity with RL and hands-on experience in it is strongly discouraged. For RLHF it is better to use TRL or similar libraries; for RL, use [Sample-Factory](https://github.com/alex-petrenko/sample-factory).

## Training the Reward Model [2 points]

Unlike DPO, which derives the update explicitly and removes the need for a reward, PPO needs a reward, so something has to provide it. In general this can be a simple function, for example equality with the correct answer. For PPO, TRL supports only rewards produced by other models (but this will be fixed in the future).
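
As an illustration of such a "simple function" reward, here is a minimal sketch of an exact-match check. The name `exact_match_reward` and the normalization rule are hypothetical; as noted above, this is not something the TRL PPO trainer accepts in place of a reward model.

```python
def exact_match_reward(completion: str, reference: str) -> float:
    """Return 1.0 when the completion equals the reference answer, else 0.0."""

    def normalize(text: str) -> str:
        # Ignore case and collapse whitespace so trivial formatting differences do not matter.
        return " ".join(text.lower().split())

    return 1.0 if normalize(completion) == normalize(reference) else 0.0


print(exact_match_reward("  It is BLUE. ", "It is blue."))
print(exact_match_reward("It is green.", "It is blue."))
```

Such a reward works only when a single correct answer exists; for open-ended preferences like the ones in this dataset, a learned reward model is the practical option.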

We will take the same dataset and train the reward model ourselves. Training requires a preference dataset with an implicit prompt ([see the examples in the documentation](https://huggingface.co/docs/trl/main/dataset_formats)). That is, there should be only two columns, chosen and rejected, each of which contains the prompt. As before, all of this has to be converted to the chat template.

Example:
```python
## Implicit prompt
preference_example = {
    "chosen": [
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "It is blue."}
    ],
    "rejected": [
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "It is green."}
    ]
}
```

More details on the loss being optimized are [here](https://rlhfbook.com/c/07-reward-models.html). TRL will handle all of it for you.
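
For reference, the pairwise (Bradley-Terry) objective commonly used for reward models can be written as follows. The notation is an assumption rather than something taken from the assignment: $r_\theta$ is the reward model, $x$ the prompt, $y_c$ and $y_r$ the chosen and rejected completions, $\sigma$ the sigmoid, and $\mathcal{D}$ the preference dataset.

$$
\mathcal{L}(\theta) = -\mathbb{E}_{(x, y_c, y_r) \sim \mathcal{D}}\left[\log \sigma\left(r_\theta(x, y_c) - r_\theta(x, y_r)\right)\right]
$$

Minimizing it pushes the score of the chosen completion above the score of the rejected one. Only the difference of the two scores enters the loss, so the absolute scale of the reward is not pinned down by it.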

In [72]:
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

def to_implicit_prompt_preferences(example: dict[str, str]) -> dict[str, list[dict[str, str]]]:
    """
    Converts an example into implicit prompt preferences format.

    Args:
        example (dict[str, str]): A dictionary with the following keys:
            - "prompt": The user's input prompt.
            - "chosen": The assistant's chosen response.
            - "rejected": The assistant's rejected response.

    Returns:
        dict[str, list[dict[str, str]]]: A dictionary containing:
            - "chosen": A list of messages forming the conversation for the chosen response.
            - "rejected": A list of messages forming the conversation for the rejected response.
    """
    return {
        "chosen": [
            {"role": "user", "content": example["prompt"]},
            {"role": "assistant", "content": example["chosen"]},
        ],
        "rejected": [
            {"role": "user", "content": example["prompt"]},
            {"role": "assistant", "content": example["rejected"]},
        ],
    }



In [73]:
dataset = load_dataset(DATASET_ID, split="train")
dataset = dataset.map(to_implicit_prompt_preferences, remove_columns=["prompt"])
dataset = dataset.train_test_split(train_size=0.9)

In [75]:
dataset['train']


Dataset({
    features: ['chosen', 'rejected'],
    num_rows: 9795
})

We will use the same model and train only the linear layer on top of it. For the model, use `AutoModelForSequenceClassification`. Train the reward model with `RewardConfig` and `RewardTrainer`. One epoch should be enough (even less). For convenience, upload the resulting model to the Hub.

In [76]:
ENABLE_WANDB = False

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Important so that the trainer works correctly with this model.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.chat_template = "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] +  '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
# ========== TODO ==========
#      Your code here     =
# ==========================
reward_model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=1)
reward_model.train()
reward_model.config.pad_token_id = tokenizer.pad_token_id

# Freeze weights
for name, param in reward_model.named_parameters():
    if name != 'score.weight':
        param.requires_grad = False

reward_config = RewardConfig(
    num_train_epochs=1,
    output_dir='model',
    do_train=True, 
    per_device_train_batch_size=8,
    max_length=1024,
    disable_dropout=True,
    learning_rate=3e-04,
    seed=42,
    logging_steps=25,
    report_to="wandb" if ENABLE_WANDB else "none",
)
reward_trainer = RewardTrainer(
    model=reward_model,
    processing_class=tokenizer,
    args=reward_config,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)

# reward_trainer.train()

Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at HuggingFaceTB/SmolLM-135M-Instruct and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 9795/9795 [00:05<00:00, 1833.19 examples/s]
Map: 100%|██████████| 9795/9795 [00:22<00:00, 426.67 examples/s]
Filter: 100%|██████████| 9795/9795 [00:06<00:00, 1584.48 examples/s]
Map: 100%|██████████| 1089/1089 [00:00<00:00, 2032.53 examples/s]
Map: 100%|██████████| 1089/1089 [00:02<00:00, 426.81 examples/s]
Filter: 100%|██████████| 1089/1089 [00:00<00:00, 1494.66 examples/s]


In [77]:
reward_trainer.train()



You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
{'loss': 0.3971, 'grad_norm': 1.9981492757797241, 'learning_rate': 0.0002938725490196078, 'epoch': 0.02}
{'loss': 0.1418, 'grad_norm': 1.1195499897003174, 'learning_rate': 0.0002877450980392157, 'epoch': 0.04}
{'loss': 0.0736, 'grad_norm': 0.29912275075912476, 'learning_rate': 0.0002816176470588235, 'epoch': 0.06}
{'loss': 0.0494, 'grad_norm': 0.26732149720191956, 'learning_rate': 0.00027549019607843136, 'epoch': 0.08}
{'loss': 0.0329, 'grad_norm': 0.24634119868278503, 'learning_rate': 0.0002693627450980392, 'epoch': 0.1}
{'loss': 0.021, 'grad_norm': 0.18083544075489044, 'learning_rate': 0.000263235294117647, 'epoch': 0.12}
{'loss': 0.0261, 'grad_norm': 0.32099854946136475, 'learning_rate': 0.00025710784313725485, 'epoch': 0.14}
{'loss': 0.0206, 'grad_norm': 0.09169066697359085, 'learning_rate': 0.00025098039215686274, 'epoch': 0.16}
{'loss': 0.0139, 'grad_norm': 0.054739587008953094, 'learning_rate': 0.0002448529411764706, 'epoch': 0.18}
{'loss': 0.0126, 'grad_norm': 0.05411922186613083, 'learning_rate': 0.0002387254901960784, 'epoch': 0.2}
{'loss': 0.0099, 'grad_norm': 0.0529799684882164, 'learning_rate': 0.00023259803921568627, 'epoch': 0.22}
{'loss': 0.0128, 'grad_norm': 0.0433105044066906, 'learning_rate': 0.0002264705882352941, 'epoch': 0.25}
{'loss': 0.0099, 'grad_norm': 0.15084271132946014, 'learning_rate': 0.00022034313725490196, 'epoch': 0.27}
{'loss': 0.011, 'grad_norm': 0.15795546770095825, 'learning_rate': 0.0002142156862745098, 'epoch': 0.29}
{'loss': 0.0083, 'grad_norm': 0.05632450804114342, 'learning_rate': 0.00020808823529411762, 'epoch': 0.31}
{'loss': 0.0064, 'grad_norm': 0.05008755624294281, 'learning_rate': 0.00020196078431372548, 'epoch': 0.33}
{'loss': 0.0104, 'grad_norm': 0.02150246687233448, 'learning_rate': 0.00019583333333333331, 'epoch': 0.35}
{'loss': 0.0075, 'grad_norm': 0.07741361856460571, 'learning_rate': 0.00018970588235294115, 'epoch': 0.37}
{'loss': 0.0084, 'grad_norm': 0.023946058005094528, 'learning_rate': 0.000183578431372549, 'epoch': 0.39}
{'loss': 0.005, 'grad_norm': 0.05430074781179428, 'learning_rate': 0.00017745098039215684, 'epoch': 0.41}
{'loss': 0.008, 'grad_norm': 0.030501747503876686, 'learning_rate': 0.00017132352941176467, 'epoch': 0.43}
{'loss': 0.0057, 'grad_norm': 0.06089256703853607, 'learning_rate': 0.00016519607843137256, 'epoch': 0.45}
{'loss': 0.0057, 'grad_norm': 0.0679587796330452, 'learning_rate': 0.0001590686274509804, 'epoch': 0.47}
{'loss': 0.0056, 'grad_norm': 0.1328796148300171, 'learning_rate': 0.00015294117647058822, 'epoch': 0.49}
{'loss': 0.0046, 'grad_norm': 0.020577097311615944, 'learning_rate': 0.00014681372549019605, 'epoch': 0.51}
{'loss': 0.0039, 'grad_norm': 0.02037283033132553, 'learning_rate': 0.00014068627450980391, 'epoch': 0.53}
{'loss': 0.003, 'grad_norm': 0.01585463248193264, 'learning_rate': 0.00013455882352941175, 'epoch': 0.55}
{'loss': 0.0054, 'grad_norm': 0.2220660150051117, 'learning_rate': 0.00012843137254901958, 'epoch': 0.57}
{'loss': 0.0049, 'grad_norm': 0.06791508942842484, 'learning_rate': 0.00012230392156862744, 'epoch': 0.59}
{'loss': 0.0041, 'grad_norm': 0.0596810057759285, 'learning_rate': 0.00011617647058823528, 'epoch': 0.61}
{'loss': 0.0039, 'grad_norm': 0.050478823482990265, 'learning_rate': 0.00011004901960784312, 'epoch': 0.63}
{'loss': 0.0039, 'grad_norm': 0.013935361988842487, 'learning_rate': 0.00010392156862745096, 'epoch': 0.65}
{'loss': 0.0032, 'grad_norm': 0.017208140343427658, 'learning_rate': 9.779411764705882e-05, 'epoch': 0.67}
{'loss': 0.0026, 'grad_norm': 0.0473170280456543, 'learning_rate': 9.166666666666667e-05, 'epoch': 0.69}
{'loss': 0.0054, 'grad_norm': 0.02568373829126358, 'learning_rate': 8.55392156862745e-05, 'epoch': 0.71}
{'loss': 0.0024, 'grad_norm': 0.011708649806678295, 'learning_rate': 7.941176470588235e-05, 'epoch': 0.74}
{'loss': 0.003, 'grad_norm': 0.026023617014288902, 'learning_rate': 7.328431372549019e-05, 'epoch': 0.76}
{'loss': 0.0051, 'grad_norm': 0.019107427448034286, 'learning_rate': 6.715686274509804e-05, 'epoch': 0.78}
{'loss': 0.0029, 'grad_norm': 0.04272183030843735, 'learning_rate': 6.102941176470588e-05, 'epoch': 0.8}
{'loss': 0.0044, 'grad_norm': 0.03475893661379814, 'learning_rate': 5.4901960784313716e-05, 'epoch': 0.82}
{'loss': 0.0043, 'grad_norm': 0.3251512050628662, 'learning_rate': 4.877450980392157e-05, 'epoch': 0.84}
{'loss': 0.0041, 'grad_norm': 0.08771193772554398, 'learning_rate': 4.264705882352941e-05, 'epoch': 0.86}
{'loss': 0.0034, 'grad_norm': 0.11351912468671799, 'learning_rate': 3.6519607843137254e-05, 'epoch': 0.88}
{'loss': 0.0023, 'grad_norm': 0.012923027388751507, 'learning_rate': 3.0392156862745097e-05, 'epoch': 0.9}
{'loss': 0.0023, 'grad_norm': 0.03937005624175072, 'learning_rate': 2.426470588235294e-05, 'epoch': 0.92}
{'loss': 0.0054, 'grad_norm': 0.19992662966251373, 'learning_rate': 1.813725490196078e-05, 'epoch': 0.94}
{'loss': 0.0026, 'grad_norm': 0.011011868715286255, 'learning_rate': 1.2009803921568626e-05, 'epoch': 0.96}
{'loss': 0.0047, 'grad_norm': 0.011811692267656326, 'learning_rate': 5.88235294117647e-06, 'epoch': 0.98}
100%|██████████| 1224/1224 [05:52<00:00,  3.47it/s]

{'train_runtime': 352.7785, 'train_samples_per_second': 27.754, 'train_steps_per_second': 3.47, 'train_loss': 0.02048316763506995, 'epoch': 1.0}





TrainOutput(global_step=1224, training_loss=0.02048316763506995, metrics={'train_runtime': 352.7785, 'train_samples_per_second': 27.754, 'train_steps_per_second': 3.47, 'total_flos': 0.0, 'train_loss': 0.02048316763506995, 'epoch': 1.0})

The reward for chosen should be higher than for rejected.

In [78]:
inputs_chosen = tokenizer.apply_chat_template(dataset['test'][0]["chosen"], tokenize=False, )
inputs_chosen = tokenizer(inputs_chosen, return_tensors="pt").to(DEVICE)

inputs_rejected = tokenizer.apply_chat_template(dataset['test'][0]["rejected"], tokenize=False)
inputs_rejected = tokenizer(inputs_rejected, return_tensors="pt").to(DEVICE)

score_chosen = reward_model(**inputs_chosen).logits[0].cpu().detach()
score_rejected = reward_model(**inputs_rejected).logits[0].cpu().detach()


In [79]:
score_chosen, score_rejected




(tensor([1.8269]), tensor([-3.7880]))
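
A single pair is only a spot check. A minimal sketch of how the same comparison could be summarized over many pairs is shown below; it works on plain lists of scores, and the name `pairwise_accuracy` is hypothetical. The first pair reuses the two scores printed above, while the other two pairs are made-up numbers for illustration only.

```python
def pairwise_accuracy(scores_chosen, scores_rejected):
    """Return (share of pairs with chosen > rejected, mean score margin)."""
    margins = [c - r for c, r in zip(scores_chosen, scores_rejected)]
    accuracy = sum(1 for m in margins if m > 0) / len(margins)
    return accuracy, sum(margins) / len(margins)


# First pair: the scores printed above. Remaining pairs: made-up values.
scores_chosen = [1.8269, 0.5, -1.0]
scores_rejected = [-3.7880, -0.5, 0.0]
accuracy, mean_margin = pairwise_accuracy(scores_chosen, scores_rejected)
print(f"accuracy={accuracy:.2f} mean_margin={mean_margin:.3f}")
```

In the notebook itself the score lists would come from running `reward_model` over the chosen and rejected conversations in `dataset['test']`.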

In [83]:
# Upload the reward model to the Hub

reward_trainer.push_to_hub(f"{REPO_NAME}-reward-model", dataset_name=DATASET_ID)


CommitInfo(commit_url='https://huggingface.co/fridalex/model/commit/55a547db3f2cb1fa7f216e37284a912221ba6f3b', commit_message='fridalex/llm-course-hw2-reward-model', commit_description='', oid='55a547db3f2cb1fa7f216e37284a912221ba6f3b', pr_url=None, repo_url=RepoUrl('https://huggingface.co/fridalex/model', endpoint='https://huggingface.co', repo_type='model', repo_id='fridalex/model'), pr_revision=None, pr_num=None)

## Training PPO [4 points]

**WARN**: TRL recently merged a large PPO refactor and forgot to update all of the documentation and examples 🥴🥴🥴. For correct examples, look at the code rather than the documentation. If you want to know the culprits by face:

<a href="https://ibb.co/zTFL4GTt"><img src="https://i.ibb.co/1tMpm8t4/Screenshot-2025-02-13-at-17-40-48.png" alt="" border="0" /></a>

For PPO we need the same dataset, but now in prompt-only format. Convert the prompt to the chat template and tokenize it (`tokenizer.apply_chat_template`). All other columns can be removed.

For `policy` and `ref_policy`, load SmolLM2-135M-Instruct; for `reward_model` and `value_model`, use your trained reward model. For training, use `PPOConfig` and `PPOTrainer`.
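
The trainer output further down reports `objective/scores`, `objective/kl`, `objective/non_score_reward` and `objective/rlhf_reward`. The sketch below shows how these values appear to relate: the KL penalty is subtracted from the reward-model score. The coefficient `kl_coef=0.05` is an assumption inferred from the logged numbers (it is not set anywhere in this notebook), and the function name `combine_reward` is hypothetical.

```python
def combine_reward(score: float, kl: float, kl_coef: float = 0.05) -> tuple[float, float]:
    """Return (non_score_reward, rlhf_reward) for one logged step.

    non_score_reward is the KL penalty; rlhf_reward is the score plus that penalty.
    """
    non_score_reward = -kl_coef * kl
    return non_score_reward, score + non_score_reward


# Values copied from the second logged PPO step in the output below.
non_score, total = combine_reward(score=1.6131585836410522, kl=1.8971768617630005)
print(f"non_score_reward={non_score:.4f} rlhf_reward={total:.4f}")
```

Reading the logs this way helps separate two effects: `objective/scores` shows whether the reward model likes the generations more, while `objective/kl` shows how far the policy has drifted from `ref_policy` to get there.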

In [84]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, padding_side="left")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})

# ========== TODO ==========
#      Your code here      =
# ==========================
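import copy

# PPO updates the value model, so give it its own copy; sharing one object
# with the reward model would change the rewards during training.
value_model = copy.deepcopy(reward_model)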
policy = AutoModelForCausalLM.from_pretrained(MODEL_ID)
ref_policy = AutoModelForCausalLM.from_pretrained(MODEL_ID)

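def tokenize(example, tokenizer):
    # Wrap the prompt as a single user turn and end with the assistant header,
    # so that the policy generates the assistant's reply.
    messages = [{"role": "user", "content": example["prompt"]}]
    input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    return {"input_ids": input_ids}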

dataset = load_dataset(DATASET_ID, split="train")
dataset = dataset.remove_columns(["chosen", "rejected"])
dataset = dataset.map(tokenize, fn_kwargs={"tokenizer": tokenizer}, remove_columns=dataset.column_names)
dataset = dataset.train_test_split()

training_args = PPOConfig(output_dir='model_PPO',
                          do_train=True,
                          do_eval=True,
                          per_device_train_batch_size=4,
                          learning_rate=5e-5,
                          logging_steps=25,
                          seed=42,
                          num_train_epochs=1
                         )

Using the latest cached version of the dataset since HumanLLMs/Human-Like-DPO-Dataset couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /home/jupyter/datasphere/project/datasetscache/HumanLLMs___human-like-dpo-dataset/default/0.0.0/dd82ab6a284a15765964149e6a6603ff8ed7d672 (last modified on Fri Mar 28 18:05:31 2025).


In [85]:
trainer = PPOTrainer(args=training_args, 
                     processing_class=tokenizer, 
                     model=policy,
                     ref_model=ref_policy,
                     reward_model=reward_model,
                     train_dataset=dataset['train'],
                     value_model=value_model,
                     eval_dataset=dataset['test']
                    )
trainer.train()



===training policy===



  0%|          | 1/2041 [00:05<2:52:56,  5.09s/it]

{'eps': 0, 'objective/kl': 1.239776611328125e-05, 'objective/entropy': 37.6148681640625, 'objective/non_score_reward': -6.198883397701138e-07, 'objective/rlhf_reward': 0.8250447511672974, 'objective/scores': 0.8250453472137451, 'policy/approxkl_avg': 0.714454710483551, 'policy/clipfrac_avg': 0.28891509771347046, 'loss/policy_avg': -0.05566459149122238, 'loss/value_avg': 0.9395594596862793, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.588732123374939, 'val/ratio': 1.8873934745788574, 'val/ratio_var': 0.711409866809845, 'val/num_eos_tokens': 0, 'lr': 5e-05, 'episode': 4, 'epoch': 0.0}



                                                                       
  0%|          | 2/2041 [00:14<4:11:19,  7.40s/it]

{'eps': 0, 'objective/kl': 1.8971768617630005, 'objective/entropy': 33.864620208740234, 'objective/non_score_reward': -0.09485884010791779, 'objective/rlhf_reward': 1.5182996988296509, 'objective/scores': 1.6131585836410522, 'policy/approxkl_avg': 0.15987087786197662, 'policy/clipfrac_avg': 0.24646227061748505, 'loss/policy_avg': -0.05364333093166351, 'loss/value_avg': 0.9573596119880676, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6700950264930725, 'val/ratio': 1.0601707696914673, 'val/ratio_var': 0.002866404829546809, 'val/num_eos_tokens': 0, 'lr': 4.997550220480157e-05, 'episode': 8, 'epoch': 0.0}



                                                                       
  0%|          | 3/2041 [00:19<3:39:15,  6.46s/it]

{'eps': 0, 'objective/kl': 1.623546838760376, 'objective/entropy': 28.30360221862793, 'objective/non_score_reward': -0.08117733895778656, 'objective/rlhf_reward': 0.5252725481987, 'objective/scores': 0.6064499020576477, 'policy/approxkl_avg': 0.11487320065498352, 'policy/clipfrac_avg': 0.18632075190544128, 'loss/policy_avg': -0.0446024015545845, 'loss/value_avg': 1.0699694156646729, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5543174743652344, 'val/ratio': 1.0838556289672852, 'val/ratio_var': 0.0065040732733905315, 'val/num_eos_tokens': 0, 'lr': 4.995100440960314e-05, 'episode': 12, 'epoch': 0.0}



                                                                       
  0%|          | 4/2041 [00:24<3:22:06,  5.95s/it]

{'eps': 0, 'objective/kl': 10.708807945251465, 'objective/entropy': 34.88738250732422, 'objective/non_score_reward': -0.5354404449462891, 'objective/rlhf_reward': -0.21680909395217896, 'objective/scores': 0.3186313509941101, 'policy/approxkl_avg': 0.4939707815647125, 'policy/clipfrac_avg': 0.2146226316690445, 'loss/policy_avg': -0.04820301756262779, 'loss/value_avg': 0.9520881772041321, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7438135147094727, 'val/ratio': 1.0728681087493896, 'val/ratio_var': 0.004675543867051601, 'val/num_eos_tokens': 0, 'lr': 4.9926506614404707e-05, 'episode': 16, 'epoch': 0.0}



                                                                       
  0%|          | 5/2041 [00:29<3:13:29,  5.70s/it]

{'eps': 0, 'objective/kl': 16.279653549194336, 'objective/entropy': 38.99875259399414, 'objective/non_score_reward': -0.8139826059341431, 'objective/rlhf_reward': -1.439305305480957, 'objective/scores': -0.6253226399421692, 'policy/approxkl_avg': 1.45090651512146, 'policy/clipfrac_avg': 0.20754718780517578, 'loss/policy_avg': -0.04757889732718468, 'loss/value_avg': 1.4740244150161743, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7854398488998413, 'val/ratio': 1.0711616277694702, 'val/ratio_var': 0.0059131477028131485, 'val/num_eos_tokens': 2, 'lr': 4.9902008819206275e-05, 'episode': 20, 'epoch': 0.0}



                                                                       
  0%|          | 6/2041 [00:35<3:08:32,  5.56s/it]

{'eps': 0, 'objective/kl': 26.399681091308594, 'objective/entropy': 42.41204833984375, 'objective/non_score_reward': -1.3199841976165771, 'objective/rlhf_reward': -1.441834807395935, 'objective/scores': -0.12185061722993851, 'policy/approxkl_avg': 0.16993485391139984, 'policy/clipfrac_avg': 0.18867924809455872, 'loss/policy_avg': -0.045251403003931046, 'loss/value_avg': 1.1503198146820068, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8568536639213562, 'val/ratio': 1.0499508380889893, 'val/ratio_var': 0.0022517929319292307, 'val/num_eos_tokens': 2, 'lr': 4.9877511024007836e-05, 'episode': 24, 'epoch': 0.0}



                                                                       
  0%|          | 7/2041 [00:40<3:05:11,  5.46s/it]

{'eps': 0, 'objective/kl': 20.998790740966797, 'objective/entropy': 49.90937042236328, 'objective/non_score_reward': -1.0499393939971924, 'objective/rlhf_reward': 0.014268159866333008, 'objective/scores': 1.0642075538635254, 'policy/approxkl_avg': 0.33475592732429504, 'policy/clipfrac_avg': 0.21344339847564697, 'loss/policy_avg': -0.05787738040089607, 'loss/value_avg': 1.067716121673584, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9210748672485352, 'val/ratio': 1.2057461738586426, 'val/ratio_var': 0.06678853929042816, 'val/num_eos_tokens': 1, 'lr': 4.985301322880941e-05, 'episode': 28, 'epoch': 0.0}



                                                                       
  0%|          | 8/2041 [00:45<3:03:27,  5.41s/it]

{'eps': 0, 'objective/kl': 31.19163703918457, 'objective/entropy': 47.851097106933594, 'objective/non_score_reward': -1.5595818758010864, 'objective/rlhf_reward': -0.7505007982254028, 'objective/scores': 0.8090810775756836, 'policy/approxkl_avg': 0.14633311331272125, 'policy/clipfrac_avg': 0.20283018052577972, 'loss/policy_avg': -0.05710326135158539, 'loss/value_avg': 1.0039896965026855, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.887448787689209, 'val/ratio': 1.0292928218841553, 'val/ratio_var': 0.0011409464059397578, 'val/num_eos_tokens': 1, 'lr': 4.982851543361098e-05, 'episode': 32, 'epoch': 0.0}



                                                                       
  0%|          | 9/2041 [00:51<3:02:31,  5.39s/it]

{'eps': 0, 'objective/kl': 30.571617126464844, 'objective/entropy': 41.459510803222656, 'objective/non_score_reward': -1.528580904006958, 'objective/rlhf_reward': -0.8958563208580017, 'objective/scores': 0.6327245831489563, 'policy/approxkl_avg': 0.3172243535518646, 'policy/clipfrac_avg': 0.18160375952720642, 'loss/policy_avg': -0.048630356788635254, 'loss/value_avg': 0.9130662679672241, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.667744517326355, 'val/ratio': 1.0393452644348145, 'val/ratio_var': 0.0017807421972975135, 'val/num_eos_tokens': 3, 'lr': 4.980401763841255e-05, 'episode': 36, 'epoch': 0.0}



                                                                       
  0%|          | 10/2041 [00:56<3:01:13,  5.35s/it]

{'eps': 0, 'objective/kl': 42.879356384277344, 'objective/entropy': 49.8011589050293, 'objective/non_score_reward': -2.143967866897583, 'objective/rlhf_reward': -0.9252245426177979, 'objective/scores': 1.2187433242797852, 'policy/approxkl_avg': 0.11860764026641846, 'policy/clipfrac_avg': 0.19103772938251495, 'loss/policy_avg': -0.049626100808382034, 'loss/value_avg': 1.03985595703125, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9051010608673096, 'val/ratio': 1.0817432403564453, 'val/ratio_var': 0.0072579472325742245, 'val/num_eos_tokens': 1, 'lr': 4.9779519843214115e-05, 'episode': 40, 'epoch': 0.0}



                                                                       
  1%|          | 11/2041 [01:01<3:00:00,  5.32s/it]

{'eps': 0, 'objective/kl': 41.746002197265625, 'objective/entropy': 45.55827331542969, 'objective/non_score_reward': -2.0873000621795654, 'objective/rlhf_reward': -0.6874116659164429, 'objective/scores': 1.3998883962631226, 'policy/approxkl_avg': 0.11827096343040466, 'policy/clipfrac_avg': 0.23113207519054413, 'loss/policy_avg': -0.05431003123521805, 'loss/value_avg': 0.8653074502944946, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.91999751329422, 'val/ratio': 1.0640137195587158, 'val/ratio_var': 0.0046383836306631565, 'val/num_eos_tokens': 0, 'lr': 4.975502204801568e-05, 'episode': 44, 'epoch': 0.01}



                                                                       
  1%|          | 12/2041 [01:06<3:00:19,  5.33s/it]

{'eps': 0, 'objective/kl': 32.36552047729492, 'objective/entropy': 70.18437194824219, 'objective/non_score_reward': -1.6182761192321777, 'objective/rlhf_reward': -1.2022764682769775, 'objective/scores': 0.4159996509552002, 'policy/approxkl_avg': 0.09902387857437134, 'policy/clipfrac_avg': 0.28066039085388184, 'loss/policy_avg': -0.06435428559780121, 'loss/value_avg': 1.0426594018936157, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0924787521362305, 'val/ratio': 1.1035056114196777, 'val/ratio_var': 0.011279452592134476, 'val/num_eos_tokens': 0, 'lr': 4.973052425281725e-05, 'episode': 48, 'epoch': 0.01}



                                                                       
  1%|          | 13/2041 [01:12<2:59:43,  5.32s/it]

{'eps': 0, 'objective/kl': 28.643199920654297, 'objective/entropy': 51.940513610839844, 'objective/non_score_reward': -1.4321600198745728, 'objective/rlhf_reward': -0.2455521821975708, 'objective/scores': 1.186607837677002, 'policy/approxkl_avg': 0.0921836718916893, 'policy/clipfrac_avg': 0.18985849618911743, 'loss/policy_avg': -0.04925812780857086, 'loss/value_avg': 0.9417964220046997, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0366342067718506, 'val/ratio': 1.1783795356750488, 'val/ratio_var': 0.0381385013461113, 'val/num_eos_tokens': 4, 'lr': 4.970602645761881e-05, 'episode': 52, 'epoch': 0.01}



                                                                       
  1%|          | 14/2041 [01:17<2:59:46,  5.32s/it]

{'eps': 0, 'objective/kl': 31.528432846069336, 'objective/entropy': 68.34970092773438, 'objective/non_score_reward': -1.5764217376708984, 'objective/rlhf_reward': -1.5504766702651978, 'objective/scores': 0.025945037603378296, 'policy/approxkl_avg': 0.1099998727440834, 'policy/clipfrac_avg': 0.2158018946647644, 'loss/policy_avg': -0.05023517459630966, 'loss/value_avg': 1.4550693035125732, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0219769477844238, 'val/ratio': 1.2559092044830322, 'val/ratio_var': 0.08944106101989746, 'val/num_eos_tokens': 2, 'lr': 4.968152866242039e-05, 'episode': 56, 'epoch': 0.01}



                                                                       

{'eps': 0, 'objective/kl': 33.50262451171875, 'objective/entropy': 40.63916015625, 'objective/non_score_reward': -1.67513108253479, 'objective/rlhf_reward': -1.303091049194336, 'objective/scores': 0.3720400035381317, 'policy/approxkl_avg': 0.13955004513263702, 'policy/clipfrac_avg': 0.17452828586101532, 'loss/policy_avg': -0.043337248265743256, 'loss/value_avg': 0.9019114375114441, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7724559307098389, 'val/ratio': 1.046579122543335, 'val/ratio_var': 0.002441985299810767, 'val/num_eos_tokens': 6, 'lr': 4.9657030867221955e-05, 'episode': 60, 'epoch': 0.01}



  1%|          | 15/2041 [01:22<2:59:18,  5.31s/it]
                                                                       
  1%|          | 16/2041 [01:28<2:59:47,  5.33s/it]

{'eps': 0, 'objective/kl': 30.616836547851562, 'objective/entropy': 22.594135284423828, 'objective/non_score_reward': -1.5308417081832886, 'objective/rlhf_reward': -0.37646543979644775, 'objective/scores': 1.1543762683868408, 'policy/approxkl_avg': 0.044494058936834335, 'policy/clipfrac_avg': 0.10495282709598541, 'loss/policy_avg': -0.03110651671886444, 'loss/value_avg': 0.6564831137657166, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.414008229970932, 'val/ratio': 1.0130800008773804, 'val/ratio_var': 0.00038329779636114836, 'val/num_eos_tokens': 6, 'lr': 4.9632533072023516e-05, 'episode': 64, 'epoch': 0.01}



                                                                       
  1%|          | 17/2041 [01:33<2:58:53,  5.30s/it]

{'eps': 0, 'objective/kl': 35.47602844238281, 'objective/entropy': 48.629356384277344, 'objective/non_score_reward': -1.7738014459609985, 'objective/rlhf_reward': -2.2388315200805664, 'objective/scores': -0.46503016352653503, 'policy/approxkl_avg': 0.13381414115428925, 'policy/clipfrac_avg': 0.18278302252292633, 'loss/policy_avg': -0.05865176394581795, 'loss/value_avg': 1.1257506608963013, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9960563778877258, 'val/ratio': 1.0671502351760864, 'val/ratio_var': 0.00573002127930522, 'val/num_eos_tokens': 0, 'lr': 4.960803527682509e-05, 'episode': 68, 'epoch': 0.01}



                                                                       
  1%|          | 18/2041 [01:38<2:57:32,  5.27s/it]

{'eps': 0, 'objective/kl': 38.95935821533203, 'objective/entropy': 49.687416076660156, 'objective/non_score_reward': -1.9479680061340332, 'objective/rlhf_reward': -1.989875316619873, 'objective/scores': -0.04190737009048462, 'policy/approxkl_avg': 0.14542421698570251, 'policy/clipfrac_avg': 0.20283019542694092, 'loss/policy_avg': -0.04907900094985962, 'loss/value_avg': 1.3763024806976318, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1081454753875732, 'val/ratio': 1.054811954498291, 'val/ratio_var': 0.0031372467055916786, 'val/num_eos_tokens': 0, 'lr': 4.958353748162666e-05, 'episode': 72, 'epoch': 0.01}



                                                                       
  1%|          | 19/2041 [01:43<2:57:37,  5.27s/it]

{'eps': 0, 'objective/kl': 50.28990173339844, 'objective/entropy': 51.026702880859375, 'objective/non_score_reward': -2.5144951343536377, 'objective/rlhf_reward': -1.6474454402923584, 'objective/scores': 0.8670496940612793, 'policy/approxkl_avg': 0.10491063445806503, 'policy/clipfrac_avg': 0.20400944352149963, 'loss/policy_avg': -0.053526394069194794, 'loss/value_avg': 1.0839954614639282, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0657036304473877, 'val/ratio': 1.0386378765106201, 'val/ratio_var': 0.0021423029247671366, 'val/num_eos_tokens': 2, 'lr': 4.955903968642822e-05, 'episode': 76, 'epoch': 0.01}



                                                                       
  1%|          | 20/2041 [01:49<2:57:05,  5.26s/it]

{'eps': 0, 'objective/kl': 34.74308395385742, 'objective/entropy': 52.87108612060547, 'objective/non_score_reward': -1.737154245376587, 'objective/rlhf_reward': 0.6917243003845215, 'objective/scores': 2.4288785457611084, 'policy/approxkl_avg': 0.12727127969264984, 'policy/clipfrac_avg': 0.23113207519054413, 'loss/policy_avg': -0.05770854651927948, 'loss/value_avg': 1.3332921266555786, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9713219404220581, 'val/ratio': 1.106726884841919, 'val/ratio_var': 0.01127924770116806, 'val/num_eos_tokens': 0, 'lr': 4.953454189122979e-05, 'episode': 80, 'epoch': 0.01}



                                                                       
  1%|          | 21/2041 [01:54<2:57:22,  5.27s/it]

{'eps': 0, 'objective/kl': 53.37518310546875, 'objective/entropy': 43.504417419433594, 'objective/non_score_reward': -2.668759346008301, 'objective/rlhf_reward': -1.4087257385253906, 'objective/scores': 1.2600336074829102, 'policy/approxkl_avg': 0.04850105196237564, 'policy/clipfrac_avg': 0.1603773683309555, 'loss/policy_avg': -0.039301417768001556, 'loss/value_avg': 1.079655408859253, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8195248246192932, 'val/ratio': 1.0139994621276855, 'val/ratio_var': 0.0002699346805457026, 'val/num_eos_tokens': 4, 'lr': 4.951004409603136e-05, 'episode': 84, 'epoch': 0.01}



                                                                       
  1%|          | 22/2041 [01:59<2:57:19,  5.27s/it]

{'eps': 0, 'objective/kl': 43.53974914550781, 'objective/entropy': 49.259132385253906, 'objective/non_score_reward': -2.176987648010254, 'objective/rlhf_reward': -2.3419740200042725, 'objective/scores': -0.16498637199401855, 'policy/approxkl_avg': 0.26836860179901123, 'policy/clipfrac_avg': 0.21344339847564697, 'loss/policy_avg': -0.03770057484507561, 'loss/value_avg': 0.6076620221138, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8634496927261353, 'val/ratio': 1.0577914714813232, 'val/ratio_var': 0.00430385721847415, 'val/num_eos_tokens': 7, 'lr': 4.948554630083293e-05, 'episode': 88, 'epoch': 0.01}



                                                                       
  1%|          | 23/2041 [02:05<2:58:21,  5.30s/it]

{'eps': 0, 'objective/kl': 50.906227111816406, 'objective/entropy': 35.556034088134766, 'objective/non_score_reward': -2.545311450958252, 'objective/rlhf_reward': -1.4846420288085938, 'objective/scores': 1.0606694221496582, 'policy/approxkl_avg': 0.06474868953227997, 'policy/clipfrac_avg': 0.15566037595272064, 'loss/policy_avg': -0.04245036467909813, 'loss/value_avg': 0.9592081308364868, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6826825737953186, 'val/ratio': 1.0254037380218506, 'val/ratio_var': 0.0006696322816424072, 'val/num_eos_tokens': 8, 'lr': 4.946104850563449e-05, 'episode': 92, 'epoch': 0.01}



                                                                       
  1%|          | 24/2041 [02:10<2:58:39,  5.31s/it]

{'eps': 0, 'objective/kl': 40.22883987426758, 'objective/entropy': 41.80463790893555, 'objective/non_score_reward': -2.011442184448242, 'objective/rlhf_reward': -1.1701444387435913, 'objective/scores': 0.8412977457046509, 'policy/approxkl_avg': 0.0874582827091217, 'policy/clipfrac_avg': 0.14740565419197083, 'loss/policy_avg': -0.036503128707408905, 'loss/value_avg': 1.0821959972381592, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6641427278518677, 'val/ratio': 0.9955618977546692, 'val/ratio_var': 6.411859794752672e-05, 'val/num_eos_tokens': 6, 'lr': 4.943655071043606e-05, 'episode': 96, 'epoch': 0.01}


                                                                       
  1%|          | 25/2041 [02:15<2:58:18,  5.31s/it]

{'eps': 0, 'objective/kl': 43.590362548828125, 'objective/entropy': 44.88491439819336, 'objective/non_score_reward': -2.179518222808838, 'objective/rlhf_reward': -2.2649385929107666, 'objective/scores': -0.08542037010192871, 'policy/approxkl_avg': 0.057775769382715225, 'policy/clipfrac_avg': 0.20518869161605835, 'loss/policy_avg': -0.05413230136036873, 'loss/value_avg': 0.8040376901626587, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9024684429168701, 'val/ratio': 1.022684097290039, 'val/ratio_var': 0.0006139609031379223, 'val/num_eos_tokens': 12, 'lr': 4.9412052915237635e-05, 'episode': 100, 'epoch': 0.01}



                                                                       
  1%|          | 25/2041 [02:21<2:58:18,  5.31s/it]

{'eps': 0, 'objective/kl': 43.028343200683594, 'objective/entropy': 46.381675720214844, 'objective/non_score_reward': -2.1514172554016113, 'objective/rlhf_reward': -1.4555413722991943, 'objective/scores': 0.695875883102417, 'policy/approxkl_avg': 0.13827617466449738, 'policy/clipfrac_avg': 0.22051885724067688, 'loss/policy_avg': -0.0523267462849617, 'loss/value_avg': 0.6607098579406738, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0142306089401245, 'val/ratio': 1.0878384113311768, 'val/ratio_var': 0.0067269341088831425, 'val/num_eos_tokens': 1, 'lr': 4.93875551200392e-05, 'episode': 104, 'epoch': 0.01}


  1%|▏         | 26/2041 [02:21<2:58:20,  5.31s/it]
                                                                       
  1%|▏         | 27/2041 [02:26<2:58:01,  5.30s/it]

{'eps': 0, 'objective/kl': 46.854286193847656, 'objective/entropy': 50.65544128417969, 'objective/non_score_reward': -2.342714309692383, 'objective/rlhf_reward': -2.5950846672058105, 'objective/scores': -0.2523704171180725, 'policy/approxkl_avg': 0.19358201324939728, 'policy/clipfrac_avg': 0.20518867671489716, 'loss/policy_avg': -0.05571739003062248, 'loss/value_avg': 0.8372774720191956, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9451339244842529, 'val/ratio': 1.1267942190170288, 'val/ratio_var': 0.022556044161319733, 'val/num_eos_tokens': 0, 'lr': 4.9363057324840765e-05, 'episode': 108, 'epoch': 0.01}



                                                                       
  1%|▏         | 28/2041 [02:31<2:57:10,  5.28s/it]

{'eps': 0, 'objective/kl': 48.39067840576172, 'objective/entropy': 33.32194519042969, 'objective/non_score_reward': -2.4195339679718018, 'objective/rlhf_reward': -1.5001277923583984, 'objective/scores': 0.9194061756134033, 'policy/approxkl_avg': 0.08634570240974426, 'policy/clipfrac_avg': 0.1450471729040146, 'loss/policy_avg': -0.041892923414707184, 'loss/value_avg': 0.8595300316810608, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6719427704811096, 'val/ratio': 0.9824534058570862, 'val/ratio_var': 0.00015032982628326863, 'val/num_eos_tokens': 9, 'lr': 4.933855952964234e-05, 'episode': 112, 'epoch': 0.01}



                                                                       
  1%|▏         | 29/2041 [02:36<2:57:25,  5.29s/it]

{'eps': 0, 'objective/kl': 64.06858825683594, 'objective/entropy': 48.472312927246094, 'objective/non_score_reward': -3.2034292221069336, 'objective/rlhf_reward': -2.1616430282592773, 'objective/scores': 1.0417861938476562, 'policy/approxkl_avg': 0.08505875617265701, 'policy/clipfrac_avg': 0.16509434580802917, 'loss/policy_avg': -0.041203297674655914, 'loss/value_avg': 0.7290585041046143, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8710154294967651, 'val/ratio': 0.9795174598693848, 'val/ratio_var': 0.00019885606889147311, 'val/num_eos_tokens': 11, 'lr': 4.93140617344439e-05, 'episode': 116, 'epoch': 0.01}



                                                                       
  1%|▏         | 30/2041 [02:42<2:56:44,  5.27s/it]

{'eps': 0, 'objective/kl': 52.76198196411133, 'objective/entropy': 42.29496765136719, 'objective/non_score_reward': -2.638098955154419, 'objective/rlhf_reward': -2.559666156768799, 'objective/scores': 0.07843279838562012, 'policy/approxkl_avg': 0.11994827538728714, 'policy/clipfrac_avg': 0.21344339847564697, 'loss/policy_avg': -0.05372026562690735, 'loss/value_avg': 0.9736047983169556, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8235772848129272, 'val/ratio': 1.0330787897109985, 'val/ratio_var': 0.0015954429982230067, 'val/num_eos_tokens': 2, 'lr': 4.928956393924547e-05, 'episode': 120, 'epoch': 0.01}



                                                                       
  2%|▏         | 31/2041 [02:47<2:56:34,  5.27s/it]

{'eps': 0, 'objective/kl': 40.34679412841797, 'objective/entropy': 40.85151672363281, 'objective/non_score_reward': -2.0173397064208984, 'objective/rlhf_reward': -1.8807069063186646, 'objective/scores': 0.1366328001022339, 'policy/approxkl_avg': 0.11406157910823822, 'policy/clipfrac_avg': 0.15801885724067688, 'loss/policy_avg': -0.04150959104299545, 'loss/value_avg': 0.61067795753479, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7178907990455627, 'val/ratio': 1.0672357082366943, 'val/ratio_var': 0.0050296480767428875, 'val/num_eos_tokens': 5, 'lr': 4.926506614404704e-05, 'episode': 124, 'epoch': 0.02}



                                                                       
  2%|▏         | 32/2041 [02:52<2:57:13,  5.29s/it]

{'eps': 0, 'objective/kl': 45.65711212158203, 'objective/entropy': 60.3685417175293, 'objective/non_score_reward': -2.28285551071167, 'objective/rlhf_reward': -1.0359084606170654, 'objective/scores': 1.2469470500946045, 'policy/approxkl_avg': 0.8423842787742615, 'policy/clipfrac_avg': 0.24056604504585266, 'loss/policy_avg': -0.05638778954744339, 'loss/value_avg': 0.9402029514312744, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1356496810913086, 'val/ratio': 1.169857144355774, 'val/ratio_var': 0.030365953221917152, 'val/num_eos_tokens': 2, 'lr': 4.9240568348848605e-05, 'episode': 128, 'epoch': 0.02}



                                                                       
  2%|▏         | 33/2041 [02:58<2:57:06,  5.29s/it]

{'eps': 0, 'objective/kl': 56.29810333251953, 'objective/entropy': 70.52581787109375, 'objective/non_score_reward': -2.8149051666259766, 'objective/rlhf_reward': -2.234105110168457, 'objective/scores': 0.5808000564575195, 'policy/approxkl_avg': 0.16596044600009918, 'policy/clipfrac_avg': 0.24056604504585266, 'loss/policy_avg': -0.06716275215148926, 'loss/value_avg': 1.2217425107955933, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3760571479797363, 'val/ratio': 1.0993759632110596, 'val/ratio_var': 0.01716577261686325, 'val/num_eos_tokens': 0, 'lr': 4.921607055365017e-05, 'episode': 132, 'epoch': 0.02}



                                                                       
  2%|▏         | 34/2041 [03:03<2:56:12,  5.27s/it]

{'eps': 0, 'objective/kl': 66.01594543457031, 'objective/entropy': 56.8868408203125, 'objective/non_score_reward': -3.300797462463379, 'objective/rlhf_reward': -3.376115560531616, 'objective/scores': -0.0753180980682373, 'policy/approxkl_avg': 0.2092645764350891, 'policy/clipfrac_avg': 0.2216981053352356, 'loss/policy_avg': -0.06149478256702423, 'loss/value_avg': 1.1309155225753784, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1332528591156006, 'val/ratio': 1.0452086925506592, 'val/ratio_var': 0.0031174386385828257, 'val/num_eos_tokens': 0, 'lr': 4.919157275845174e-05, 'episode': 136, 'epoch': 0.02}



                                                                       
  2%|▏         | 35/2041 [03:08<2:55:38,  5.25s/it]

{'eps': 0, 'objective/kl': 59.849178314208984, 'objective/entropy': 59.5140266418457, 'objective/non_score_reward': -2.9924588203430176, 'objective/rlhf_reward': -3.3209447860717773, 'objective/scores': -0.32848599553108215, 'policy/approxkl_avg': 0.1366877257823944, 'policy/clipfrac_avg': 0.22523584961891174, 'loss/policy_avg': -0.0619676411151886, 'loss/value_avg': 0.7300173044204712, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1549434661865234, 'val/ratio': 1.093688726425171, 'val/ratio_var': 0.00939890369772911, 'val/num_eos_tokens': 1, 'lr': 4.916707496325331e-05, 'episode': 140, 'epoch': 0.02}



                                                                       
  2%|▏         | 36/2041 [03:13<2:55:31,  5.25s/it]

{'eps': 0, 'objective/kl': 65.34236145019531, 'objective/entropy': 37.30182647705078, 'objective/non_score_reward': -3.267117977142334, 'objective/rlhf_reward': -3.113992691040039, 'objective/scores': 0.15312525629997253, 'policy/approxkl_avg': 0.261825293302536, 'policy/clipfrac_avg': 0.1450471729040146, 'loss/policy_avg': -0.039454665035009384, 'loss/value_avg': 0.7331690788269043, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7629231214523315, 'val/ratio': 0.9754407405853271, 'val/ratio_var': 0.00028364741592667997, 'val/num_eos_tokens': 11, 'lr': 4.914257716805488e-05, 'episode': 144, 'epoch': 0.02}



                                                                       
  2%|▏         | 37/2041 [03:18<2:54:57,  5.24s/it]

{'eps': 0, 'objective/kl': 61.51608657836914, 'objective/entropy': 69.83662414550781, 'objective/non_score_reward': -3.0758044719696045, 'objective/rlhf_reward': -3.274963855743408, 'objective/scores': -0.19915932416915894, 'policy/approxkl_avg': 0.21644540131092072, 'policy/clipfrac_avg': 0.2594339847564697, 'loss/policy_avg': -0.06568870693445206, 'loss/value_avg': 1.0602662563323975, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.32108473777771, 'val/ratio': 1.1433658599853516, 'val/ratio_var': 0.03781026974320412, 'val/num_eos_tokens': 3, 'lr': 4.9118079372856445e-05, 'episode': 148, 'epoch': 0.02}



                                                                       
  2%|▏         | 38/2041 [03:24<2:56:27,  5.29s/it]

{'eps': 0, 'objective/kl': 52.75434494018555, 'objective/entropy': 58.35750961303711, 'objective/non_score_reward': -2.6377172470092773, 'objective/rlhf_reward': -2.4762110710144043, 'objective/scores': 0.16150611639022827, 'policy/approxkl_avg': 0.32851043343544006, 'policy/clipfrac_avg': 0.22523584961891174, 'loss/policy_avg': -0.05546233803033829, 'loss/value_avg': 0.7321254014968872, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.043818712234497, 'val/ratio': 1.0591065883636475, 'val/ratio_var': 0.004037593957036734, 'val/num_eos_tokens': 3, 'lr': 4.909358157765801e-05, 'episode': 152, 'epoch': 0.02}



                                                                       
  2%|▏         | 39/2041 [03:29<2:54:30,  5.23s/it]

{'eps': 0, 'objective/kl': 64.63606262207031, 'objective/entropy': 51.756805419921875, 'objective/non_score_reward': -3.2318029403686523, 'objective/rlhf_reward': -3.389423131942749, 'objective/scores': -0.15762026607990265, 'policy/approxkl_avg': 0.1700330525636673, 'policy/clipfrac_avg': 0.2228773683309555, 'loss/policy_avg': -0.05720552057027817, 'loss/value_avg': 0.7828627228736877, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9923031330108643, 'val/ratio': 1.0175676345825195, 'val/ratio_var': 0.00043170456774532795, 'val/num_eos_tokens': 7, 'lr': 4.906908378245958e-05, 'episode': 156, 'epoch': 0.02}



                                                                       
  2%|▏         | 40/2041 [03:34<2:53:28,  5.20s/it]

{'eps': 0, 'objective/kl': 41.38330841064453, 'objective/entropy': 62.72875213623047, 'objective/non_score_reward': -2.0691654682159424, 'objective/rlhf_reward': -2.5544910430908203, 'objective/scores': -0.48532551527023315, 'policy/approxkl_avg': 0.0968170240521431, 'policy/clipfrac_avg': 0.23466981947422028, 'loss/policy_avg': -0.05071456730365753, 'loss/value_avg': 0.7429488897323608, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9591760635375977, 'val/ratio': 1.0858972072601318, 'val/ratio_var': 0.009552652947604656, 'val/num_eos_tokens': 4, 'lr': 4.904458598726115e-05, 'episode': 160, 'epoch': 0.02}



                                                                       
  2%|▏         | 41/2041 [03:39<2:52:44,  5.18s/it]

{'eps': 0, 'objective/kl': 53.642242431640625, 'objective/entropy': 36.59233093261719, 'objective/non_score_reward': -2.682112216949463, 'objective/rlhf_reward': -2.4976913928985596, 'objective/scores': 0.1844208687543869, 'policy/approxkl_avg': 0.4156128168106079, 'policy/clipfrac_avg': 0.19221697747707367, 'loss/policy_avg': -0.05193200707435608, 'loss/value_avg': 0.873307466506958, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6791143417358398, 'val/ratio': 1.0049197673797607, 'val/ratio_var': 0.0006517008878290653, 'val/num_eos_tokens': 4, 'lr': 4.902008819206272e-05, 'episode': 164, 'epoch': 0.02}



                                                                       
  2%|▏         | 42/2041 [03:44<2:53:09,  5.20s/it]

{'eps': 0, 'objective/kl': 36.995506286621094, 'objective/entropy': 33.96087646484375, 'objective/non_score_reward': -1.8497755527496338, 'objective/rlhf_reward': -0.668891191482544, 'objective/scores': 1.1808843612670898, 'policy/approxkl_avg': 0.09129426628351212, 'policy/clipfrac_avg': 0.173349067568779, 'loss/policy_avg': -0.04297615960240364, 'loss/value_avg': 0.5814544558525085, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6295493841171265, 'val/ratio': 1.0264594554901123, 'val/ratio_var': 0.0008279900066554546, 'val/num_eos_tokens': 0, 'lr': 4.8995590396864285e-05, 'episode': 168, 'epoch': 0.02}



                                                                       
  2%|▏         | 43/2041 [03:50<2:53:26,  5.21s/it]

{'eps': 0, 'objective/kl': 39.469444274902344, 'objective/entropy': 53.80306625366211, 'objective/non_score_reward': -1.973472237586975, 'objective/rlhf_reward': -1.3442761898040771, 'objective/scores': 0.629196047782898, 'policy/approxkl_avg': 0.08756843209266663, 'policy/clipfrac_avg': 0.21698112785816193, 'loss/policy_avg': -0.050021566450595856, 'loss/value_avg': 0.6727373600006104, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9512745141983032, 'val/ratio': 1.0299922227859497, 'val/ratio_var': 0.0009489532676525414, 'val/num_eos_tokens': 0, 'lr': 4.8971092601665853e-05, 'episode': 172, 'epoch': 0.02}



                                                                       
  2%|▏         | 44/2041 [03:55<2:52:33,  5.18s/it]

{'eps': 0, 'objective/kl': 56.12469482421875, 'objective/entropy': 32.51464080810547, 'objective/non_score_reward': -2.806234836578369, 'objective/rlhf_reward': -2.0061161518096924, 'objective/scores': 0.800118625164032, 'policy/approxkl_avg': 0.2106834053993225, 'policy/clipfrac_avg': 0.15094339847564697, 'loss/policy_avg': -0.03764237090945244, 'loss/value_avg': 0.988024115562439, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7227562665939331, 'val/ratio': 1.1612980365753174, 'val/ratio_var': 0.024811161682009697, 'val/num_eos_tokens': 0, 'lr': 4.894659480646742e-05, 'episode': 176, 'epoch': 0.02}



                                                                       
  2%|▏         | 45/2041 [04:00<2:52:34,  5.19s/it]

{'eps': 0, 'objective/kl': 45.59040069580078, 'objective/entropy': 36.04511260986328, 'objective/non_score_reward': -2.279520034790039, 'objective/rlhf_reward': -2.5128109455108643, 'objective/scores': -0.23329082131385803, 'policy/approxkl_avg': 0.05305524542927742, 'policy/clipfrac_avg': 0.17806603014469147, 'loss/policy_avg': -0.048865824937820435, 'loss/value_avg': 0.8252246975898743, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7152658104896545, 'val/ratio': 1.0442134141921997, 'val/ratio_var': 0.001874361652880907, 'val/num_eos_tokens': 0, 'lr': 4.892209701126898e-05, 'episode': 180, 'epoch': 0.02}



                                                                       
  2%|▏         | 46/2041 [04:05<2:52:13,  5.18s/it]

{'eps': 0, 'objective/kl': 46.44313430786133, 'objective/entropy': 37.116546630859375, 'objective/non_score_reward': -2.3221569061279297, 'objective/rlhf_reward': -2.3110125064849854, 'objective/scores': 0.011144295334815979, 'policy/approxkl_avg': 0.0586528554558754, 'policy/clipfrac_avg': 0.1804245263338089, 'loss/policy_avg': -0.047291144728660583, 'loss/value_avg': 0.6353886723518372, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8337177634239197, 'val/ratio': 1.0136091709136963, 'val/ratio_var': 0.00030728281126357615, 'val/num_eos_tokens': 0, 'lr': 4.889759921607056e-05, 'episode': 184, 'epoch': 0.02}



                                                                       
  2%|▏         | 47/2041 [04:10<2:51:47,  5.17s/it]

{'eps': 0, 'objective/kl': 45.926841735839844, 'objective/entropy': 51.474021911621094, 'objective/non_score_reward': -2.296341896057129, 'objective/rlhf_reward': -2.9527781009674072, 'objective/scores': -0.6564361453056335, 'policy/approxkl_avg': 0.1392648071050644, 'policy/clipfrac_avg': 0.21698114275932312, 'loss/policy_avg': -0.0562322661280632, 'loss/value_avg': 0.6844065189361572, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9683575630187988, 'val/ratio': 1.025185465812683, 'val/ratio_var': 0.0007838497986085713, 'val/num_eos_tokens': 0, 'lr': 4.8873101420872126e-05, 'episode': 188, 'epoch': 0.02}



                                                                       
  2%|▏         | 48/2041 [04:15<2:51:33,  5.16s/it]

{'eps': 0, 'objective/kl': 46.7793083190918, 'objective/entropy': 55.517356872558594, 'objective/non_score_reward': -2.33896541595459, 'objective/rlhf_reward': -1.1600558757781982, 'objective/scores': 1.1789095401763916, 'policy/approxkl_avg': 0.11911026388406754, 'policy/clipfrac_avg': 0.25471699237823486, 'loss/policy_avg': -0.06107188016176224, 'loss/value_avg': 0.7598789930343628, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.965000331401825, 'val/ratio': 1.220211148262024, 'val/ratio_var': 0.06071345880627632, 'val/num_eos_tokens': 0, 'lr': 4.884860362567369e-05, 'episode': 192, 'epoch': 0.02}



                                                                       
  2%|▏         | 49/2041 [04:20<2:50:23,  5.13s/it]

{'eps': 0, 'objective/kl': 60.4233512878418, 'objective/entropy': 48.28956604003906, 'objective/non_score_reward': -3.021167755126953, 'objective/rlhf_reward': -3.2935030460357666, 'objective/scores': -0.2723352313041687, 'policy/approxkl_avg': 0.19035080075263977, 'policy/clipfrac_avg': 0.21816037595272064, 'loss/policy_avg': -0.061372123658657074, 'loss/value_avg': 0.9888601899147034, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8720539212226868, 'val/ratio': 1.1227912902832031, 'val/ratio_var': 0.016436045989394188, 'val/num_eos_tokens': 0, 'lr': 4.882410583047526e-05, 'episode': 196, 'epoch': 0.02}



                                                                       
  2%|▏         | 50/2041 [04:26<2:50:03,  5.13s/it]

{'eps': 0, 'objective/kl': 57.113121032714844, 'objective/entropy': 47.067161560058594, 'objective/non_score_reward': -2.855656147003174, 'objective/rlhf_reward': -2.1422252655029297, 'objective/scores': 0.7134308815002441, 'policy/approxkl_avg': 0.25188741087913513, 'policy/clipfrac_avg': 0.2146226465702057, 'loss/policy_avg': -0.05191222205758095, 'loss/value_avg': 0.7369890809059143, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9151182174682617, 'val/ratio': 1.0327115058898926, 'val/ratio_var': 0.0016269097104668617, 'val/num_eos_tokens': 0, 'lr': 4.879960803527683e-05, 'episode': 200, 'epoch': 0.02}



                                                                       
  2%|▏         | 51/2041 [04:31<2:49:07,  5.10s/it]

{'eps': 0, 'objective/kl': 52.332942962646484, 'objective/entropy': 50.81925964355469, 'objective/non_score_reward': -2.616647243499756, 'objective/rlhf_reward': -2.148780107498169, 'objective/scores': 0.4678671956062317, 'policy/approxkl_avg': 0.230915367603302, 'policy/clipfrac_avg': 0.2558962404727936, 'loss/policy_avg': -0.06466354429721832, 'loss/value_avg': 0.5854470729827881, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8545690774917603, 'val/ratio': 1.050854206085205, 'val/ratio_var': 0.003845118684694171, 'val/num_eos_tokens': 0, 'lr': 4.87751102400784e-05, 'episode': 204, 'epoch': 0.02}



                                                                       
  3%|▎         | 52/2041 [04:36<2:48:56,  5.10s/it]

{'eps': 0, 'objective/kl': 56.894229888916016, 'objective/entropy': 41.8107795715332, 'objective/non_score_reward': -2.8447117805480957, 'objective/rlhf_reward': -1.8577563762664795, 'objective/scores': 0.9869553446769714, 'policy/approxkl_avg': 0.1026611402630806, 'policy/clipfrac_avg': 0.22405660152435303, 'loss/policy_avg': -0.05386516451835632, 'loss/value_avg': 0.6135686635971069, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7879804968833923, 'val/ratio': 1.0175001621246338, 'val/ratio_var': 0.0006048650247976184, 'val/num_eos_tokens': 0, 'lr': 4.875061244487996e-05, 'episode': 208, 'epoch': 0.03}



                                                                       
  3%|▎         | 53/2041 [04:41<2:49:36,  5.12s/it]

{'eps': 0, 'objective/kl': 69.57588958740234, 'objective/entropy': 55.19133377075195, 'objective/non_score_reward': -3.4787943363189697, 'objective/rlhf_reward': -3.5233144760131836, 'objective/scores': -0.04452018439769745, 'policy/approxkl_avg': 0.3676418960094452, 'policy/clipfrac_avg': 0.258254736661911, 'loss/policy_avg': -0.060101479291915894, 'loss/value_avg': 0.7252563238143921, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0108509063720703, 'val/ratio': 1.1131811141967773, 'val/ratio_var': 0.011498277075588703, 'val/num_eos_tokens': 0, 'lr': 4.8726114649681534e-05, 'episode': 212, 'epoch': 0.03}



                                                                       
  3%|▎         | 54/2041 [04:46<2:49:11,  5.11s/it]

{'eps': 0, 'objective/kl': 64.861572265625, 'objective/entropy': 63.563568115234375, 'objective/non_score_reward': -3.2430787086486816, 'objective/rlhf_reward': -2.8576605319976807, 'objective/scores': 0.3854181170463562, 'policy/approxkl_avg': 0.1670456975698471, 'policy/clipfrac_avg': 0.2853773534297943, 'loss/policy_avg': -0.06386194378137589, 'loss/value_avg': 0.7149751782417297, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.182479977607727, 'val/ratio': 1.0730888843536377, 'val/ratio_var': 0.006825319025665522, 'val/num_eos_tokens': 0, 'lr': 4.87016168544831e-05, 'episode': 216, 'epoch': 0.03}



                                                                       
  3%|▎         | 55/2041 [04:51<2:49:31,  5.12s/it]

{'eps': 0, 'objective/kl': 67.51350402832031, 'objective/entropy': 53.652793884277344, 'objective/non_score_reward': -3.3756752014160156, 'objective/rlhf_reward': -3.2021446228027344, 'objective/scores': 0.17353051900863647, 'policy/approxkl_avg': 0.14089272916316986, 'policy/clipfrac_avg': 0.24646227061748505, 'loss/policy_avg': -0.05983084440231323, 'loss/value_avg': 0.7571044564247131, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9408760070800781, 'val/ratio': 1.101198673248291, 'val/ratio_var': 0.012496140785515308, 'val/num_eos_tokens': 0, 'lr': 4.867711905928466e-05, 'episode': 220, 'epoch': 0.03}



                                                                       
  3%|▎         | 56/2041 [04:56<2:50:17,  5.15s/it]

{'eps': 0, 'objective/kl': 69.68486022949219, 'objective/entropy': 44.793006896972656, 'objective/non_score_reward': -3.4842429161071777, 'objective/rlhf_reward': -4.422258377075195, 'objective/scores': -0.9380154013633728, 'policy/approxkl_avg': 0.08099275827407837, 'policy/clipfrac_avg': 0.21933962404727936, 'loss/policy_avg': -0.05540265887975693, 'loss/value_avg': 0.7497885227203369, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9168686866760254, 'val/ratio': 1.0605032444000244, 'val/ratio_var': 0.003877603216096759, 'val/num_eos_tokens': 0, 'lr': 4.865262126408623e-05, 'episode': 224, 'epoch': 0.03}



                                                                       
  3%|▎         | 57/2041 [05:02<2:50:41,  5.16s/it]

{'eps': 0, 'objective/kl': 85.84107971191406, 'objective/entropy': 55.655792236328125, 'objective/non_score_reward': -4.29205322265625, 'objective/rlhf_reward': -4.8789448738098145, 'objective/scores': -0.5868917107582092, 'policy/approxkl_avg': 0.3179788291454315, 'policy/clipfrac_avg': 0.26061320304870605, 'loss/policy_avg': -0.05892467498779297, 'loss/value_avg': 1.0367718935012817, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.941003143787384, 'val/ratio': 1.403209924697876, 'val/ratio_var': 0.23173360526561737, 'val/num_eos_tokens': 0, 'lr': 4.8628123468887806e-05, 'episode': 228, 'epoch': 0.03}



                                                                       
  3%|▎         | 58/2041 [05:07<2:49:42,  5.13s/it]

{'eps': 0, 'objective/kl': 85.50189208984375, 'objective/entropy': 37.71035385131836, 'objective/non_score_reward': -4.275094985961914, 'objective/rlhf_reward': -4.69952917098999, 'objective/scores': -0.4244340658187866, 'policy/approxkl_avg': 0.6415836811065674, 'policy/clipfrac_avg': 0.18985848128795624, 'loss/policy_avg': -0.04037874937057495, 'loss/value_avg': 1.5850309133529663, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7351740598678589, 'val/ratio': 0.9809470176696777, 'val/ratio_var': 0.0002808256249409169, 'val/num_eos_tokens': 0, 'lr': 4.860362567368937e-05, 'episode': 232, 'epoch': 0.03}



                                                                       
  3%|▎         | 59/2041 [05:12<2:48:49,  5.11s/it]

{'eps': 0, 'objective/kl': 80.52165222167969, 'objective/entropy': 45.956092834472656, 'objective/non_score_reward': -4.026082992553711, 'objective/rlhf_reward': -3.953430414199829, 'objective/scores': 0.0726526752114296, 'policy/approxkl_avg': 0.33334001898765564, 'policy/clipfrac_avg': 0.20400944352149963, 'loss/policy_avg': -0.055732615292072296, 'loss/value_avg': 0.8885297179222107, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9183944463729858, 'val/ratio': 0.9948509931564331, 'val/ratio_var': 0.0004465295060072094, 'val/num_eos_tokens': 0, 'lr': 4.8579127878490935e-05, 'episode': 236, 'epoch': 0.03}



                                                                       
  3%|▎         | 60/2041 [05:17<2:49:09,  5.12s/it]

{'eps': 0, 'objective/kl': 89.69583892822266, 'objective/entropy': 63.577919006347656, 'objective/non_score_reward': -4.4847917556762695, 'objective/rlhf_reward': -3.6590917110443115, 'objective/scores': 0.825700044631958, 'policy/approxkl_avg': 0.2606881558895111, 'policy/clipfrac_avg': 0.24410377442836761, 'loss/policy_avg': -0.05062730982899666, 'loss/value_avg': 1.3208951950073242, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1007091999053955, 'val/ratio': 0.9942436218261719, 'val/ratio_var': 0.0009416866232641041, 'val/num_eos_tokens': 0, 'lr': 4.855463008329251e-05, 'episode': 240, 'epoch': 0.03}



                                                                       
  3%|▎         | 61/2041 [05:22<2:48:28,  5.11s/it]

{'eps': 0, 'objective/kl': 75.70446014404297, 'objective/entropy': 84.04356384277344, 'objective/non_score_reward': -3.7852227687835693, 'objective/rlhf_reward': -4.552806377410889, 'objective/scores': -0.7675834894180298, 'policy/approxkl_avg': 0.40273308753967285, 'policy/clipfrac_avg': 0.3207547068595886, 'loss/policy_avg': -0.07098449766635895, 'loss/value_avg': 1.5924904346466064, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.386101484298706, 'val/ratio': 1.3991769552230835, 'val/ratio_var': 0.11100766807794571, 'val/num_eos_tokens': 0, 'lr': 4.853013228809407e-05, 'episode': 244, 'epoch': 0.03}



                                                                       
  3%|▎         | 62/2041 [05:27<2:48:18,  5.10s/it]

{'eps': 0, 'objective/kl': 72.63731384277344, 'objective/entropy': 36.81572341918945, 'objective/non_score_reward': -3.631866216659546, 'objective/rlhf_reward': -4.460101127624512, 'objective/scores': -0.8282350301742554, 'policy/approxkl_avg': 0.2811163663864136, 'policy/clipfrac_avg': 0.1804245412349701, 'loss/policy_avg': -0.046516284346580505, 'loss/value_avg': 1.2561615705490112, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8206644058227539, 'val/ratio': 1.0171483755111694, 'val/ratio_var': 0.0008942953427322209, 'val/num_eos_tokens': 0, 'lr': 4.850563449289564e-05, 'episode': 248, 'epoch': 0.03}



                                                                       
  3%|▎         | 63/2041 [05:33<2:55:47,  5.33s/it]

{'eps': 0, 'objective/kl': 85.54011535644531, 'objective/entropy': 61.68926239013672, 'objective/non_score_reward': -4.277005672454834, 'objective/rlhf_reward': -3.7000892162323, 'objective/scores': 0.5769163966178894, 'policy/approxkl_avg': 0.19112950563430786, 'policy/clipfrac_avg': 0.22523584961891174, 'loss/policy_avg': -0.05629536509513855, 'loss/value_avg': 0.9131704568862915, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.121001958847046, 'val/ratio': 1.1074151992797852, 'val/ratio_var': 0.012206189334392548, 'val/num_eos_tokens': 1, 'lr': 4.848113669769721e-05, 'episode': 252, 'epoch': 0.03}



                                                                       
  3%|▎         | 64/2041 [05:38<2:56:09,  5.35s/it]

{'eps': 0, 'objective/kl': 82.66512298583984, 'objective/entropy': 52.01161193847656, 'objective/non_score_reward': -4.133256435394287, 'objective/rlhf_reward': -3.6261699199676514, 'objective/scores': 0.5070865154266357, 'policy/approxkl_avg': 0.29794326424598694, 'policy/clipfrac_avg': 0.19339622557163239, 'loss/policy_avg': -0.042903266847133636, 'loss/value_avg': 0.8258242607116699, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9410172700881958, 'val/ratio': 0.9744440317153931, 'val/ratio_var': 0.00029799912590533495, 'val/num_eos_tokens': 0, 'lr': 4.845663890249878e-05, 'episode': 256, 'epoch': 0.03}



                                                                       
  3%|▎         | 65/2041 [05:43<2:53:16,  5.26s/it]

{'eps': 0, 'objective/kl': 106.17549133300781, 'objective/entropy': 67.32087707519531, 'objective/non_score_reward': -5.308774471282959, 'objective/rlhf_reward': -4.860734462738037, 'objective/scores': 0.44803985953330994, 'policy/approxkl_avg': 0.31220677495002747, 'policy/clipfrac_avg': 0.2641509473323822, 'loss/policy_avg': -0.06455342471599579, 'loss/value_avg': 1.0875356197357178, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2368288040161133, 'val/ratio': 1.029861330986023, 'val/ratio_var': 0.0017944738501682878, 'val/num_eos_tokens': 0, 'lr': 4.8432141107300344e-05, 'episode': 260, 'epoch': 0.03}



                                                                       
  3%|▎         | 66/2041 [05:48<2:52:50,  5.25s/it]

{'eps': 0, 'objective/kl': 78.92178344726562, 'objective/entropy': 52.67555236816406, 'objective/non_score_reward': -3.946089267730713, 'objective/rlhf_reward': -3.8816637992858887, 'objective/scores': 0.0644254982471466, 'policy/approxkl_avg': 0.25227007269859314, 'policy/clipfrac_avg': 0.2358490526676178, 'loss/policy_avg': -0.054237332195043564, 'loss/value_avg': 0.8511316776275635, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0382091999053955, 'val/ratio': 1.1550421714782715, 'val/ratio_var': 0.01878884807229042, 'val/num_eos_tokens': 0, 'lr': 4.840764331210191e-05, 'episode': 264, 'epoch': 0.03}



                                                                       
  3%|▎         | 67/2041 [05:54<2:52:25,  5.24s/it]

{'eps': 0, 'objective/kl': 65.40862274169922, 'objective/entropy': 54.503501892089844, 'objective/non_score_reward': -3.2704315185546875, 'objective/rlhf_reward': -3.884322166442871, 'objective/scores': -0.6138907074928284, 'policy/approxkl_avg': 0.2009282112121582, 'policy/clipfrac_avg': 0.24646227061748505, 'loss/policy_avg': -0.06039934232831001, 'loss/value_avg': 0.902823269367218, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8814915418624878, 'val/ratio': 1.0140882730484009, 'val/ratio_var': 0.0007680314010940492, 'val/num_eos_tokens': 0, 'lr': 4.838314551690348e-05, 'episode': 268, 'epoch': 0.03}



                                                                       
  3%|▎         | 68/2041 [05:59<2:52:24,  5.24s/it]

{'eps': 0, 'objective/kl': 74.6845703125, 'objective/entropy': 45.811683654785156, 'objective/non_score_reward': -3.7342286109924316, 'objective/rlhf_reward': -4.052912712097168, 'objective/scores': -0.3186842203140259, 'policy/approxkl_avg': 0.3665202260017395, 'policy/clipfrac_avg': 0.21816039085388184, 'loss/policy_avg': -0.052147116512060165, 'loss/value_avg': 0.9080203771591187, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8938890695571899, 'val/ratio': 1.111307978630066, 'val/ratio_var': 0.020813604816794395, 'val/num_eos_tokens': 0, 'lr': 4.835864772170505e-05, 'episode': 272, 'epoch': 0.03}



                                                                       
  3%|▎         | 69/2041 [06:04<2:53:19,  5.27s/it]

{'eps': 0, 'objective/kl': 60.53150939941406, 'objective/entropy': 27.232759475708008, 'objective/non_score_reward': -3.0265753269195557, 'objective/rlhf_reward': -2.470757246017456, 'objective/scores': 0.5558180212974548, 'policy/approxkl_avg': 0.13620859384536743, 'policy/clipfrac_avg': 0.12971697747707367, 'loss/policy_avg': -0.03256983309984207, 'loss/value_avg': 0.9005425572395325, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6191612482070923, 'val/ratio': 0.9766124486923218, 'val/ratio_var': 0.000270073622232303, 'val/num_eos_tokens': 0, 'lr': 4.8334149926506616e-05, 'episode': 276, 'epoch': 0.03}



                                                                       
  3%|▎         | 70/2041 [06:10<2:53:36,  5.28s/it]

{'eps': 0, 'objective/kl': 74.2793960571289, 'objective/entropy': 66.33633422851562, 'objective/non_score_reward': -3.7139699459075928, 'objective/rlhf_reward': -3.7953808307647705, 'objective/scores': -0.08141081035137177, 'policy/approxkl_avg': 0.3448449373245239, 'policy/clipfrac_avg': 0.2547169625759125, 'loss/policy_avg': -0.0612526461482048, 'loss/value_avg': 0.9885190725326538, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2269484996795654, 'val/ratio': 1.0244619846343994, 'val/ratio_var': 0.0014596503460779786, 'val/num_eos_tokens': 0, 'lr': 4.8309652131308184e-05, 'episode': 280, 'epoch': 0.03}



                                                                       
  3%|▎         | 71/2041 [06:15<2:53:00,  5.27s/it]

{'eps': 0, 'objective/kl': 63.03851318359375, 'objective/entropy': 47.624977111816406, 'objective/non_score_reward': -3.151925802230835, 'objective/rlhf_reward': -2.770231246948242, 'objective/scores': 0.38169464468955994, 'policy/approxkl_avg': 0.2224912941455841, 'policy/clipfrac_avg': 0.20990565419197083, 'loss/policy_avg': -0.05459999665617943, 'loss/value_avg': 0.7482390403747559, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9596070647239685, 'val/ratio': 1.0332738161087036, 'val/ratio_var': 0.0024929626379162073, 'val/num_eos_tokens': 0, 'lr': 4.828515433610975e-05, 'episode': 284, 'epoch': 0.03}



                                                                       
  4%|▎         | 72/2041 [06:20<2:53:33,  5.29s/it]

{'eps': 0, 'objective/kl': 68.87100219726562, 'objective/entropy': 43.74176788330078, 'objective/non_score_reward': -3.4435501098632812, 'objective/rlhf_reward': -3.3612539768218994, 'objective/scores': 0.08229613304138184, 'policy/approxkl_avg': 0.120430126786232, 'policy/clipfrac_avg': 0.24174529314041138, 'loss/policy_avg': -0.05994357913732529, 'loss/value_avg': 0.8934577703475952, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0106067657470703, 'val/ratio': 1.0283864736557007, 'val/ratio_var': 0.0006682683597318828, 'val/num_eos_tokens': 0, 'lr': 4.826065654091132e-05, 'episode': 288, 'epoch': 0.04}



                                                                       
  4%|▎         | 73/2041 [06:26<2:53:53,  5.30s/it]

{'eps': 0, 'objective/kl': 72.64364624023438, 'objective/entropy': 48.942909240722656, 'objective/non_score_reward': -3.6321823596954346, 'objective/rlhf_reward': -4.7556376457214355, 'objective/scores': -1.1234551668167114, 'policy/approxkl_avg': 0.29049691557884216, 'policy/clipfrac_avg': 0.23938678205013275, 'loss/policy_avg': -0.05590273439884186, 'loss/value_avg': 1.065924048423767, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0483909845352173, 'val/ratio': 1.0835866928100586, 'val/ratio_var': 0.0080487672239542, 'val/num_eos_tokens': 0, 'lr': 4.823615874571289e-05, 'episode': 292, 'epoch': 0.04}



                                                                       
  4%|▎         | 74/2041 [06:31<2:53:16,  5.29s/it]

{'eps': 0, 'objective/kl': 90.53860473632812, 'objective/entropy': 80.37872314453125, 'objective/non_score_reward': -4.526930332183838, 'objective/rlhf_reward': -5.540011405944824, 'objective/scores': -1.0130810737609863, 'policy/approxkl_avg': 0.1979767382144928, 'policy/clipfrac_avg': 0.27476415038108826, 'loss/policy_avg': -0.05672137811779976, 'loss/value_avg': 1.1718697547912598, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2156128883361816, 'val/ratio': 1.5351886749267578, 'val/ratio_var': 0.34710565209388733, 'val/num_eos_tokens': 0, 'lr': 4.8211660950514456e-05, 'episode': 296, 'epoch': 0.04}



                                                                       
  4%|▎         | 75/2041 [06:36<2:52:41,  5.27s/it]

{'eps': 0, 'objective/kl': 76.83899688720703, 'objective/entropy': 58.32103729248047, 'objective/non_score_reward': -3.841949939727783, 'objective/rlhf_reward': -2.987506866455078, 'objective/scores': 0.8544429540634155, 'policy/approxkl_avg': 0.2021925002336502, 'policy/clipfrac_avg': 0.23938679695129395, 'loss/policy_avg': -0.05357068404555321, 'loss/value_avg': 0.9180433750152588, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0878088474273682, 'val/ratio': 1.2325825691223145, 'val/ratio_var': 0.049262940883636475, 'val/num_eos_tokens': 0, 'lr': 4.8187163155316024e-05, 'episode': 300, 'epoch': 0.04}



                                                                       
  4%|▎         | 76/2041 [06:41<2:53:22,  5.29s/it]

{'eps': 0, 'objective/kl': 69.91819763183594, 'objective/entropy': 42.65401840209961, 'objective/non_score_reward': -3.4959099292755127, 'objective/rlhf_reward': -2.583669662475586, 'objective/scores': 0.9122401475906372, 'policy/approxkl_avg': 0.15072327852249146, 'policy/clipfrac_avg': 0.14150944352149963, 'loss/policy_avg': -0.045874837785959244, 'loss/value_avg': 0.5045914649963379, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7587157487869263, 'val/ratio': 1.1621108055114746, 'val/ratio_var': 0.02455122210085392, 'val/num_eos_tokens': 0, 'lr': 4.816266536011759e-05, 'episode': 304, 'epoch': 0.04}



                                                                       
  4%|▍         | 77/2041 [06:47<2:53:32,  5.30s/it]

{'eps': 0, 'objective/kl': 54.44401168823242, 'objective/entropy': 32.88902282714844, 'objective/non_score_reward': -2.722200870513916, 'objective/rlhf_reward': -3.1048638820648193, 'objective/scores': -0.3826630711555481, 'policy/approxkl_avg': 0.1552795022726059, 'policy/clipfrac_avg': 0.15448114275932312, 'loss/policy_avg': -0.03801558166742325, 'loss/value_avg': 0.7086120843887329, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5795201063156128, 'val/ratio': 1.0769538879394531, 'val/ratio_var': 0.0076141697354614735, 'val/num_eos_tokens': 0, 'lr': 4.813816756491916e-05, 'episode': 308, 'epoch': 0.04}



                                                                       
  4%|▍         | 78/2041 [06:52<2:52:01,  5.26s/it]

{'eps': 0, 'objective/kl': 66.68161010742188, 'objective/entropy': 59.028865814208984, 'objective/non_score_reward': -3.334080696105957, 'objective/rlhf_reward': -2.149341583251953, 'objective/scores': 1.184739112854004, 'policy/approxkl_avg': 0.3800867795944214, 'policy/clipfrac_avg': 0.2158018797636032, 'loss/policy_avg': -0.04902910813689232, 'loss/value_avg': 0.6005418300628662, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9895386695861816, 'val/ratio': 1.042565107345581, 'val/ratio_var': 0.0026127786841243505, 'val/num_eos_tokens': 0, 'lr': 4.811366976972073e-05, 'episode': 312, 'epoch': 0.04}



                                                                       
  4%|▍         | 79/2041 [06:57<2:51:20,  5.24s/it]

{'eps': 0, 'objective/kl': 71.48155975341797, 'objective/entropy': 58.027069091796875, 'objective/non_score_reward': -3.57407808303833, 'objective/rlhf_reward': -2.6427390575408936, 'objective/scores': 0.9313390851020813, 'policy/approxkl_avg': 0.10947168618440628, 'policy/clipfrac_avg': 0.2429245263338089, 'loss/policy_avg': -0.06264165788888931, 'loss/value_avg': 0.5912364721298218, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1463805437088013, 'val/ratio': 1.0520453453063965, 'val/ratio_var': 0.002513585379347205, 'val/num_eos_tokens': 0, 'lr': 4.8089171974522296e-05, 'episode': 316, 'epoch': 0.04}



                                                                       
  4%|▍         | 80/2041 [07:02<2:50:01,  5.20s/it]

{'eps': 0, 'objective/kl': 79.99507141113281, 'objective/entropy': 65.6357421875, 'objective/non_score_reward': -3.99975323677063, 'objective/rlhf_reward': -3.6736738681793213, 'objective/scores': 0.3260793685913086, 'policy/approxkl_avg': 0.08413567394018173, 'policy/clipfrac_avg': 0.22523584961891174, 'loss/policy_avg': -0.05401678383350372, 'loss/value_avg': 0.943327009677887, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.153799057006836, 'val/ratio': 1.0434749126434326, 'val/ratio_var': 0.0018960750894621015, 'val/num_eos_tokens': 0, 'lr': 4.8064674179323864e-05, 'episode': 320, 'epoch': 0.04}



                                                                       
  4%|▍         | 81/2041 [07:07<2:49:17,  5.18s/it]

{'eps': 0, 'objective/kl': 84.18083953857422, 'objective/entropy': 59.26097869873047, 'objective/non_score_reward': -4.209042072296143, 'objective/rlhf_reward': -4.612560272216797, 'objective/scores': -0.4035181403160095, 'policy/approxkl_avg': 0.1975404918193817, 'policy/clipfrac_avg': 0.2570754885673523, 'loss/policy_avg': -0.06038825958967209, 'loss/value_avg': 1.0077167749404907, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.193320393562317, 'val/ratio': 1.0167787075042725, 'val/ratio_var': 0.0008223732002079487, 'val/num_eos_tokens': 0, 'lr': 4.804017638412543e-05, 'episode': 324, 'epoch': 0.04}



                                                                       
  4%|▍         | 82/2041 [07:13<2:49:40,  5.20s/it]

{'eps': 0, 'objective/kl': 58.71852111816406, 'objective/entropy': 47.647010803222656, 'objective/non_score_reward': -2.9359259605407715, 'objective/rlhf_reward': -2.5206565856933594, 'objective/scores': 0.41526949405670166, 'policy/approxkl_avg': 0.042658474296331406, 'policy/clipfrac_avg': 0.15566037595272064, 'loss/policy_avg': -0.04175140708684921, 'loss/value_avg': 0.39097774028778076, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8518747687339783, 'val/ratio': 1.01084566116333, 'val/ratio_var': 0.00025258914683945477, 'val/num_eos_tokens': 0, 'lr': 4.8015678588927e-05, 'episode': 328, 'epoch': 0.04}



                                                                       
  4%|▍         | 83/2041 [07:18<2:49:40,  5.20s/it]

{'eps': 0, 'objective/kl': 80.93089294433594, 'objective/entropy': 54.769126892089844, 'objective/non_score_reward': -4.046544075012207, 'objective/rlhf_reward': -4.28816556930542, 'objective/scores': -0.24162133038043976, 'policy/approxkl_avg': 0.31667226552963257, 'policy/clipfrac_avg': 0.21933962404727936, 'loss/policy_avg': -0.05482270568609238, 'loss/value_avg': 0.6743508577346802, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9910334348678589, 'val/ratio': 1.0651546716690063, 'val/ratio_var': 0.006733700167387724, 'val/num_eos_tokens': 0, 'lr': 4.799118079372857e-05, 'episode': 332, 'epoch': 0.04}



                                                                       
  4%|▍         | 84/2041 [07:23<2:49:23,  5.19s/it]

{'eps': 0, 'objective/kl': 74.40689086914062, 'objective/entropy': 74.7123794555664, 'objective/non_score_reward': -3.7203445434570312, 'objective/rlhf_reward': -3.8575499057769775, 'objective/scores': -0.1372053325176239, 'policy/approxkl_avg': 0.16966331005096436, 'policy/clipfrac_avg': 0.27358490228652954, 'loss/policy_avg': -0.06505490839481354, 'loss/value_avg': 0.714245617389679, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2413643598556519, 'val/ratio': 1.0432994365692139, 'val/ratio_var': 0.006270062178373337, 'val/num_eos_tokens': 0, 'lr': 4.796668299853013e-05, 'episode': 336, 'epoch': 0.04}



                                                                       
  4%|▍         | 85/2041 [07:28<2:48:55,  5.18s/it]

{'eps': 0, 'objective/kl': 79.25686645507812, 'objective/entropy': 54.15998077392578, 'objective/non_score_reward': -3.962843179702759, 'objective/rlhf_reward': -3.134213924407959, 'objective/scores': 0.8286292552947998, 'policy/approxkl_avg': 0.10554217547178268, 'policy/clipfrac_avg': 0.19103774428367615, 'loss/policy_avg': -0.042725276201963425, 'loss/value_avg': 0.8702349066734314, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9431718587875366, 'val/ratio': 1.005744218826294, 'val/ratio_var': 7.604936399729922e-05, 'val/num_eos_tokens': 0, 'lr': 4.7942185203331705e-05, 'episode': 340, 'epoch': 0.04}



                                                                       
  4%|▍         | 86/2041 [07:33<2:48:17,  5.16s/it]

{'eps': 0, 'objective/kl': 76.21737670898438, 'objective/entropy': 62.66320037841797, 'objective/non_score_reward': -3.8108692169189453, 'objective/rlhf_reward': -4.218086242675781, 'objective/scores': -0.4072170853614807, 'policy/approxkl_avg': 0.1632044017314911, 'policy/clipfrac_avg': 0.21226416528224945, 'loss/policy_avg': -0.05155498534440994, 'loss/value_avg': 0.8941333293914795, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1208066940307617, 'val/ratio': 1.0085862874984741, 'val/ratio_var': 0.0006780294352211058, 'val/num_eos_tokens': 0, 'lr': 4.791768740813327e-05, 'episode': 344, 'epoch': 0.04}



                                                                       
  4%|▍         | 87/2041 [07:38<2:46:58,  5.13s/it]

{'eps': 0, 'objective/kl': 69.79386901855469, 'objective/entropy': 54.00169372558594, 'objective/non_score_reward': -3.489694118499756, 'objective/rlhf_reward': -3.909600257873535, 'objective/scores': -0.4199061989784241, 'policy/approxkl_avg': 0.24455435574054718, 'policy/clipfrac_avg': 0.2087264209985733, 'loss/policy_avg': -0.05153023079037666, 'loss/value_avg': 0.5681542158126831, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0176136493682861, 'val/ratio': 0.9954932928085327, 'val/ratio_var': 0.0002856990322470665, 'val/num_eos_tokens': 0, 'lr': 4.7893189612934834e-05, 'episode': 348, 'epoch': 0.04}



                                                                       
  4%|▍         | 88/2041 [07:43<2:47:16,  5.14s/it]

{'eps': 0, 'objective/kl': 59.918487548828125, 'objective/entropy': 36.55986404418945, 'objective/non_score_reward': -2.995924472808838, 'objective/rlhf_reward': -2.565277338027954, 'objective/scores': 0.430647075176239, 'policy/approxkl_avg': 0.2627747356891632, 'policy/clipfrac_avg': 0.14268867671489716, 'loss/policy_avg': -0.03761976212263107, 'loss/value_avg': 0.34067481756210327, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6770229935646057, 'val/ratio': 0.9373874664306641, 'val/ratio_var': 0.00217683264054358, 'val/num_eos_tokens': 0, 'lr': 4.78686918177364e-05, 'episode': 352, 'epoch': 0.04}



                                                                       
  4%|▍         | 89/2041 [07:49<2:47:50,  5.16s/it]

{'eps': 0, 'objective/kl': 66.99542236328125, 'objective/entropy': 45.1182746887207, 'objective/non_score_reward': -3.349771499633789, 'objective/rlhf_reward': -3.131230115890503, 'objective/scores': 0.21854141354560852, 'policy/approxkl_avg': 0.09539376199245453, 'policy/clipfrac_avg': 0.16391509771347046, 'loss/policy_avg': -0.04092966020107269, 'loss/value_avg': 0.3326362371444702, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8543586134910583, 'val/ratio': 0.9761916399002075, 'val/ratio_var': 0.00026814217562787235, 'val/num_eos_tokens': 0, 'lr': 4.784419402253798e-05, 'episode': 356, 'epoch': 0.04}



                                                                       
  4%|▍         | 90/2041 [07:54<2:47:05,  5.14s/it]

{'eps': 0, 'objective/kl': 72.07552337646484, 'objective/entropy': 49.67720031738281, 'objective/non_score_reward': -3.603776216506958, 'objective/rlhf_reward': -2.874520778656006, 'objective/scores': 0.7292553186416626, 'policy/approxkl_avg': 0.14640727639198303, 'policy/clipfrac_avg': 0.16391509771347046, 'loss/policy_avg': -0.03964872658252716, 'loss/value_avg': 0.42202499508857727, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8360159397125244, 'val/ratio': 0.9948205947875977, 'val/ratio_var': 8.319539483636618e-05, 'val/num_eos_tokens': 0, 'lr': 4.7819696227339545e-05, 'episode': 360, 'epoch': 0.04}



                                                                       
  4%|▍         | 91/2041 [07:59<2:47:12,  5.14s/it]

{'eps': 0, 'objective/kl': 81.37439727783203, 'objective/entropy': 27.940292358398438, 'objective/non_score_reward': -4.068719863891602, 'objective/rlhf_reward': -4.957019805908203, 'objective/scores': -0.8883000612258911, 'policy/approxkl_avg': 0.4098970592021942, 'policy/clipfrac_avg': 0.11556603759527206, 'loss/policy_avg': -0.03421002998948097, 'loss/value_avg': 1.368119716644287, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7235274314880371, 'val/ratio': 0.9746606349945068, 'val/ratio_var': 0.0006124331266619265, 'val/num_eos_tokens': 4, 'lr': 4.7795198432141106e-05, 'episode': 364, 'epoch': 0.04}



                                                                       
  5%|▍         | 92/2041 [08:04<2:47:54,  5.17s/it]

{'eps': 0, 'objective/kl': 76.95106506347656, 'objective/entropy': 45.23264694213867, 'objective/non_score_reward': -3.847553253173828, 'objective/rlhf_reward': -4.5914740562438965, 'objective/scores': -0.7439208030700684, 'policy/approxkl_avg': 0.16970708966255188, 'policy/clipfrac_avg': 0.21344338357448578, 'loss/policy_avg': -0.05388636514544487, 'loss/value_avg': 0.9331175088882446, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8115414381027222, 'val/ratio': 0.9964059591293335, 'val/ratio_var': 4.291307050152682e-05, 'val/num_eos_tokens': 0, 'lr': 4.777070063694268e-05, 'episode': 368, 'epoch': 0.05}



                                                                       
  5%|▍         | 93/2041 [08:09<2:47:38,  5.16s/it]

{'eps': 0, 'objective/kl': 41.537193298339844, 'objective/entropy': 11.489590644836426, 'objective/non_score_reward': -2.076859712600708, 'objective/rlhf_reward': -1.27438223361969, 'objective/scores': 0.8024774789810181, 'policy/approxkl_avg': 0.03969902917742729, 'policy/clipfrac_avg': 0.04599056392908096, 'loss/policy_avg': -0.013380185700953007, 'loss/value_avg': 0.34223657846450806, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.24713024497032166, 'val/ratio': 1.0552027225494385, 'val/ratio_var': 0.0018803519196808338, 'val/num_eos_tokens': 1, 'lr': 4.774620284174425e-05, 'episode': 372, 'epoch': 0.05}



                                                                       
  5%|▍         | 94/2041 [08:14<2:48:07,  5.18s/it]

{'eps': 0, 'objective/kl': 48.875999450683594, 'objective/entropy': 19.41901397705078, 'objective/non_score_reward': -2.4437999725341797, 'objective/rlhf_reward': -2.10728120803833, 'objective/scores': 0.33651888370513916, 'policy/approxkl_avg': 0.1250770539045334, 'policy/clipfrac_avg': 0.0695754736661911, 'loss/policy_avg': -0.030342066660523415, 'loss/value_avg': 0.15317174792289734, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.3327527940273285, 'val/ratio': 0.9759393930435181, 'val/ratio_var': 0.00031125289388000965, 'val/num_eos_tokens': 0, 'lr': 4.772170504654581e-05, 'episode': 376, 'epoch': 0.05}



                                                                       
  5%|▍         | 95/2041 [08:20<2:47:24,  5.16s/it]

{'eps': 0, 'objective/kl': 47.695556640625, 'objective/entropy': 23.978591918945312, 'objective/non_score_reward': -2.3847780227661133, 'objective/rlhf_reward': -1.6994352340698242, 'objective/scores': 0.6853427886962891, 'policy/approxkl_avg': 0.03234364464879036, 'policy/clipfrac_avg': 0.08962264657020569, 'loss/policy_avg': -0.0240374356508255, 'loss/value_avg': 0.40369632840156555, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.49732622504234314, 'val/ratio': 1.0100122690200806, 'val/ratio_var': 0.00016970880096778274, 'val/num_eos_tokens': 0, 'lr': 4.769720725134738e-05, 'episode': 380, 'epoch': 0.05}



                                                                       
  5%|▍         | 96/2041 [08:25<2:47:39,  5.17s/it]

{'eps': 0, 'objective/kl': 60.5171012878418, 'objective/entropy': 37.16967010498047, 'objective/non_score_reward': -3.025855302810669, 'objective/rlhf_reward': -2.8320322036743164, 'objective/scores': 0.19382309913635254, 'policy/approxkl_avg': 0.38698041439056396, 'policy/clipfrac_avg': 0.15094339847564697, 'loss/policy_avg': -0.04734063893556595, 'loss/value_avg': 0.35864022374153137, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7375198602676392, 'val/ratio': 0.9634539484977722, 'val/ratio_var': 0.0006565487128682435, 'val/num_eos_tokens': 0, 'lr': 4.767270945614895e-05, 'episode': 384, 'epoch': 0.05}



                                                                       
  5%|▍         | 97/2041 [08:30<2:47:03,  5.16s/it]

{'eps': 0, 'objective/kl': 41.131282806396484, 'objective/entropy': 17.44986915588379, 'objective/non_score_reward': -2.0565643310546875, 'objective/rlhf_reward': -2.390768051147461, 'objective/scores': -0.3342037796974182, 'policy/approxkl_avg': 0.04274909943342209, 'policy/clipfrac_avg': 0.0554245263338089, 'loss/policy_avg': -0.01467946544289589, 'loss/value_avg': 0.31119194626808167, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.3544326424598694, 'val/ratio': 0.9989496469497681, 'val/ratio_var': 7.670290506212041e-05, 'val/num_eos_tokens': 0, 'lr': 4.7648211660950514e-05, 'episode': 388, 'epoch': 0.05}



                                                                       
  5%|▍         | 98/2041 [08:35<2:46:38,  5.15s/it]

{'eps': 0, 'objective/kl': 55.14480209350586, 'objective/entropy': 24.58724594116211, 'objective/non_score_reward': -2.7572402954101562, 'objective/rlhf_reward': -2.7983977794647217, 'objective/scores': -0.0411575585603714, 'policy/approxkl_avg': 0.03979601338505745, 'policy/clipfrac_avg': 0.10023585706949234, 'loss/policy_avg': -0.024586528539657593, 'loss/value_avg': 0.548871636390686, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5278134346008301, 'val/ratio': 0.9705280661582947, 'val/ratio_var': 0.0005973902880214155, 'val/num_eos_tokens': 8, 'lr': 4.762371386575208e-05, 'episode': 392, 'epoch': 0.05}



                                                                       
  5%|▍         | 99/2041 [08:40<2:46:40,  5.15s/it]

{'eps': 0, 'objective/kl': 52.466217041015625, 'objective/entropy': 14.421539306640625, 'objective/non_score_reward': -2.6233110427856445, 'objective/rlhf_reward': -2.515563726425171, 'objective/scores': 0.10774734616279602, 'policy/approxkl_avg': 0.04710417240858078, 'policy/clipfrac_avg': 0.04599056765437126, 'loss/policy_avg': -0.023267515003681183, 'loss/value_avg': 0.15245255827903748, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.24408678710460663, 'val/ratio': 1.0119376182556152, 'val/ratio_var': 0.0005749693955294788, 'val/num_eos_tokens': 0, 'lr': 4.759921607055366e-05, 'episode': 396, 'epoch': 0.05}



                                                                       
  5%|▍         | 100/2041 [08:45<2:46:37,  5.15s/it]

{'eps': 0, 'objective/kl': 62.785987854003906, 'objective/entropy': 18.95786476135254, 'objective/non_score_reward': -3.1392996311187744, 'objective/rlhf_reward': -1.991227149963379, 'objective/scores': 1.1480724811553955, 'policy/approxkl_avg': 0.09407778084278107, 'policy/clipfrac_avg': 0.06839622557163239, 'loss/policy_avg': -0.027509108185768127, 'loss/value_avg': 0.24483227729797363, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.35324883460998535, 'val/ratio': 0.9570184350013733, 'val/ratio_var': 0.001026769750751555, 'val/num_eos_tokens': 7, 'lr': 4.757471827535522e-05, 'episode': 400, 'epoch': 0.05}



                                                                       
  5%|▍         | 101/2041 [08:50<2:46:45,  5.16s/it]

{'eps': 0, 'objective/kl': 60.850643157958984, 'objective/entropy': 4.966663360595703, 'objective/non_score_reward': -3.042532444000244, 'objective/rlhf_reward': -2.208730697631836, 'objective/scores': 0.833801805973053, 'policy/approxkl_avg': 0.061283182352781296, 'policy/clipfrac_avg': 0.03537736088037491, 'loss/policy_avg': -0.02313818968832493, 'loss/value_avg': 0.12758608162403107, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.136228546500206, 'val/ratio': 0.9973788261413574, 'val/ratio_var': 4.7409521357622e-06, 'val/num_eos_tokens': 0, 'lr': 4.7550220480156786e-05, 'episode': 404, 'epoch': 0.05}



                                                                       
  5%|▍         | 102/2041 [08:56<2:46:09,  5.14s/it]

{'eps': 0, 'objective/kl': 69.68729400634766, 'objective/entropy': 7.3462347984313965, 'objective/non_score_reward': -3.4843647480010986, 'objective/rlhf_reward': -2.5948023796081543, 'objective/scores': 0.8895623087882996, 'policy/approxkl_avg': 0.1223592609167099, 'policy/clipfrac_avg': 0.036556605249643326, 'loss/policy_avg': -0.021622344851493835, 'loss/value_avg': 0.3291398882865906, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1692841351032257, 'val/ratio': 0.9816490411758423, 'val/ratio_var': 0.00016958305786829442, 'val/num_eos_tokens': 0, 'lr': 4.7525722684958354e-05, 'episode': 408, 'epoch': 0.05}



                                                                       
  5%|▌         | 103/2041 [09:01<2:46:37,  5.16s/it]

{'eps': 0, 'objective/kl': 70.75115966796875, 'objective/entropy': 16.94525146484375, 'objective/non_score_reward': -3.537558078765869, 'objective/rlhf_reward': -3.1257095336914062, 'objective/scores': 0.41184860467910767, 'policy/approxkl_avg': 0.07467307895421982, 'policy/clipfrac_avg': 0.06839622557163239, 'loss/policy_avg': -0.031616248190402985, 'loss/value_avg': 0.2747485935688019, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.3538021445274353, 'val/ratio': 0.9744051098823547, 'val/ratio_var': 0.000317037949571386, 'val/num_eos_tokens': 0, 'lr': 4.750122488975993e-05, 'episode': 412, 'epoch': 0.05}



                                                                       
  5%|▌         | 104/2041 [09:06<2:45:18,  5.12s/it]

{'eps': 0, 'objective/kl': 80.468017578125, 'objective/entropy': 13.991996765136719, 'objective/non_score_reward': -4.023401260375977, 'objective/rlhf_reward': -3.210068702697754, 'objective/scores': 0.8133325576782227, 'policy/approxkl_avg': 0.04238349199295044, 'policy/clipfrac_avg': 0.056603770703077316, 'loss/policy_avg': -0.028975848108530045, 'loss/value_avg': 0.13163898885250092, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.21300578117370605, 'val/ratio': 0.9671048521995544, 'val/ratio_var': 0.0006555583677254617, 'val/num_eos_tokens': 0, 'lr': 4.747672709456149e-05, 'episode': 416, 'epoch': 0.05}



                                                                       
  5%|▌         | 105/2041 [09:11<2:46:45,  5.17s/it]

{'eps': 0, 'objective/kl': 63.73229217529297, 'objective/entropy': 24.723119735717773, 'objective/non_score_reward': -3.186614513397217, 'objective/rlhf_reward': -2.245335578918457, 'objective/scores': 0.9412789344787598, 'policy/approxkl_avg': 0.09888848662376404, 'policy/clipfrac_avg': 0.08962263911962509, 'loss/policy_avg': -0.03931362181901932, 'loss/value_avg': 0.1553332656621933, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.379061222076416, 'val/ratio': 0.9800377488136292, 'val/ratio_var': 0.0001880223717307672, 'val/num_eos_tokens': 0, 'lr': 4.745222929936306e-05, 'episode': 420, 'epoch': 0.05}



                                                                       
  5%|▌         | 106/2041 [09:16<2:47:40,  5.20s/it]

{'eps': 0, 'objective/kl': 52.73857116699219, 'objective/entropy': 16.988351821899414, 'objective/non_score_reward': -2.6369285583496094, 'objective/rlhf_reward': -1.6796047687530518, 'objective/scores': 0.9573237895965576, 'policy/approxkl_avg': 0.19056269526481628, 'policy/clipfrac_avg': 0.07429245114326477, 'loss/policy_avg': -0.026012979447841644, 'loss/value_avg': 0.20263385772705078, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.26654958724975586, 'val/ratio': 1.006009578704834, 'val/ratio_var': 8.707703818799928e-05, 'val/num_eos_tokens': 2, 'lr': 4.742773150416463e-05, 'episode': 424, 'epoch': 0.05}



                                                                       
  5%|▌         | 107/2041 [09:22<2:48:12,  5.22s/it]

{'eps': 0, 'objective/kl': 76.3641586303711, 'objective/entropy': 8.259368896484375, 'objective/non_score_reward': -3.8182079792022705, 'objective/rlhf_reward': -3.387268304824829, 'objective/scores': 0.43093961477279663, 'policy/approxkl_avg': 0.07435540854930878, 'policy/clipfrac_avg': 0.03891509398818016, 'loss/policy_avg': -0.023929111659526825, 'loss/value_avg': 0.3038126230239868, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.19202491641044617, 'val/ratio': 0.9845664501190186, 'val/ratio_var': 0.00011242142500123009, 'val/num_eos_tokens': 0, 'lr': 4.7403233708966195e-05, 'episode': 428, 'epoch': 0.05}



                                                                       
  5%|▌         | 108/2041 [09:27<2:47:39,  5.20s/it]

{'eps': 0, 'objective/kl': 69.3320541381836, 'objective/entropy': 15.107820510864258, 'objective/non_score_reward': -3.4666030406951904, 'objective/rlhf_reward': -2.378862142562866, 'objective/scores': 1.0877408981323242, 'policy/approxkl_avg': 0.21495865285396576, 'policy/clipfrac_avg': 0.05424528196454048, 'loss/policy_avg': -0.029180636629462242, 'loss/value_avg': 0.07587762176990509, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2022913694381714, 'val/ratio': 0.9691905975341797, 'val/ratio_var': 0.0004669938643928617, 'val/num_eos_tokens': 0, 'lr': 4.737873591376776e-05, 'episode': 432, 'epoch': 0.05}



                                                                       
  5%|▌         | 109/2041 [09:32<2:47:35,  5.20s/it]

{'eps': 0, 'objective/kl': 63.66749572753906, 'objective/entropy': 8.694662094116211, 'objective/non_score_reward': -3.1833748817443848, 'objective/rlhf_reward': -2.0071306228637695, 'objective/scores': 1.1762442588806152, 'policy/approxkl_avg': 0.016744019463658333, 'policy/clipfrac_avg': 0.03066037967801094, 'loss/policy_avg': -0.0193372443318367, 'loss/value_avg': 0.11489836871623993, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14982420206069946, 'val/ratio': 0.9870056509971619, 'val/ratio_var': 0.00010038415348390117, 'val/num_eos_tokens': 0, 'lr': 4.735423811856933e-05, 'episode': 436, 'epoch': 0.05}



                                                                       
  5%|▌         | 110/2041 [09:37<2:47:30,  5.20s/it]

{'eps': 0, 'objective/kl': 54.96989059448242, 'objective/entropy': 5.745940208435059, 'objective/non_score_reward': -2.7484946250915527, 'objective/rlhf_reward': -1.2823166847229004, 'objective/scores': 1.4661779403686523, 'policy/approxkl_avg': 0.016568388789892197, 'policy/clipfrac_avg': 0.028301887214183807, 'loss/policy_avg': -0.018926316872239113, 'loss/value_avg': 0.05789879709482193, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14078104496002197, 'val/ratio': 0.9908748269081116, 'val/ratio_var': 4.5109994971426204e-05, 'val/num_eos_tokens': 0, 'lr': 4.73297403233709e-05, 'episode': 440, 'epoch': 0.05}



                                                                       
  5%|▌         | 111/2041 [09:42<2:46:38,  5.18s/it]

{'eps': 0, 'objective/kl': 65.96511840820312, 'objective/entropy': 5.057773113250732, 'objective/non_score_reward': -3.2982561588287354, 'objective/rlhf_reward': -2.9243721961975098, 'objective/scores': 0.37388384342193604, 'policy/approxkl_avg': 0.0070478301495313644, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.02282942272722721, 'loss/value_avg': 0.1294570416212082, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.10793711245059967, 'val/ratio': 0.9880008697509766, 'val/ratio_var': 9.658943599788472e-05, 'val/num_eos_tokens': 0, 'lr': 4.730524252817247e-05, 'episode': 444, 'epoch': 0.05}



                                                                       
  5%|▌         | 112/2041 [09:47<2:46:13,  5.17s/it]

{'eps': 0, 'objective/kl': 52.301788330078125, 'objective/entropy': 3.8953299522399902, 'objective/non_score_reward': -2.6150894165039062, 'objective/rlhf_reward': -1.4337540864944458, 'objective/scores': 1.1813353300094604, 'policy/approxkl_avg': 0.0047927312552928925, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.01667550392448902, 'loss/value_avg': 0.057014480233192444, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14771997928619385, 'val/ratio': 0.9926108717918396, 'val/ratio_var': 5.059561226516962e-05, 'val/num_eos_tokens': 0, 'lr': 4.7280744732974035e-05, 'episode': 448, 'epoch': 0.05}



                                                                       
  6%|▌         | 113/2041 [09:53<2:46:40,  5.19s/it]

{'eps': 0, 'objective/kl': 62.37479782104492, 'objective/entropy': 29.976884841918945, 'objective/non_score_reward': -3.1187398433685303, 'objective/rlhf_reward': -2.3492746353149414, 'objective/scores': 0.7694653272628784, 'policy/approxkl_avg': 0.028089504688978195, 'policy/clipfrac_avg': 0.06367924809455872, 'loss/policy_avg': -0.02930145151913166, 'loss/value_avg': 0.26148051023483276, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.4809020161628723, 'val/ratio': 0.9955306649208069, 'val/ratio_var': 1.267128754989244e-05, 'val/num_eos_tokens': 0, 'lr': 4.72562469377756e-05, 'episode': 452, 'epoch': 0.06}



                                                                       
  6%|▌         | 114/2041 [09:58<2:46:28,  5.18s/it]

{'eps': 0, 'objective/kl': 70.4024658203125, 'objective/entropy': 33.898353576660156, 'objective/non_score_reward': -3.5201234817504883, 'objective/rlhf_reward': -3.1900217533111572, 'objective/scores': 0.33010172843933105, 'policy/approxkl_avg': 0.0504266656935215, 'policy/clipfrac_avg': 0.09316037595272064, 'loss/policy_avg': -0.03229658678174019, 'loss/value_avg': 0.3709873855113983, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5510852932929993, 'val/ratio': 0.96072918176651, 'val/ratio_var': 0.0009267393033951521, 'val/num_eos_tokens': 0, 'lr': 4.723174914257717e-05, 'episode': 456, 'epoch': 0.06}



                                                                       
  6%|▌         | 115/2041 [10:03<2:46:10,  5.18s/it]

{'eps': 0, 'objective/kl': 77.5649642944336, 'objective/entropy': 12.201763153076172, 'objective/non_score_reward': -3.878248453140259, 'objective/rlhf_reward': -2.859165668487549, 'objective/scores': 1.01908278465271, 'policy/approxkl_avg': 0.02237599715590477, 'policy/clipfrac_avg': 0.05778301879763603, 'loss/policy_avg': -0.02704457938671112, 'loss/value_avg': 0.2650027871131897, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2598578631877899, 'val/ratio': 0.988520085811615, 'val/ratio_var': 8.32902078400366e-05, 'val/num_eos_tokens': 0, 'lr': 4.720725134737874e-05, 'episode': 460, 'epoch': 0.06}



                                                                       
  6%|▌         | 116/2041 [10:08<2:46:05,  5.18s/it]

{'eps': 0, 'objective/kl': 83.74691772460938, 'objective/entropy': 28.69075584411621, 'objective/non_score_reward': -4.187345504760742, 'objective/rlhf_reward': -3.614043712615967, 'objective/scores': 0.5733019113540649, 'policy/approxkl_avg': 0.049355264753103256, 'policy/clipfrac_avg': 0.07783018797636032, 'loss/policy_avg': -0.030112311244010925, 'loss/value_avg': 0.5710827112197876, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5640667676925659, 'val/ratio': 0.9751131534576416, 'val/ratio_var': 0.0004895672318525612, 'val/num_eos_tokens': 0, 'lr': 4.718275355218031e-05, 'episode': 464, 'epoch': 0.06}



                                                                       
  6%|▌         | 117/2041 [10:13<2:45:57,  5.18s/it]

{'eps': 0, 'objective/kl': 61.651283264160156, 'objective/entropy': 28.68031883239746, 'objective/non_score_reward': -3.082564353942871, 'objective/rlhf_reward': -2.5647025108337402, 'objective/scores': 0.5178618431091309, 'policy/approxkl_avg': 0.34897997975349426, 'policy/clipfrac_avg': 0.10259433835744858, 'loss/policy_avg': -0.037436410784721375, 'loss/value_avg': 0.21999584138393402, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5939414501190186, 'val/ratio': 0.9690619707107544, 'val/ratio_var': 0.00046399631537497044, 'val/num_eos_tokens': 0, 'lr': 4.7158255756981875e-05, 'episode': 468, 'epoch': 0.06}



                                                                       
  6%|▌         | 118/2041 [10:19<2:45:31,  5.16s/it]

{'eps': 0, 'objective/kl': 68.88630676269531, 'objective/entropy': 17.39748764038086, 'objective/non_score_reward': -3.444315195083618, 'objective/rlhf_reward': -2.583575963973999, 'objective/scores': 0.8607392907142639, 'policy/approxkl_avg': 0.042973365634679794, 'policy/clipfrac_avg': 0.044811319559812546, 'loss/policy_avg': -0.023756805807352066, 'loss/value_avg': 0.2131277173757553, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.37834393978118896, 'val/ratio': 0.9824448823928833, 'val/ratio_var': 0.00020370357378851622, 'val/num_eos_tokens': 0, 'lr': 4.713375796178344e-05, 'episode': 472, 'epoch': 0.06}



                                                                       
  6%|▌         | 119/2041 [10:24<2:46:19,  5.19s/it]

{'eps': 0, 'objective/kl': 66.85211944580078, 'objective/entropy': 24.080001831054688, 'objective/non_score_reward': -3.34260630607605, 'objective/rlhf_reward': -1.6631431579589844, 'objective/scores': 1.6794631481170654, 'policy/approxkl_avg': 0.04508739709854126, 'policy/clipfrac_avg': 0.07547169923782349, 'loss/policy_avg': -0.03502102196216583, 'loss/value_avg': 0.0903504341840744, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.4001152515411377, 'val/ratio': 1.0148438215255737, 'val/ratio_var': 0.0002959691046271473, 'val/num_eos_tokens': 0, 'lr': 4.710926016658501e-05, 'episode': 476, 'epoch': 0.06}



                                                                       
  6%|▌         | 120/2041 [10:29<2:46:30,  5.20s/it]

{'eps': 0, 'objective/kl': 62.23118591308594, 'objective/entropy': 18.642412185668945, 'objective/non_score_reward': -3.1115593910217285, 'objective/rlhf_reward': -2.2213196754455566, 'objective/scores': 0.8902395963668823, 'policy/approxkl_avg': 0.04667031392455101, 'policy/clipfrac_avg': 0.053066037595272064, 'loss/policy_avg': -0.021126199513673782, 'loss/value_avg': 0.2196589708328247, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.31038498878479004, 'val/ratio': 0.9849561452865601, 'val/ratio_var': 0.00012541039905045182, 'val/num_eos_tokens': 0, 'lr': 4.708476237138658e-05, 'episode': 480, 'epoch': 0.06}



                                                                       
  6%|▌         | 121/2041 [10:34<2:46:39,  5.21s/it]

{'eps': 0, 'objective/kl': 72.20408630371094, 'objective/entropy': 31.884883880615234, 'objective/non_score_reward': -3.6102042198181152, 'objective/rlhf_reward': -3.021136999130249, 'objective/scores': 0.5890671610832214, 'policy/approxkl_avg': 0.04129252955317497, 'policy/clipfrac_avg': 0.11202830076217651, 'loss/policy_avg': -0.02648916468024254, 'loss/value_avg': 0.4649210572242737, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5320061445236206, 'val/ratio': 0.9597780704498291, 'val/ratio_var': 0.001004321500658989, 'val/num_eos_tokens': 6, 'lr': 4.706026457618815e-05, 'episode': 484, 'epoch': 0.06}



                                                                       
  6%|▌         | 122/2041 [10:39<2:46:49,  5.22s/it]

{'eps': 0, 'objective/kl': 63.92002868652344, 'objective/entropy': 17.623559951782227, 'objective/non_score_reward': -3.1960012912750244, 'objective/rlhf_reward': -1.1052753925323486, 'objective/scores': 2.090725898742676, 'policy/approxkl_avg': 0.1998797059059143, 'policy/clipfrac_avg': 0.0695754736661911, 'loss/policy_avg': -0.02565048448741436, 'loss/value_avg': 0.07481180876493454, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.3101173937320709, 'val/ratio': 1.0239026546478271, 'val/ratio_var': 0.0005733894067816436, 'val/num_eos_tokens': 0, 'lr': 4.7035766780989715e-05, 'episode': 488, 'epoch': 0.06}



                                                                       
  6%|▌         | 123/2041 [10:45<2:46:34,  5.21s/it]

{'eps': 0, 'objective/kl': 68.91820526123047, 'objective/entropy': 16.778762817382812, 'objective/non_score_reward': -3.4459102153778076, 'objective/rlhf_reward': -1.0134742259979248, 'objective/scores': 2.432435989379883, 'policy/approxkl_avg': 0.11285420507192612, 'policy/clipfrac_avg': 0.0695754736661911, 'loss/policy_avg': -0.02665477618575096, 'loss/value_avg': 0.1541012078523636, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.23643183708190918, 'val/ratio': 0.9796019792556763, 'val/ratio_var': 0.00020266247156541795, 'val/num_eos_tokens': 0, 'lr': 4.701126898579128e-05, 'episode': 492, 'epoch': 0.06}



                                                                       
  6%|▌         | 124/2041 [10:50<2:46:26,  5.21s/it]

{'eps': 0, 'objective/kl': 36.257484436035156, 'objective/entropy': 3.7603063583374023, 'objective/non_score_reward': -1.8128743171691895, 'objective/rlhf_reward': 0.7646536827087402, 'objective/scores': 2.5775279998779297, 'policy/approxkl_avg': 0.008137764409184456, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.010242093354463577, 'loss/value_avg': 0.05373750627040863, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.06531292200088501, 'val/ratio': 1.0251497030258179, 'val/ratio_var': 0.0006530850660055876, 'val/num_eos_tokens': 0, 'lr': 4.698677119059285e-05, 'episode': 496, 'epoch': 0.06}



                                                                       
  6%|▌         | 125/2041 [10:55<2:46:04,  5.20s/it]

{'eps': 0, 'objective/kl': 34.30222702026367, 'objective/entropy': 14.100726127624512, 'objective/non_score_reward': -1.715111255645752, 'objective/rlhf_reward': 0.5165276527404785, 'objective/scores': 2.2316389083862305, 'policy/approxkl_avg': 0.01754278875887394, 'policy/clipfrac_avg': 0.06603773683309555, 'loss/policy_avg': -0.01622607558965683, 'loss/value_avg': 0.10573592036962509, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16726316511631012, 'val/ratio': 0.9999720454216003, 'val/ratio_var': 3.28351634379942e-05, 'val/num_eos_tokens': 0, 'lr': 4.696227339539442e-05, 'episode': 500, 'epoch': 0.06}



                                                                       
  6%|▌         | 126/2041 [11:00<2:44:59,  5.17s/it]

{'eps': 0, 'objective/kl': 54.00774002075195, 'objective/entropy': 27.057767868041992, 'objective/non_score_reward': -2.7003872394561768, 'objective/rlhf_reward': -0.8745336532592773, 'objective/scores': 1.8258535861968994, 'policy/approxkl_avg': 0.09788662940263748, 'policy/clipfrac_avg': 0.14033018052577972, 'loss/policy_avg': -0.04496348276734352, 'loss/value_avg': 0.49870380759239197, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5357320308685303, 'val/ratio': 0.9493380188941956, 'val/ratio_var': 0.001847297535277903, 'val/num_eos_tokens': 0, 'lr': 4.693777560019598e-05, 'episode': 504, 'epoch': 0.06}



                                                                       
  6%|▌         | 127/2041 [11:05<2:44:56,  5.17s/it]

{'eps': 0, 'objective/kl': 51.866844177246094, 'objective/entropy': 25.326597213745117, 'objective/non_score_reward': -2.5933423042297363, 'objective/rlhf_reward': -0.36502718925476074, 'objective/scores': 2.2283151149749756, 'policy/approxkl_avg': 0.20130544900894165, 'policy/clipfrac_avg': 0.12617924809455872, 'loss/policy_avg': -0.03818611055612564, 'loss/value_avg': 0.19975149631500244, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.48369503021240234, 'val/ratio': 1.0738720893859863, 'val/ratio_var': 0.006269196048378944, 'val/num_eos_tokens': 0, 'lr': 4.691327780499755e-05, 'episode': 508, 'epoch': 0.06}



                                                                       
  6%|▋         | 128/2041 [11:10<2:44:46,  5.17s/it]

{'eps': 0, 'objective/kl': 32.490501403808594, 'objective/entropy': 7.306728363037109, 'objective/non_score_reward': -1.6245250701904297, 'objective/rlhf_reward': 0.8859217166900635, 'objective/scores': 2.510446786880493, 'policy/approxkl_avg': 0.004250778816640377, 'policy/clipfrac_avg': 0.016509434208273888, 'loss/policy_avg': -0.004790506325662136, 'loss/value_avg': 0.06663482636213303, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.08610573410987854, 'val/ratio': 1.0139503479003906, 'val/ratio_var': 0.00019157973292749375, 'val/num_eos_tokens': 0, 'lr': 4.6888780009799124e-05, 'episode': 512, 'epoch': 0.06}



                                                                       
  6%|▋         | 129/2041 [11:16<2:44:35,  5.17s/it]

{'eps': 0, 'objective/kl': 38.19718551635742, 'objective/entropy': 4.129086017608643, 'objective/non_score_reward': -1.909859299659729, 'objective/rlhf_reward': 0.4684208631515503, 'objective/scores': 2.3782801628112793, 'policy/approxkl_avg': 0.016588177531957626, 'policy/clipfrac_avg': 0.024764150381088257, 'loss/policy_avg': -0.001845511607825756, 'loss/value_avg': 0.04148147627711296, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.047799427062273026, 'val/ratio': 1.0118141174316406, 'val/ratio_var': 0.00010471193672856316, 'val/num_eos_tokens': 0, 'lr': 4.6864282214600685e-05, 'episode': 516, 'epoch': 0.06}



                                                                       
  6%|▋         | 130/2041 [11:21<2:45:34,  5.20s/it]

{'eps': 0, 'objective/kl': 36.65478515625, 'objective/entropy': 4.461152076721191, 'objective/non_score_reward': -1.8327393531799316, 'objective/rlhf_reward': 0.7252933979034424, 'objective/scores': 2.558032751083374, 'policy/approxkl_avg': 0.0017874781042337418, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.005459376610815525, 'loss/value_avg': 0.057063989341259, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0588875487446785, 'val/ratio': 1.0063475370407104, 'val/ratio_var': 4.035948222735897e-05, 'val/num_eos_tokens': 0, 'lr': 4.683978441940225e-05, 'episode': 520, 'epoch': 0.06}



                                                                       
  6%|▋         | 131/2041 [11:26<2:46:04,  5.22s/it]

{'eps': 0, 'objective/kl': 46.278778076171875, 'objective/entropy': 22.786073684692383, 'objective/non_score_reward': -2.313939094543457, 'objective/rlhf_reward': 0.03417062759399414, 'objective/scores': 2.348109722137451, 'policy/approxkl_avg': 0.3675090968608856, 'policy/clipfrac_avg': 0.10613206773996353, 'loss/policy_avg': -0.03471992164850235, 'loss/value_avg': 0.16672906279563904, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.3036479949951172, 'val/ratio': 1.1172497272491455, 'val/ratio_var': 0.021422849968075752, 'val/num_eos_tokens': 0, 'lr': 4.681528662420383e-05, 'episode': 524, 'epoch': 0.06}



                                                                       
  6%|▋         | 132/2041 [11:31<2:46:59,  5.25s/it]

{'eps': 0, 'objective/kl': 34.01629638671875, 'objective/entropy': 0.3462371826171875, 'objective/non_score_reward': -1.7008147239685059, 'objective/rlhf_reward': 0.5593686103820801, 'objective/scores': 2.260183334350586, 'policy/approxkl_avg': 7.680195267312229e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0011532813077792525, 'loss/value_avg': 0.0445483922958374, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.02853633649647236, 'val/ratio': 0.9986299276351929, 'val/ratio_var': 2.221137947344687e-06, 'val/num_eos_tokens': 0, 'lr': 4.6790788829005396e-05, 'episode': 528, 'epoch': 0.06}



                                                                       
  7%|▋         | 133/2041 [11:37<2:46:49,  5.25s/it]

{'eps': 0, 'objective/kl': 34.71833419799805, 'objective/entropy': 2.7019429206848145, 'objective/non_score_reward': -1.7359167337417603, 'objective/rlhf_reward': 0.698459267616272, 'objective/scores': 2.4343760013580322, 'policy/approxkl_avg': 0.005693670362234116, 'policy/clipfrac_avg': 0.022405661642551422, 'loss/policy_avg': -0.006219690665602684, 'loss/value_avg': 0.05335831642150879, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.05011255294084549, 'val/ratio': 1.0026893615722656, 'val/ratio_var': 3.73143393517239e-06, 'val/num_eos_tokens': 0, 'lr': 4.676629103380696e-05, 'episode': 532, 'epoch': 0.07}



                                                                       
  7%|▋         | 134/2041 [11:42<2:46:06,  5.23s/it]

{'eps': 0, 'objective/kl': 47.29774475097656, 'objective/entropy': 2.892604351043701, 'objective/non_score_reward': -2.364887237548828, 'objective/rlhf_reward': 0.18283414840698242, 'objective/scores': 2.5477213859558105, 'policy/approxkl_avg': 0.003413249971345067, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.011806970462203026, 'loss/value_avg': 0.06716051697731018, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.07532524317502975, 'val/ratio': 0.9967082142829895, 'val/ratio_var': 6.450874934671447e-06, 'val/num_eos_tokens': 0, 'lr': 4.6741793238608525e-05, 'episode': 536, 'epoch': 0.07}



                                                                       
  7%|▋         | 135/2041 [11:47<2:45:11,  5.20s/it]

{'eps': 0, 'objective/kl': 44.69086456298828, 'objective/entropy': 4.735926628112793, 'objective/non_score_reward': -2.2345433235168457, 'objective/rlhf_reward': 0.515162467956543, 'objective/scores': 2.7497057914733887, 'policy/approxkl_avg': 0.010883343406021595, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.010067563503980637, 'loss/value_avg': 0.07585656642913818, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11580727249383926, 'val/ratio': 1.0113844871520996, 'val/ratio_var': 0.00016027588571887463, 'val/num_eos_tokens': 0, 'lr': 4.67172954434101e-05, 'episode': 540, 'epoch': 0.07}



                                                                       
  7%|▋         | 136/2041 [11:52<2:44:29,  5.18s/it]

{'eps': 0, 'objective/kl': 44.03172302246094, 'objective/entropy': 9.022071838378906, 'objective/non_score_reward': -2.2015862464904785, 'objective/rlhf_reward': 0.893878698348999, 'objective/scores': 3.0954649448394775, 'policy/approxkl_avg': 0.006150944158434868, 'policy/clipfrac_avg': 0.022405661642551422, 'loss/policy_avg': -0.005350528750568628, 'loss/value_avg': 0.0983918234705925, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16126129031181335, 'val/ratio': 1.0003747940063477, 'val/ratio_var': 2.6949908715323545e-05, 'val/num_eos_tokens': 0, 'lr': 4.669279764821166e-05, 'episode': 544, 'epoch': 0.07}



                                                                       
  7%|▋         | 137/2041 [11:57<2:43:25,  5.15s/it]

{'eps': 0, 'objective/kl': 47.0388298034668, 'objective/entropy': 9.312398910522461, 'objective/non_score_reward': -2.3519413471221924, 'objective/rlhf_reward': 0.9541230201721191, 'objective/scores': 3.3060643672943115, 'policy/approxkl_avg': 0.13916635513305664, 'policy/clipfrac_avg': 0.0554245263338089, 'loss/policy_avg': -0.012022924609482288, 'loss/value_avg': 0.10801981389522552, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.18401911854743958, 'val/ratio': 1.0016772747039795, 'val/ratio_var': 6.769475294277072e-05, 'val/num_eos_tokens': 0, 'lr': 4.666829985301323e-05, 'episode': 548, 'epoch': 0.07}



                                                                       
  7%|▋         | 138/2041 [12:02<2:42:38,  5.13s/it]

{'eps': 0, 'objective/kl': 44.21342086791992, 'objective/entropy': 10.563777923583984, 'objective/non_score_reward': -2.2106711864471436, 'objective/rlhf_reward': 1.247408390045166, 'objective/scores': 3.4580795764923096, 'policy/approxkl_avg': 0.012876483611762524, 'policy/clipfrac_avg': 0.04716981202363968, 'loss/policy_avg': -0.0038900752551853657, 'loss/value_avg': 0.11547913402318954, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14363044500350952, 'val/ratio': 1.0102903842926025, 'val/ratio_var': 6.32140800007619e-05, 'val/num_eos_tokens': 0, 'lr': 4.66438020578148e-05, 'episode': 552, 'epoch': 0.07}



                                                                       
  7%|▋         | 139/2041 [12:07<2:42:32,  5.13s/it]

{'eps': 0, 'objective/kl': 40.860511779785156, 'objective/entropy': 6.869609355926514, 'objective/non_score_reward': -2.043025493621826, 'objective/rlhf_reward': 1.1889595985412598, 'objective/scores': 3.231985092163086, 'policy/approxkl_avg': 0.4646472930908203, 'policy/clipfrac_avg': 0.07547169923782349, 'loss/policy_avg': -0.004723591264337301, 'loss/value_avg': 0.09231244027614594, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15401719510555267, 'val/ratio': 0.9927468299865723, 'val/ratio_var': 0.00030203076312318444, 'val/num_eos_tokens': 0, 'lr': 4.6619304262616365e-05, 'episode': 556, 'epoch': 0.07}



                                                                       
  7%|▋         | 140/2041 [12:13<2:42:22,  5.12s/it]

{'eps': 0, 'objective/kl': 33.0267333984375, 'objective/entropy': 0.07184410095214844, 'objective/non_score_reward': -1.651336669921875, 'objective/rlhf_reward': -0.20954358577728271, 'objective/scores': 1.4417930841445923, 'policy/approxkl_avg': 1.8531416571931913e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.002381835365667939, 'loss/value_avg': 0.03217408433556557, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0064590563997626305, 'val/ratio': 0.9995406270027161, 'val/ratio_var': 3.669050840926502e-07, 'val/num_eos_tokens': 0, 'lr': 4.659480646741793e-05, 'episode': 560, 'epoch': 0.07}



                                                                       
  7%|▋         | 141/2041 [12:18<2:43:03,  5.15s/it]

{'eps': 0, 'objective/kl': 30.50657844543457, 'objective/entropy': 0.14626407623291016, 'objective/non_score_reward': -1.5253289937973022, 'objective/rlhf_reward': 0.03708302974700928, 'objective/scores': 1.5624120235443115, 'policy/approxkl_avg': 0.007973261177539825, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.008869433775544167, 'loss/value_avg': 0.016796529293060303, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.011650431901216507, 'val/ratio': 0.992845892906189, 'val/ratio_var': 4.794002961716615e-05, 'val/num_eos_tokens': 0, 'lr': 4.65703086722195e-05, 'episode': 564, 'epoch': 0.07}



                                                                       
  7%|▋         | 142/2041 [12:23<2:43:06,  5.15s/it]

{'eps': 0, 'objective/kl': 35.020408630371094, 'objective/entropy': 0.31120872497558594, 'objective/non_score_reward': -1.7510206699371338, 'objective/rlhf_reward': 0.8852837085723877, 'objective/scores': 2.6363043785095215, 'policy/approxkl_avg': 2.654803392942995e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 0.0006222269730642438, 'loss/value_avg': 0.04590967670083046, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.025682277977466583, 'val/ratio': 1.001490592956543, 'val/ratio_var': 1.2169991805421887e-06, 'val/num_eos_tokens': 0, 'lr': 4.654581087702107e-05, 'episode': 568, 'epoch': 0.07}



                                                                       
  7%|▋         | 143/2041 [12:28<2:43:32,  5.17s/it]

{'eps': 0, 'objective/kl': 34.9937744140625, 'objective/entropy': 0.18912744522094727, 'objective/non_score_reward': -1.7496888637542725, 'objective/rlhf_reward': 0.8183763027191162, 'objective/scores': 2.5680651664733887, 'policy/approxkl_avg': 2.3168995255673508e-07, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 0.0003472538373898715, 'loss/value_avg': 0.05065557360649109, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.02115253359079361, 'val/ratio': 1.0001981258392334, 'val/ratio_var': 2.0409748557881358e-08, 'val/num_eos_tokens': 0, 'lr': 4.652131308182264e-05, 'episode': 572, 'epoch': 0.07}



                                                                       
  7%|▋         | 144/2041 [12:33<2:43:14,  5.16s/it]

{'eps': 0, 'objective/kl': 35.369728088378906, 'objective/entropy': 0.22102880477905273, 'objective/non_score_reward': -1.7684863805770874, 'objective/rlhf_reward': 0.9669207334518433, 'objective/scores': 2.7354071140289307, 'policy/approxkl_avg': 1.4221193112007313e-07, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.00018596425070427358, 'loss/value_avg': 0.03482211381196976, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.02518756501376629, 'val/ratio': 0.9999078512191772, 'val/ratio_var': 6.548038999909522e-09, 'val/num_eos_tokens': 0, 'lr': 4.6496815286624206e-05, 'episode': 576, 'epoch': 0.07}



                                                                       
  7%|▋         | 145/2041 [12:39<2:44:02,  5.19s/it]

{'eps': 0, 'objective/kl': 35.15892028808594, 'objective/entropy': 3.0998501777648926, 'objective/non_score_reward': -1.757946252822876, 'objective/rlhf_reward': 0.9215006828308105, 'objective/scores': 2.6794469356536865, 'policy/approxkl_avg': 0.0005846373387612402, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.003774183802306652, 'loss/value_avg': 0.048969414085149765, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.05885615199804306, 'val/ratio': 0.9984444379806519, 'val/ratio_var': 2.411027935522725e-06, 'val/num_eos_tokens': 0, 'lr': 4.6472317491425774e-05, 'episode': 580, 'epoch': 0.07}



                                                                       
  7%|▋         | 146/2041 [12:44<2:43:07,  5.16s/it]

{'eps': 0, 'objective/kl': 35.65613555908203, 'objective/entropy': 3.4780378341674805, 'objective/non_score_reward': -1.7828067541122437, 'objective/rlhf_reward': 1.0721145868301392, 'objective/scores': 2.854921340942383, 'policy/approxkl_avg': 0.006309213116765022, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.004397119861096144, 'loss/value_avg': 0.0459774024784565, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.061048123985528946, 'val/ratio': 1.0158352851867676, 'val/ratio_var': 0.0002595876285340637, 'val/num_eos_tokens': 0, 'lr': 4.644781969622734e-05, 'episode': 584, 'epoch': 0.07}



                                                                       
  7%|▋         | 147/2041 [12:49<2:42:15,  5.14s/it]

{'eps': 0, 'objective/kl': 36.89942932128906, 'objective/entropy': 4.877870559692383, 'objective/non_score_reward': -1.8449716567993164, 'objective/rlhf_reward': 1.0535516738891602, 'objective/scores': 2.8985233306884766, 'policy/approxkl_avg': 0.012439759448170662, 'policy/clipfrac_avg': 0.024764152243733406, 'loss/policy_avg': -0.0033154808916151524, 'loss/value_avg': 0.05090276524424553, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09821444749832153, 'val/ratio': 1.0056474208831787, 'val/ratio_var': 1.9361228623893112e-05, 'val/num_eos_tokens': 0, 'lr': 4.642332190102891e-05, 'episode': 588, 'epoch': 0.07}



                                                                       
  7%|▋         | 148/2041 [12:54<2:42:28,  5.15s/it]

{'eps': 0, 'objective/kl': 36.87541198730469, 'objective/entropy': 5.761834621429443, 'objective/non_score_reward': -1.8437705039978027, 'objective/rlhf_reward': 1.476942539215088, 'objective/scores': 3.3207130432128906, 'policy/approxkl_avg': 0.01093227043747902, 'policy/clipfrac_avg': 0.05188679322600365, 'loss/policy_avg': -0.0023834765888750553, 'loss/value_avg': 0.11450056731700897, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2095995843410492, 'val/ratio': 0.9895594120025635, 'val/ratio_var': 9.271346061723307e-05, 'val/num_eos_tokens': 0, 'lr': 4.639882410583048e-05, 'episode': 592, 'epoch': 0.07}



                                                                       
  7%|▋         | 149/2041 [12:59<2:42:18,  5.15s/it]

{'eps': 0, 'objective/kl': 35.67674255371094, 'objective/entropy': 9.62722396850586, 'objective/non_score_reward': -1.7838371992111206, 'objective/rlhf_reward': 1.5970813035964966, 'objective/scores': 3.380918502807617, 'policy/approxkl_avg': 0.02612708881497383, 'policy/clipfrac_avg': 0.10023584961891174, 'loss/policy_avg': -0.005704186391085386, 'loss/value_avg': 0.09094478189945221, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2131597101688385, 'val/ratio': 0.9876312017440796, 'val/ratio_var': 7.239980186568573e-05, 'val/num_eos_tokens': 0, 'lr': 4.6374326310632046e-05, 'episode': 596, 'epoch': 0.07}



                                                                       
  7%|▋         | 150/2041 [13:04<2:42:18,  5.15s/it]

{'eps': 0, 'objective/kl': 38.54878616333008, 'objective/entropy': 12.682916641235352, 'objective/non_score_reward': -1.9274394512176514, 'objective/rlhf_reward': 1.1380977630615234, 'objective/scores': 3.065537214279175, 'policy/approxkl_avg': 0.0019405626226216555, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.0032254518009722233, 'loss/value_avg': 0.07120844721794128, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1843009889125824, 'val/ratio': 1.0007927417755127, 'val/ratio_var': 3.239105353713967e-05, 'val/num_eos_tokens': 0, 'lr': 4.6349828515433614e-05, 'episode': 600, 'epoch': 0.07}



                                                                       
  7%|▋         | 151/2041 [13:09<2:42:01,  5.14s/it]

{'eps': 0, 'objective/kl': 35.57807540893555, 'objective/entropy': 3.2923903465270996, 'objective/non_score_reward': -1.778903841972351, 'objective/rlhf_reward': 0.9962953329086304, 'objective/scores': 2.7751991748809814, 'policy/approxkl_avg': 0.0016901446506381035, 'policy/clipfrac_avg': 0.015330188907682896, 'loss/policy_avg': -0.007268931716680527, 'loss/value_avg': 0.048865411430597305, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0948135256767273, 'val/ratio': 0.9968558549880981, 'val/ratio_var': 6.335871603369014e-06, 'val/num_eos_tokens': 0, 'lr': 4.632533072023518e-05, 'episode': 604, 'epoch': 0.07}



                                                                       
  7%|▋         | 152/2041 [13:15<2:42:24,  5.16s/it]

{'eps': 0, 'objective/kl': 36.976131439208984, 'objective/entropy': 8.620983123779297, 'objective/non_score_reward': -1.8488065004348755, 'objective/rlhf_reward': 1.3667525053024292, 'objective/scores': 3.2155590057373047, 'policy/approxkl_avg': 0.0021752119064331055, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.005628315266221762, 'loss/value_avg': 0.054277241230010986, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.10442443192005157, 'val/ratio': 0.9950148463249207, 'val/ratio_var': 1.585969039297197e-05, 'val/num_eos_tokens': 0, 'lr': 4.630083292503675e-05, 'episode': 608, 'epoch': 0.07}



                                                                       
  7%|▋         | 153/2041 [13:20<2:42:38,  5.17s/it]

{'eps': 0, 'objective/kl': 34.65172576904297, 'objective/entropy': 6.226445198059082, 'objective/non_score_reward': -1.732586145401001, 'objective/rlhf_reward': 1.59926176071167, 'objective/scores': 3.331847906112671, 'policy/approxkl_avg': 0.0007743012974970043, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.0034652650356292725, 'loss/value_avg': 0.05203508958220482, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11238561570644379, 'val/ratio': 0.9995511770248413, 'val/ratio_var': 5.392942625803698e-07, 'val/num_eos_tokens': 0, 'lr': 4.627633512983832e-05, 'episode': 612, 'epoch': 0.07}



                                                                       
  8%|▊         | 154/2041 [13:25<2:42:44,  5.17s/it]

{'eps': 0, 'objective/kl': 39.660972595214844, 'objective/entropy': 15.931875228881836, 'objective/non_score_reward': -1.9830485582351685, 'objective/rlhf_reward': 1.5932918787002563, 'objective/scores': 3.576340436935425, 'policy/approxkl_avg': 0.03622918576002121, 'policy/clipfrac_avg': 0.12146226316690445, 'loss/policy_avg': -0.002411874942481518, 'loss/value_avg': 0.09571465849876404, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.38629376888275146, 'val/ratio': 0.9820653796195984, 'val/ratio_var': 0.00016772236267570406, 'val/num_eos_tokens': 0, 'lr': 4.6251837334639886e-05, 'episode': 616, 'epoch': 0.08}



                                                                       
  8%|▊         | 155/2041 [13:30<2:42:45,  5.18s/it]

{'eps': 0, 'objective/kl': 35.095069885253906, 'objective/entropy': 18.42256736755371, 'objective/non_score_reward': -1.7547533512115479, 'objective/rlhf_reward': 1.6182734966278076, 'objective/scores': 3.3730268478393555, 'policy/approxkl_avg': 0.010551990009844303, 'policy/clipfrac_avg': 0.05896226316690445, 'loss/policy_avg': -0.011221055872738361, 'loss/value_avg': 0.10471904277801514, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.42249035835266113, 'val/ratio': 0.9892785549163818, 'val/ratio_var': 0.00011082150012953207, 'val/num_eos_tokens': 0, 'lr': 4.622733953944145e-05, 'episode': 620, 'epoch': 0.08}



                                                                       
  8%|▊         | 156/2041 [13:35<2:42:54,  5.19s/it]

{'eps': 0, 'objective/kl': 41.14375686645508, 'objective/entropy': 18.38106918334961, 'objective/non_score_reward': -2.057188034057617, 'objective/rlhf_reward': 0.9457828998565674, 'objective/scores': 3.0029709339141846, 'policy/approxkl_avg': 0.010592677630484104, 'policy/clipfrac_avg': 0.07193396240472794, 'loss/policy_avg': -0.0010361559689044952, 'loss/value_avg': 0.08135398477315903, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.3253176510334015, 'val/ratio': 0.9929437041282654, 'val/ratio_var': 3.543392813298851e-05, 'val/num_eos_tokens': 0, 'lr': 4.620284174424302e-05, 'episode': 624, 'epoch': 0.08}



                                                                       
  8%|▊         | 157/2041 [13:40<2:42:18,  5.17s/it]

{'eps': 0, 'objective/kl': 41.22984313964844, 'objective/entropy': 17.854766845703125, 'objective/non_score_reward': -2.061492443084717, 'objective/rlhf_reward': 1.0962786674499512, 'objective/scores': 3.157771110534668, 'policy/approxkl_avg': 0.0030585615895688534, 'policy/clipfrac_avg': 0.015330188907682896, 'loss/policy_avg': -0.008866080082952976, 'loss/value_avg': 0.12181556224822998, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.3637581467628479, 'val/ratio': 1.001312017440796, 'val/ratio_var': 2.0400113498908468e-05, 'val/num_eos_tokens': 0, 'lr': 4.617834394904459e-05, 'episode': 628, 'epoch': 0.08}



                                                                       
  8%|▊         | 158/2041 [13:46<2:42:37,  5.18s/it]

{'eps': 0, 'objective/kl': 44.02880096435547, 'objective/entropy': 18.70953941345215, 'objective/non_score_reward': -2.20143985748291, 'objective/rlhf_reward': 0.7155704498291016, 'objective/scores': 2.9170103073120117, 'policy/approxkl_avg': 0.005023635923862457, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.004674287047237158, 'loss/value_avg': 0.10757280886173248, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.33063167333602905, 'val/ratio': 1.0005927085876465, 'val/ratio_var': 5.661288014380261e-06, 'val/num_eos_tokens': 0, 'lr': 4.615384615384616e-05, 'episode': 632, 'epoch': 0.08}



                                                                       
  8%|▊         | 159/2041 [13:51<2:42:48,  5.19s/it]

{'eps': 0, 'objective/kl': 35.926292419433594, 'objective/entropy': 8.087965965270996, 'objective/non_score_reward': -1.7963144779205322, 'objective/rlhf_reward': 0.7466855049133301, 'objective/scores': 2.5429999828338623, 'policy/approxkl_avg': 0.010376510210335255, 'policy/clipfrac_avg': 0.028301887214183807, 'loss/policy_avg': -0.00404017698019743, 'loss/value_avg': 0.07276476919651031, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14048820734024048, 'val/ratio': 0.9890831708908081, 'val/ratio_var': 7.59613249101676e-05, 'val/num_eos_tokens': 0, 'lr': 4.612934835864772e-05, 'episode': 636, 'epoch': 0.08}



                                                                       
  8%|▊         | 160/2041 [13:56<2:43:24,  5.21s/it]

{'eps': 0, 'objective/kl': 39.163124084472656, 'objective/entropy': 8.509784698486328, 'objective/non_score_reward': -1.9581562280654907, 'objective/rlhf_reward': 0.6650105714797974, 'objective/scores': 2.623166799545288, 'policy/approxkl_avg': 0.00722584780305624, 'policy/clipfrac_avg': 0.041273586452007294, 'loss/policy_avg': -0.00823371484875679, 'loss/value_avg': 0.06674133241176605, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16767024993896484, 'val/ratio': 0.9957840442657471, 'val/ratio_var': 1.2702999811153859e-05, 'val/num_eos_tokens': 0, 'lr': 4.6104850563449294e-05, 'episode': 640, 'epoch': 0.08}



                                                                       
  8%|▊         | 161/2041 [14:01<2:42:35,  5.19s/it]

{'eps': 0, 'objective/kl': 41.51166915893555, 'objective/entropy': 13.995454788208008, 'objective/non_score_reward': -2.0755834579467773, 'objective/rlhf_reward': 0.8777914047241211, 'objective/scores': 2.9533748626708984, 'policy/approxkl_avg': 0.06100703775882721, 'policy/clipfrac_avg': 0.11438679695129395, 'loss/policy_avg': -0.0008619270520284772, 'loss/value_avg': 0.07785262167453766, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.20422780513763428, 'val/ratio': 0.9986120462417603, 'val/ratio_var': 7.68240715842694e-06, 'val/num_eos_tokens': 0, 'lr': 4.608035276825086e-05, 'episode': 644, 'epoch': 0.08}



                                                                       
  8%|▊         | 162/2041 [14:06<2:42:18,  5.18s/it]

{'eps': 0, 'objective/kl': 37.360679626464844, 'objective/entropy': 6.490554332733154, 'objective/non_score_reward': -1.8680342435836792, 'objective/rlhf_reward': 1.1603513956069946, 'objective/scores': 3.028385639190674, 'policy/approxkl_avg': 0.008403157815337181, 'policy/clipfrac_avg': 0.02712264098227024, 'loss/policy_avg': -0.008108473382890224, 'loss/value_avg': 0.04592249542474747, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1145995631814003, 'val/ratio': 0.9980019330978394, 'val/ratio_var': 2.499116590115591e-06, 'val/num_eos_tokens': 0, 'lr': 4.6055854973052424e-05, 'episode': 648, 'epoch': 0.08}



                                                                       
  8%|▊         | 163/2041 [14:12<2:41:53,  5.17s/it]

{'eps': 0, 'objective/kl': 35.50275421142578, 'objective/entropy': 4.6862993240356445, 'objective/non_score_reward': -1.7751376628875732, 'objective/rlhf_reward': 1.3668415546417236, 'objective/scores': 3.141979217529297, 'policy/approxkl_avg': 0.003084630472585559, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.005662649404257536, 'loss/value_avg': 0.04739096760749817, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09255224466323853, 'val/ratio': 1.0045697689056396, 'val/ratio_var': 1.9239283574279398e-05, 'val/num_eos_tokens': 0, 'lr': 4.6031357177854e-05, 'episode': 652, 'epoch': 0.08}



                                                                       
  8%|▊         | 164/2041 [14:17<2:41:20,  5.16s/it]

{'eps': 0, 'objective/kl': 36.60369110107422, 'objective/entropy': 5.165100574493408, 'objective/non_score_reward': -1.8301846981048584, 'objective/rlhf_reward': 1.4390101432800293, 'objective/scores': 3.2691948413848877, 'policy/approxkl_avg': 0.0006116409786045551, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.00047116458881646395, 'loss/value_avg': 0.04762505739927292, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.17832043766975403, 'val/ratio': 0.9951915740966797, 'val/ratio_var': 1.1172832273587119e-05, 'val/num_eos_tokens': 0, 'lr': 4.6006859382655566e-05, 'episode': 656, 'epoch': 0.08}



                                                                       
  8%|▊         | 165/2041 [14:22<2:41:25,  5.16s/it]

{'eps': 0, 'objective/kl': 41.294979095458984, 'objective/entropy': 12.267993927001953, 'objective/non_score_reward': -2.064749002456665, 'objective/rlhf_reward': 1.4833776950836182, 'objective/scores': 3.548126697540283, 'policy/approxkl_avg': 0.003156946273520589, 'policy/clipfrac_avg': 0.036556605249643326, 'loss/policy_avg': -0.007148160133510828, 'loss/value_avg': 0.08577549457550049, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.23238950967788696, 'val/ratio': 0.9987019300460815, 'val/ratio_var': 7.486216873076046e-06, 'val/num_eos_tokens': 0, 'lr': 4.598236158745713e-05, 'episode': 660, 'epoch': 0.08}



                                                                       
  8%|▊         | 166/2041 [14:27<2:41:52,  5.18s/it]

{'eps': 0, 'objective/kl': 34.751739501953125, 'objective/entropy': 4.995331764221191, 'objective/non_score_reward': -1.7375869750976562, 'objective/rlhf_reward': 1.5098564624786377, 'objective/scores': 3.247443437576294, 'policy/approxkl_avg': 0.0004956152988597751, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.001047074212692678, 'loss/value_avg': 0.06006338819861412, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15736468136310577, 'val/ratio': 0.9993761777877808, 'val/ratio_var': 4.397138582135085e-06, 'val/num_eos_tokens': 0, 'lr': 4.5957863792258696e-05, 'episode': 664, 'epoch': 0.08}



                                                                       
  8%|▊         | 167/2041 [14:32<2:42:03,  5.19s/it]

{'eps': 0, 'objective/kl': 39.70453643798828, 'objective/entropy': 12.343390464782715, 'objective/non_score_reward': -1.9852266311645508, 'objective/rlhf_reward': 1.8081886768341064, 'objective/scores': 3.7934153079986572, 'policy/approxkl_avg': 0.016774993389844894, 'policy/clipfrac_avg': 0.09669811278581619, 'loss/policy_avg': -0.006304648704826832, 'loss/value_avg': 0.1000843346118927, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.28300920128822327, 'val/ratio': 1.0026017427444458, 'val/ratio_var': 5.687955763278296e-06, 'val/num_eos_tokens': 0, 'lr': 4.593336599706027e-05, 'episode': 668, 'epoch': 0.08}



                                                                       
  8%|▊         | 168/2041 [14:37<2:41:41,  5.18s/it]

{'eps': 0, 'objective/kl': 37.77736282348633, 'objective/entropy': 14.359600067138672, 'objective/non_score_reward': -1.8888683319091797, 'objective/rlhf_reward': 1.856194257736206, 'objective/scores': 3.7450625896453857, 'policy/approxkl_avg': 0.0020816363394260406, 'policy/clipfrac_avg': 0.028301887214183807, 'loss/policy_avg': -0.006880374159663916, 'loss/value_avg': 0.09595853090286255, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.26297658681869507, 'val/ratio': 0.9985829591751099, 'val/ratio_var': 1.5042569430079311e-06, 'val/num_eos_tokens': 0, 'lr': 4.590886820186183e-05, 'episode': 672, 'epoch': 0.08}



                                                                       
  8%|▊         | 169/2041 [14:43<2:41:54,  5.19s/it]

{'eps': 0, 'objective/kl': 39.151145935058594, 'objective/entropy': 16.533924102783203, 'objective/non_score_reward': -1.9575574398040771, 'objective/rlhf_reward': 1.8174419403076172, 'objective/scores': 3.7749993801116943, 'policy/approxkl_avg': 0.0030434392392635345, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.004753784742206335, 'loss/value_avg': 0.10845386236906052, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.29707401990890503, 'val/ratio': 1.0006494522094727, 'val/ratio_var': 2.147344503100612e-06, 'val/num_eos_tokens': 0, 'lr': 4.58843704066634e-05, 'episode': 676, 'epoch': 0.08}



                                                                       
  8%|▊         | 170/2041 [14:48<2:41:47,  5.19s/it]

{'eps': 0, 'objective/kl': 37.54145431518555, 'objective/entropy': 11.903359413146973, 'objective/non_score_reward': -1.8770726919174194, 'objective/rlhf_reward': 1.7358375787734985, 'objective/scores': 3.612910270690918, 'policy/approxkl_avg': 0.007017326541244984, 'policy/clipfrac_avg': 0.0554245263338089, 'loss/policy_avg': -0.0061507499776780605, 'loss/value_avg': 0.08495059609413147, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.27153509855270386, 'val/ratio': 1.000456690788269, 'val/ratio_var': 1.883849108708091e-06, 'val/num_eos_tokens': 0, 'lr': 4.585987261146497e-05, 'episode': 680, 'epoch': 0.08}



                                                                       
  8%|▊         | 171/2041 [14:53<2:40:45,  5.16s/it]

{'eps': 0, 'objective/kl': 39.7712287902832, 'objective/entropy': 15.491304397583008, 'objective/non_score_reward': -1.9885613918304443, 'objective/rlhf_reward': 1.863011360168457, 'objective/scores': 3.8515727519989014, 'policy/approxkl_avg': 0.0016336911357939243, 'policy/clipfrac_avg': 0.017688680440187454, 'loss/policy_avg': -0.0015584107022732496, 'loss/value_avg': 0.09279318898916245, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2821735143661499, 'val/ratio': 1.0001863241195679, 'val/ratio_var': 8.213001819967758e-06, 'val/num_eos_tokens': 0, 'lr': 4.583537481626654e-05, 'episode': 684, 'epoch': 0.08}



                                                                       
  8%|▊         | 172/2041 [14:58<2:40:36,  5.16s/it]

{'eps': 0, 'objective/kl': 40.92139434814453, 'objective/entropy': 17.288780212402344, 'objective/non_score_reward': -2.046069622039795, 'objective/rlhf_reward': 1.7746481895446777, 'objective/scores': 3.8207178115844727, 'policy/approxkl_avg': 0.0023528768215328455, 'policy/clipfrac_avg': 0.02712264284491539, 'loss/policy_avg': -0.00849227886646986, 'loss/value_avg': 0.11226125806570053, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.32971346378326416, 'val/ratio': 1.001945972442627, 'val/ratio_var': 8.894262464309577e-06, 'val/num_eos_tokens': 0, 'lr': 4.5810877021068104e-05, 'episode': 688, 'epoch': 0.08}



                                                                       
  8%|▊         | 173/2041 [15:03<2:40:16,  5.15s/it]

{'eps': 0, 'objective/kl': 40.374778747558594, 'objective/entropy': 14.513599395751953, 'objective/non_score_reward': -2.0187389850616455, 'objective/rlhf_reward': 1.8720645904541016, 'objective/scores': 3.890803575515747, 'policy/approxkl_avg': 0.006409823894500732, 'policy/clipfrac_avg': 0.033018868416547775, 'loss/policy_avg': -0.004103752784430981, 'loss/value_avg': 0.09944282472133636, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.27670982480049133, 'val/ratio': 1.0005500316619873, 'val/ratio_var': 9.199225701195246e-07, 'val/num_eos_tokens': 0, 'lr': 4.578637922586967e-05, 'episode': 692, 'epoch': 0.08}



                                                                       
  9%|▊         | 174/2041 [15:08<2:39:58,  5.14s/it]

{'eps': 0, 'objective/kl': 37.83220672607422, 'objective/entropy': 13.943809509277344, 'objective/non_score_reward': -1.8916103839874268, 'objective/rlhf_reward': 1.4562580585479736, 'objective/scores': 3.3478684425354004, 'policy/approxkl_avg': 0.022316990420222282, 'policy/clipfrac_avg': 0.0625, 'loss/policy_avg': -0.007283987943083048, 'loss/value_avg': 0.07674521207809448, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2114778459072113, 'val/ratio': 0.9912350177764893, 'val/ratio_var': 3.792467396124266e-05, 'val/num_eos_tokens': 0, 'lr': 4.576188143067125e-05, 'episode': 696, 'epoch': 0.09}



                                                                       
  9%|▊         | 175/2041 [15:14<2:40:49,  5.17s/it]

{'eps': 0, 'objective/kl': 35.78602600097656, 'objective/entropy': 13.440879821777344, 'objective/non_score_reward': -1.7893015146255493, 'objective/rlhf_reward': 2.1404457092285156, 'objective/scores': 3.9297473430633545, 'policy/approxkl_avg': 0.05308790132403374, 'policy/clipfrac_avg': 0.11320754885673523, 'loss/policy_avg': -0.0023104187566787004, 'loss/value_avg': 0.1223975419998169, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.22277136147022247, 'val/ratio': 1.0039150714874268, 'val/ratio_var': 3.272222602390684e-05, 'val/num_eos_tokens': 0, 'lr': 4.573738363547281e-05, 'episode': 700, 'epoch': 0.09}



                                                                       
  9%|▊         | 176/2041 [15:19<2:40:47,  5.17s/it]

{'eps': 0, 'objective/kl': 39.287811279296875, 'objective/entropy': 12.718406677246094, 'objective/non_score_reward': -1.964390754699707, 'objective/rlhf_reward': 2.0043628215789795, 'objective/scores': 3.9687535762786865, 'policy/approxkl_avg': 0.024322954937815666, 'policy/clipfrac_avg': 0.05778301879763603, 'loss/policy_avg': -0.011502099223434925, 'loss/value_avg': 0.12992346286773682, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.21552425622940063, 'val/ratio': 0.9875892996788025, 'val/ratio_var': 0.00032539560925215483, 'val/num_eos_tokens': 0, 'lr': 4.5712885840274376e-05, 'episode': 704, 'epoch': 0.09}



                                                                       
  9%|▊         | 177/2041 [15:24<2:41:25,  5.20s/it]

{'eps': 0, 'objective/kl': 42.95761489868164, 'objective/entropy': 12.068042755126953, 'objective/non_score_reward': -2.1478805541992188, 'objective/rlhf_reward': 1.8161940574645996, 'objective/scores': 3.9640746116638184, 'policy/approxkl_avg': 0.02400929108262062, 'policy/clipfrac_avg': 0.06367924809455872, 'loss/policy_avg': -0.013753984123468399, 'loss/value_avg': 0.15562012791633606, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.19513067603111267, 'val/ratio': 0.9922296404838562, 'val/ratio_var': 3.8037589547457173e-05, 'val/num_eos_tokens': 0, 'lr': 4.5688388045075944e-05, 'episode': 708, 'epoch': 0.09}



                                                                       
  9%|▊         | 178/2041 [15:29<2:41:51,  5.21s/it]

{'eps': 0, 'objective/kl': 37.83735275268555, 'objective/entropy': 4.385076522827148, 'objective/non_score_reward': -1.8918675184249878, 'objective/rlhf_reward': 1.8664034605026245, 'objective/scores': 3.7582709789276123, 'policy/approxkl_avg': 0.0008233568514697254, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.003618914168328047, 'loss/value_avg': 0.046660084277391434, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11390172690153122, 'val/ratio': 1.0043972730636597, 'val/ratio_var': 9.540534847474191e-06, 'val/num_eos_tokens': 0, 'lr': 4.566389024987751e-05, 'episode': 712, 'epoch': 0.09}



                                                                       
  9%|▉         | 179/2041 [15:34<2:41:08,  5.19s/it]

{'eps': 0, 'objective/kl': 41.382789611816406, 'objective/entropy': 8.082560539245605, 'objective/non_score_reward': -2.0691394805908203, 'objective/rlhf_reward': 1.8098242282867432, 'objective/scores': 3.8789637088775635, 'policy/approxkl_avg': 0.04100736975669861, 'policy/clipfrac_avg': 0.02712264098227024, 'loss/policy_avg': -0.0032002138905227184, 'loss/value_avg': 0.07344231754541397, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09727220982313156, 'val/ratio': 0.9872850179672241, 'val/ratio_var': 8.914170757634565e-05, 'val/num_eos_tokens': 0, 'lr': 4.563939245467908e-05, 'episode': 716, 'epoch': 0.09}



                                                                       
  9%|▉         | 180/2041 [15:40<2:41:39,  5.21s/it]

{'eps': 0, 'objective/kl': 37.795867919921875, 'objective/entropy': 6.090153694152832, 'objective/non_score_reward': -1.8897936344146729, 'objective/rlhf_reward': 2.124702215194702, 'objective/scores': 4.014495849609375, 'policy/approxkl_avg': 0.03700804337859154, 'policy/clipfrac_avg': 0.040094342082738876, 'loss/policy_avg': -0.001252402551472187, 'loss/value_avg': 0.0685984194278717, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1699591428041458, 'val/ratio': 1.0260047912597656, 'val/ratio_var': 0.00040860477020032704, 'val/num_eos_tokens': 0, 'lr': 4.561489465948065e-05, 'episode': 720, 'epoch': 0.09}



                                                                       
  9%|▉         | 181/2041 [15:45<2:41:12,  5.20s/it]

{'eps': 0, 'objective/kl': 39.646812438964844, 'objective/entropy': 4.512598037719727, 'objective/non_score_reward': -1.9823405742645264, 'objective/rlhf_reward': 1.7663228511810303, 'objective/scores': 3.7486634254455566, 'policy/approxkl_avg': 0.005377134773880243, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.004788469523191452, 'loss/value_avg': 0.06142555922269821, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12758105993270874, 'val/ratio': 1.000827431678772, 'val/ratio_var': 4.814407930098241e-06, 'val/num_eos_tokens': 0, 'lr': 4.5590396864282216e-05, 'episode': 724, 'epoch': 0.09}



                                                                       
  9%|▉         | 182/2041 [15:50<2:40:17,  5.17s/it]

{'eps': 0, 'objective/kl': 43.840476989746094, 'objective/entropy': 4.239185810089111, 'objective/non_score_reward': -2.192023992538452, 'objective/rlhf_reward': 1.657850980758667, 'objective/scores': 3.849874973297119, 'policy/approxkl_avg': 0.004415649920701981, 'policy/clipfrac_avg': 0.024764150381088257, 'loss/policy_avg': -0.00810755230486393, 'loss/value_avg': 0.11739042401313782, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11651332676410675, 'val/ratio': 0.9961835741996765, 'val/ratio_var': 1.9728253391804174e-05, 'val/num_eos_tokens': 0, 'lr': 4.5565899069083784e-05, 'episode': 728, 'epoch': 0.09}



                                                                       
  9%|▉         | 183/2041 [15:55<2:40:06,  5.17s/it]

{'eps': 0, 'objective/kl': 39.133544921875, 'objective/entropy': 4.90446662902832, 'objective/non_score_reward': -1.9566771984100342, 'objective/rlhf_reward': 0.9330644607543945, 'objective/scores': 2.8897416591644287, 'policy/approxkl_avg': 0.008276667445898056, 'policy/clipfrac_avg': 0.03773584961891174, 'loss/policy_avg': -0.010560364462435246, 'loss/value_avg': 0.10182538628578186, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.08380406349897385, 'val/ratio': 0.9862070679664612, 'val/ratio_var': 0.00013030617265030742, 'val/num_eos_tokens': 0, 'lr': 4.554140127388535e-05, 'episode': 732, 'epoch': 0.09}



                                                                       
  9%|▉         | 184/2041 [16:00<2:40:48,  5.20s/it]

{'eps': 0, 'objective/kl': 38.6881103515625, 'objective/entropy': 3.552155017852783, 'objective/non_score_reward': -1.9344055652618408, 'objective/rlhf_reward': 1.1231000423431396, 'objective/scores': 3.0575056076049805, 'policy/approxkl_avg': 0.004879331681877375, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.007771622389554977, 'loss/value_avg': 0.09144418686628342, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.08994529396295547, 'val/ratio': 1.0005607604980469, 'val/ratio_var': 7.458646109625988e-07, 'val/num_eos_tokens': 0, 'lr': 4.551690347868692e-05, 'episode': 736, 'epoch': 0.09}



                                                                       
  9%|▉         | 185/2041 [16:06<2:40:52,  5.20s/it]

{'eps': 0, 'objective/kl': 46.19679641723633, 'objective/entropy': 9.08409595489502, 'objective/non_score_reward': -2.309839963912964, 'objective/rlhf_reward': 1.0727696418762207, 'objective/scores': 3.3826096057891846, 'policy/approxkl_avg': 0.0400504432618618, 'policy/clipfrac_avg': 0.05778301879763603, 'loss/policy_avg': -0.013687603175640106, 'loss/value_avg': 0.1451301872730255, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.19859054684638977, 'val/ratio': 1.0110740661621094, 'val/ratio_var': 0.00018261419609189034, 'val/num_eos_tokens': 0, 'lr': 4.549240568348849e-05, 'episode': 740, 'epoch': 0.09}



                                                                       
  9%|▉         | 186/2041 [16:11<2:40:10,  5.18s/it]

{'eps': 0, 'objective/kl': 53.2000846862793, 'objective/entropy': 10.22661304473877, 'objective/non_score_reward': -2.660004138946533, 'objective/rlhf_reward': 0.3115997314453125, 'objective/scores': 2.9716038703918457, 'policy/approxkl_avg': 0.03052845224738121, 'policy/clipfrac_avg': 0.06839622557163239, 'loss/policy_avg': -0.015752702951431274, 'loss/value_avg': 0.26878076791763306, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16660329699516296, 'val/ratio': 0.9888291358947754, 'val/ratio_var': 5.6952809245558456e-05, 'val/num_eos_tokens': 0, 'lr': 4.5467907888290057e-05, 'episode': 744, 'epoch': 0.09}



                                                                       
  9%|▉         | 187/2041 [16:16<2:40:10,  5.18s/it]

{'eps': 0, 'objective/kl': 52.508243560791016, 'objective/entropy': 9.632719039916992, 'objective/non_score_reward': -2.6254119873046875, 'objective/rlhf_reward': 0.007110118865966797, 'objective/scores': 2.6325221061706543, 'policy/approxkl_avg': 0.056753404438495636, 'policy/clipfrac_avg': 0.044811319559812546, 'loss/policy_avg': -0.01341230422258377, 'loss/value_avg': 0.37884604930877686, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.17954696714878082, 'val/ratio': 0.9921183586120605, 'val/ratio_var': 4.863475260208361e-05, 'val/num_eos_tokens': 1, 'lr': 4.5443410093091625e-05, 'episode': 748, 'epoch': 0.09}



                                                                       
  9%|▉         | 188/2041 [16:21<2:39:29,  5.16s/it]

{'eps': 0, 'objective/kl': 55.47532653808594, 'objective/entropy': 10.363481521606445, 'objective/non_score_reward': -2.77376651763916, 'objective/rlhf_reward': 0.4046347141265869, 'objective/scores': 3.178401231765747, 'policy/approxkl_avg': 0.012130826711654663, 'policy/clipfrac_avg': 0.0554245300590992, 'loss/policy_avg': -0.00673577468842268, 'loss/value_avg': 0.37378519773483276, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1518688201904297, 'val/ratio': 0.9858298301696777, 'val/ratio_var': 0.00011775215534726158, 'val/num_eos_tokens': 0, 'lr': 4.541891229789319e-05, 'episode': 752, 'epoch': 0.09}



                                                                       
  9%|▉         | 189/2041 [16:26<2:39:30,  5.17s/it]

{'eps': 0, 'objective/kl': 54.3533935546875, 'objective/entropy': 6.635310649871826, 'objective/non_score_reward': -2.71766996383667, 'objective/rlhf_reward': 0.6551740169525146, 'objective/scores': 3.3728439807891846, 'policy/approxkl_avg': 0.028299929574131966, 'policy/clipfrac_avg': 0.016509434208273888, 'loss/policy_avg': -0.003551133442670107, 'loss/value_avg': 0.5072471499443054, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12569420039653778, 'val/ratio': 0.9895954728126526, 'val/ratio_var': 7.635539077455178e-05, 'val/num_eos_tokens': 3, 'lr': 4.539441450269476e-05, 'episode': 756, 'epoch': 0.09}



                                                                       
  9%|▉         | 190/2041 [16:31<2:39:50,  5.18s/it]

{'eps': 0, 'objective/kl': 47.40943145751953, 'objective/entropy': 6.9938130378723145, 'objective/non_score_reward': -2.370471477508545, 'objective/rlhf_reward': 1.0575368404388428, 'objective/scores': 3.4280083179473877, 'policy/approxkl_avg': 0.31258222460746765, 'policy/clipfrac_avg': 0.03537736088037491, 'loss/policy_avg': -0.009806070476770401, 'loss/value_avg': 0.1350824534893036, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13611051440238953, 'val/ratio': 0.9885216951370239, 'val/ratio_var': 6.0678594309138134e-05, 'val/num_eos_tokens': 0, 'lr': 4.536991670749633e-05, 'episode': 760, 'epoch': 0.09}



                                                                       
  9%|▉         | 191/2041 [16:37<2:39:34,  5.18s/it]

{'eps': 0, 'objective/kl': 44.04616928100586, 'objective/entropy': 4.723623275756836, 'objective/non_score_reward': -2.202308416366577, 'objective/rlhf_reward': 1.0830895900726318, 'objective/scores': 3.285398006439209, 'policy/approxkl_avg': 0.04401802644133568, 'policy/clipfrac_avg': 0.04363207519054413, 'loss/policy_avg': -0.01140465959906578, 'loss/value_avg': 0.0766478106379509, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1804559975862503, 'val/ratio': 0.9834598302841187, 'val/ratio_var': 0.00047005616943351924, 'val/num_eos_tokens': 0, 'lr': 4.534541891229789e-05, 'episode': 764, 'epoch': 0.09}



                                                                       
  9%|▉         | 192/2041 [16:42<2:39:56,  5.19s/it]

{'eps': 0, 'objective/kl': 51.5201530456543, 'objective/entropy': 12.865375518798828, 'objective/non_score_reward': -2.576007843017578, 'objective/rlhf_reward': 0.4135465621948242, 'objective/scores': 2.9895544052124023, 'policy/approxkl_avg': 0.049308113753795624, 'policy/clipfrac_avg': 0.056603770703077316, 'loss/policy_avg': -0.01745528355240822, 'loss/value_avg': 0.1630808562040329, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.18311338126659393, 'val/ratio': 0.9826552867889404, 'val/ratio_var': 0.00019546777184586972, 'val/num_eos_tokens': 0, 'lr': 4.5320921117099465e-05, 'episode': 768, 'epoch': 0.09}



                                                                       
  9%|▉         | 193/2041 [16:47<2:39:53,  5.19s/it]

{'eps': 0, 'objective/kl': 41.65142822265625, 'objective/entropy': 4.114953517913818, 'objective/non_score_reward': -2.082571506500244, 'objective/rlhf_reward': 0.21511459350585938, 'objective/scores': 2.2976861000061035, 'policy/approxkl_avg': 0.07927332073450089, 'policy/clipfrac_avg': 0.049528300762176514, 'loss/policy_avg': -0.012534437701106071, 'loss/value_avg': 0.1171785369515419, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0583917461335659, 'val/ratio': 0.9684332609176636, 'val/ratio_var': 0.0005584924947470427, 'val/num_eos_tokens': 0, 'lr': 4.529642332190103e-05, 'episode': 772, 'epoch': 0.09}



                                                                       
 10%|▉         | 194/2041 [16:52<2:39:36,  5.18s/it]

{'eps': 0, 'objective/kl': 36.996395111083984, 'objective/entropy': 5.335104942321777, 'objective/non_score_reward': -1.8498198986053467, 'objective/rlhf_reward': 1.1633172035217285, 'objective/scores': 3.013137102127075, 'policy/approxkl_avg': 0.02034713514149189, 'policy/clipfrac_avg': 0.03066037781536579, 'loss/policy_avg': -0.0009404802694916725, 'loss/value_avg': 0.054540760815143585, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.07015913724899292, 'val/ratio': 1.008094072341919, 'val/ratio_var': 3.0263399821706116e-05, 'val/num_eos_tokens': 0, 'lr': 4.5271925526702594e-05, 'episode': 776, 'epoch': 0.1}



                                                                       
 10%|▉         | 195/2041 [16:57<2:38:39,  5.16s/it]

{'eps': 0, 'objective/kl': 35.923622131347656, 'objective/entropy': 0.2735414505004883, 'objective/non_score_reward': -1.796181082725525, 'objective/rlhf_reward': 1.272750735282898, 'objective/scores': 3.068931818008423, 'policy/approxkl_avg': 1.1352205547154881e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -9.39841538638575e-06, 'loss/value_avg': 0.018383312970399857, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.04451502114534378, 'val/ratio': 0.9971969127655029, 'val/ratio_var': 6.693849627481541e-06, 'val/num_eos_tokens': 0, 'lr': 4.524742773150417e-05, 'episode': 780, 'epoch': 0.1}



                                                                       
 10%|▉         | 196/2041 [17:02<2:39:10,  5.18s/it]

{'eps': 0, 'objective/kl': 39.59361267089844, 'objective/entropy': 8.7360258102417, 'objective/non_score_reward': -1.9796807765960693, 'objective/rlhf_reward': 1.2792754173278809, 'objective/scores': 3.25895619392395, 'policy/approxkl_avg': 0.0017075175419449806, 'policy/clipfrac_avg': 0.017688680440187454, 'loss/policy_avg': -0.007351896725594997, 'loss/value_avg': 0.05205965042114258, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1317664235830307, 'val/ratio': 0.9969090819358826, 'val/ratio_var': 5.9964881984342355e-06, 'val/num_eos_tokens': 0, 'lr': 4.522292993630574e-05, 'episode': 784, 'epoch': 0.1}



                                                                       
 10%|▉         | 197/2041 [17:08<2:39:42,  5.20s/it]

{'eps': 0, 'objective/kl': 35.18440628051758, 'objective/entropy': 10.358325958251953, 'objective/non_score_reward': -1.7592203617095947, 'objective/rlhf_reward': 1.3247556686401367, 'objective/scores': 3.0839760303497314, 'policy/approxkl_avg': 0.008089608512818813, 'policy/clipfrac_avg': 0.03655660152435303, 'loss/policy_avg': -0.01097355131059885, 'loss/value_avg': 0.06937792152166367, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14484165608882904, 'val/ratio': 0.9922921657562256, 'val/ratio_var': 3.684606053866446e-05, 'val/num_eos_tokens': 0, 'lr': 4.5198432141107305e-05, 'episode': 788, 'epoch': 0.1}



                                                                       
 10%|▉         | 198/2041 [17:13<2:38:56,  5.17s/it]

{'eps': 0, 'objective/kl': 37.42763900756836, 'objective/entropy': 8.621920585632324, 'objective/non_score_reward': -1.8713819980621338, 'objective/rlhf_reward': 1.5851149559020996, 'objective/scores': 3.4564969539642334, 'policy/approxkl_avg': 0.015462973155081272, 'policy/clipfrac_avg': 0.021226415410637856, 'loss/policy_avg': -0.003959035500884056, 'loss/value_avg': 0.043747082352638245, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.18409854173660278, 'val/ratio': 1.018388032913208, 'val/ratio_var': 0.0002628863148856908, 'val/num_eos_tokens': 0, 'lr': 4.5173934345908866e-05, 'episode': 792, 'epoch': 0.1}



                                                                       
 10%|▉         | 199/2041 [17:18<2:38:05,  5.15s/it]

{'eps': 0, 'objective/kl': 41.091251373291016, 'objective/entropy': 11.539445877075195, 'objective/non_score_reward': -2.054562568664551, 'objective/rlhf_reward': 1.3578581809997559, 'objective/scores': 3.4124207496643066, 'policy/approxkl_avg': 0.005923489108681679, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.006486373487859964, 'loss/value_avg': 0.05457314848899841, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.20424693822860718, 'val/ratio': 0.9942374229431152, 'val/ratio_var': 3.665843541966751e-05, 'val/num_eos_tokens': 0, 'lr': 4.514943655071044e-05, 'episode': 796, 'epoch': 0.1}



                                                                       
 10%|▉         | 200/2041 [17:23<2:38:40,  5.17s/it]

{'eps': 0, 'objective/kl': 36.946998596191406, 'objective/entropy': 10.432537078857422, 'objective/non_score_reward': -1.8473498821258545, 'objective/rlhf_reward': 0.7313477993011475, 'objective/scores': 2.578697681427002, 'policy/approxkl_avg': 0.008012598380446434, 'policy/clipfrac_avg': 0.04009433835744858, 'loss/policy_avg': -0.013144792057573795, 'loss/value_avg': 0.08588938415050507, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1642998456954956, 'val/ratio': 0.9860597252845764, 'val/ratio_var': 0.0001578990340931341, 'val/num_eos_tokens': 0, 'lr': 4.512493875551201e-05, 'episode': 800, 'epoch': 0.1}



                                                                       
 10%|▉         | 201/2041 [17:28<2:39:06,  5.19s/it]

{'eps': 0, 'objective/kl': 41.871849060058594, 'objective/entropy': 12.890911102294922, 'objective/non_score_reward': -2.093592405319214, 'objective/rlhf_reward': 1.1043815612792969, 'objective/scores': 3.1979739665985107, 'policy/approxkl_avg': 0.01497445534914732, 'policy/clipfrac_avg': 0.049528300762176514, 'loss/policy_avg': -0.007418056484311819, 'loss/value_avg': 0.08152171969413757, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2569689452648163, 'val/ratio': 0.9958829283714294, 'val/ratio_var': 8.472019544569775e-06, 'val/num_eos_tokens': 0, 'lr': 4.510044096031357e-05, 'episode': 804, 'epoch': 0.1}



                                                                       
 10%|▉         | 202/2041 [17:34<2:39:29,  5.20s/it]

{'eps': 0, 'objective/kl': 38.78450393676758, 'objective/entropy': 10.437366485595703, 'objective/non_score_reward': -1.939225435256958, 'objective/rlhf_reward': 0.7029457092285156, 'objective/scores': 2.6421711444854736, 'policy/approxkl_avg': 0.011902350932359695, 'policy/clipfrac_avg': 0.0625, 'loss/policy_avg': -0.015245198272168636, 'loss/value_avg': 0.07782311737537384, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.18819990754127502, 'val/ratio': 0.9991783499717712, 'val/ratio_var': 5.374963052418025e-07, 'val/num_eos_tokens': 0, 'lr': 4.5075943165115145e-05, 'episode': 808, 'epoch': 0.1}



                                                                       
 10%|▉         | 203/2041 [17:39<2:38:45,  5.18s/it]

{'eps': 0, 'objective/kl': 38.25154113769531, 'objective/entropy': 12.189001083374023, 'objective/non_score_reward': -1.9125769138336182, 'objective/rlhf_reward': 1.3033862113952637, 'objective/scores': 3.215963125228882, 'policy/approxkl_avg': 0.00388232315890491, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.006320873275399208, 'loss/value_avg': 0.06047726050019264, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.26672351360321045, 'val/ratio': 1.005598783493042, 'val/ratio_var': 2.732409666350577e-05, 'val/num_eos_tokens': 0, 'lr': 4.505144536991671e-05, 'episode': 812, 'epoch': 0.1}



                                                                       
 10%|▉         | 204/2041 [17:44<2:38:46,  5.19s/it]

{'eps': 0, 'objective/kl': 35.214717864990234, 'objective/entropy': 7.116078853607178, 'objective/non_score_reward': -1.760736107826233, 'objective/rlhf_reward': 0.5230017900466919, 'objective/scores': 2.283737897872925, 'policy/approxkl_avg': 0.012482733465731144, 'policy/clipfrac_avg': 0.049528300762176514, 'loss/policy_avg': -0.015061349608004093, 'loss/value_avg': 0.07009799778461456, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09922172129154205, 'val/ratio': 0.9854731559753418, 'val/ratio_var': 0.00015715851623099297, 'val/num_eos_tokens': 0, 'lr': 4.5026947574718275e-05, 'episode': 816, 'epoch': 0.1}



                                                                       
 10%|█         | 205/2041 [17:49<2:38:51,  5.19s/it]

{'eps': 0, 'objective/kl': 39.08777618408203, 'objective/entropy': 4.798330307006836, 'objective/non_score_reward': -1.954388976097107, 'objective/rlhf_reward': 1.3514429330825806, 'objective/scores': 3.3058319091796875, 'policy/approxkl_avg': 0.002252711448818445, 'policy/clipfrac_avg': 0.015330188907682896, 'loss/policy_avg': -0.00809796154499054, 'loss/value_avg': 0.04163285344839096, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15416885912418365, 'val/ratio': 0.9921932220458984, 'val/ratio_var': 6.971069524297491e-05, 'val/num_eos_tokens': 0, 'lr': 4.500244977951984e-05, 'episode': 820, 'epoch': 0.1}



                                                                       
 10%|█         | 206/2041 [17:57<3:06:56,  6.11s/it]

{'eps': 0, 'objective/kl': 42.05260467529297, 'objective/entropy': 11.352415084838867, 'objective/non_score_reward': -2.102630138397217, 'objective/rlhf_reward': 1.0156359672546387, 'objective/scores': 3.1182661056518555, 'policy/approxkl_avg': 0.010955284349620342, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.010856308974325657, 'loss/value_avg': 0.04679938033223152, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15759792923927307, 'val/ratio': 0.9916508197784424, 'val/ratio_var': 3.89452870876994e-05, 'val/num_eos_tokens': 0, 'lr': 4.497795198432142e-05, 'episode': 824, 'epoch': 0.1}



                                                                       
 10%|█         | 207/2041 [18:03<2:58:34,  5.84s/it]

{'eps': 0, 'objective/kl': 41.2537841796875, 'objective/entropy': 8.144973754882812, 'objective/non_score_reward': -2.0626890659332275, 'objective/rlhf_reward': 1.1225101947784424, 'objective/scores': 3.18519926071167, 'policy/approxkl_avg': 0.004845474846661091, 'policy/clipfrac_avg': 0.03537735715508461, 'loss/policy_avg': -0.011728156358003616, 'loss/value_avg': 0.04580456018447876, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.20686236023902893, 'val/ratio': 0.9926208257675171, 'val/ratio_var': 6.849199417047203e-05, 'val/num_eos_tokens': 0, 'lr': 4.495345418912298e-05, 'episode': 828, 'epoch': 0.1}



                                                                       
 10%|█         | 208/2041 [18:08<2:52:21,  5.64s/it]

{'eps': 0, 'objective/kl': 41.06779861450195, 'objective/entropy': 9.393627166748047, 'objective/non_score_reward': -2.0533900260925293, 'objective/rlhf_reward': 1.0876343250274658, 'objective/scores': 3.141024351119995, 'policy/approxkl_avg': 0.014740731567144394, 'policy/clipfrac_avg': 0.04599056765437126, 'loss/policy_avg': -0.004873958881944418, 'loss/value_avg': 0.047712795436382294, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16749286651611328, 'val/ratio': 0.9907311201095581, 'val/ratio_var': 4.9571168347029015e-05, 'val/num_eos_tokens': 0, 'lr': 4.492895639392455e-05, 'episode': 832, 'epoch': 0.1}



                                                                       
 10%|█         | 209/2041 [18:13<2:48:10,  5.51s/it]

{'eps': 0, 'objective/kl': 40.09307861328125, 'objective/entropy': 8.971412658691406, 'objective/non_score_reward': -2.0046536922454834, 'objective/rlhf_reward': 1.1537306308746338, 'objective/scores': 3.158384323120117, 'policy/approxkl_avg': 0.004603125154972076, 'policy/clipfrac_avg': 0.03183962404727936, 'loss/policy_avg': -0.003659514943137765, 'loss/value_avg': 0.044782042503356934, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.22709137201309204, 'val/ratio': 0.9983747005462646, 'val/ratio_var': 3.4575696190586314e-06, 'val/num_eos_tokens': 0, 'lr': 4.4904458598726115e-05, 'episode': 836, 'epoch': 0.1}



                                                                       
 10%|█         | 210/2041 [18:18<2:45:32,  5.42s/it]

{'eps': 0, 'objective/kl': 40.026611328125, 'objective/entropy': 11.914738655090332, 'objective/non_score_reward': -2.001330614089966, 'objective/rlhf_reward': 1.2110717296600342, 'objective/scores': 3.21240234375, 'policy/approxkl_avg': 0.0032804012298583984, 'policy/clipfrac_avg': 0.041273586452007294, 'loss/policy_avg': -0.011059246025979519, 'loss/value_avg': 0.05581193044781685, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.22410929203033447, 'val/ratio': 0.9968580007553101, 'val/ratio_var': 1.0557996574789286e-05, 'val/num_eos_tokens': 0, 'lr': 4.487996080352768e-05, 'episode': 840, 'epoch': 0.1}



                                                                       
 10%|█         | 211/2041 [18:23<2:42:48,  5.34s/it]

{'eps': 0, 'objective/kl': 41.928314208984375, 'objective/entropy': 11.049315452575684, 'objective/non_score_reward': -2.0964157581329346, 'objective/rlhf_reward': 1.1853859424591064, 'objective/scores': 3.281801700592041, 'policy/approxkl_avg': 0.008185631595551968, 'policy/clipfrac_avg': 0.033018868416547775, 'loss/policy_avg': -0.01256062462925911, 'loss/value_avg': 0.04171207547187805, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.21336603164672852, 'val/ratio': 0.986669659614563, 'val/ratio_var': 0.0001388497621519491, 'val/num_eos_tokens': 0, 'lr': 4.485546300832925e-05, 'episode': 844, 'epoch': 0.1}



                                                                       
 10%|█         | 212/2041 [18:29<2:40:55,  5.28s/it]

{'eps': 0, 'objective/kl': 36.32837677001953, 'objective/entropy': 10.31696605682373, 'objective/non_score_reward': -1.8164187669754028, 'objective/rlhf_reward': 1.4410122632980347, 'objective/scores': 3.2574310302734375, 'policy/approxkl_avg': 0.005239120684564114, 'policy/clipfrac_avg': 0.02712264284491539, 'loss/policy_avg': -0.004382893908768892, 'loss/value_avg': 0.03964463993906975, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.20521263778209686, 'val/ratio': 0.9986019730567932, 'val/ratio_var': 1.5465486285393126e-05, 'val/num_eos_tokens': 0, 'lr': 4.483096521313082e-05, 'episode': 848, 'epoch': 0.1}



                                                                       
 10%|█         | 213/2041 [18:34<2:38:57,  5.22s/it]

{'eps': 0, 'objective/kl': 38.86288833618164, 'objective/entropy': 13.018682479858398, 'objective/non_score_reward': -1.9431443214416504, 'objective/rlhf_reward': 1.5932519435882568, 'objective/scores': 3.5363962650299072, 'policy/approxkl_avg': 0.004820013884454966, 'policy/clipfrac_avg': 0.02712264284491539, 'loss/policy_avg': -0.006508568301796913, 'loss/value_avg': 0.025827988982200623, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.20205098390579224, 'val/ratio': 1.0014569759368896, 'val/ratio_var': 2.522984459574218e-06, 'val/num_eos_tokens': 0, 'lr': 4.4806467417932394e-05, 'episode': 852, 'epoch': 0.1}



                                                                       
 10%|█         | 214/2041 [18:39<2:38:25,  5.20s/it]

{'eps': 0, 'objective/kl': 38.876983642578125, 'objective/entropy': 11.99203872680664, 'objective/non_score_reward': -1.9438490867614746, 'objective/rlhf_reward': 1.502176284790039, 'objective/scores': 3.4460253715515137, 'policy/approxkl_avg': 0.015446759760379791, 'policy/clipfrac_avg': 0.06485849618911743, 'loss/policy_avg': -0.0010879725450649858, 'loss/value_avg': 0.03936012461781502, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.21682652831077576, 'val/ratio': 1.0168272256851196, 'val/ratio_var': 0.00013289037451613694, 'val/num_eos_tokens': 0, 'lr': 4.4781969622733955e-05, 'episode': 856, 'epoch': 0.1}



                                                                       
 11%|█         | 215/2041 [18:44<2:37:47,  5.18s/it]

{'eps': 0, 'objective/kl': 44.960411071777344, 'objective/entropy': 15.740501403808594, 'objective/non_score_reward': -2.248020648956299, 'objective/rlhf_reward': 0.4403362274169922, 'objective/scores': 2.688356876373291, 'policy/approxkl_avg': 0.009173521772027016, 'policy/clipfrac_avg': 0.06839622557163239, 'loss/policy_avg': -0.016244972124695778, 'loss/value_avg': 0.08553099632263184, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.24151504039764404, 'val/ratio': 0.9834260940551758, 'val/ratio_var': 0.0001557416544528678, 'val/num_eos_tokens': 0, 'lr': 4.475747182753552e-05, 'episode': 860, 'epoch': 0.11}



                                                                       
 11%|█         | 216/2041 [18:49<2:37:22,  5.17s/it]

{'eps': 0, 'objective/kl': 41.571311950683594, 'objective/entropy': 13.407329559326172, 'objective/non_score_reward': -2.0785655975341797, 'objective/rlhf_reward': 1.3675026893615723, 'objective/scores': 3.446068286895752, 'policy/approxkl_avg': 0.004160721320658922, 'policy/clipfrac_avg': 0.03773584961891174, 'loss/policy_avg': -0.0059625799767673016, 'loss/value_avg': 0.047330908477306366, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2802159786224365, 'val/ratio': 1.000248908996582, 'val/ratio_var': 7.862465736252489e-07, 'val/num_eos_tokens': 0, 'lr': 4.473297403233709e-05, 'episode': 864, 'epoch': 0.11}



                                                                       
 11%|█         | 217/2041 [18:54<2:37:32,  5.18s/it]

{'eps': 0, 'objective/kl': 39.20814514160156, 'objective/entropy': 16.618877410888672, 'objective/non_score_reward': -1.9604073762893677, 'objective/rlhf_reward': 0.9729987382888794, 'objective/scores': 2.933406114578247, 'policy/approxkl_avg': 0.045251794159412384, 'policy/clipfrac_avg': 0.11438679695129395, 'loss/policy_avg': 0.0029019510839134455, 'loss/value_avg': 0.06960324943065643, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.23119452595710754, 'val/ratio': 0.9731917381286621, 'val/ratio_var': 0.00033055245876312256, 'val/num_eos_tokens': 0, 'lr': 4.470847623713866e-05, 'episode': 868, 'epoch': 0.11}



                                                                       
 11%|█         | 218/2041 [18:59<2:36:55,  5.17s/it]

{'eps': 0, 'objective/kl': 43.14392852783203, 'objective/entropy': 14.340719223022461, 'objective/non_score_reward': -2.157196521759033, 'objective/rlhf_reward': 1.3701233863830566, 'objective/scores': 3.52731990814209, 'policy/approxkl_avg': 0.004726002924144268, 'policy/clipfrac_avg': 0.033018868416547775, 'loss/policy_avg': -0.002435325412079692, 'loss/value_avg': 0.046590469777584076, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.30908632278442383, 'val/ratio': 1.0017672777175903, 'val/ratio_var': 1.4171109796734527e-06, 'val/num_eos_tokens': 0, 'lr': 4.468397844194023e-05, 'episode': 872, 'epoch': 0.11}



                                                                       
 11%|█         | 219/2041 [19:05<2:36:52,  5.17s/it]

{'eps': 0, 'objective/kl': 38.26182556152344, 'objective/entropy': 9.315788269042969, 'objective/non_score_reward': -1.9130910634994507, 'objective/rlhf_reward': 1.6343432664871216, 'objective/scores': 3.5474343299865723, 'policy/approxkl_avg': 0.009819644503295422, 'policy/clipfrac_avg': 0.05896226316690445, 'loss/policy_avg': -0.01038258895277977, 'loss/value_avg': 0.020714297890663147, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.31369882822036743, 'val/ratio': 0.9830939769744873, 'val/ratio_var': 0.0002840098168235272, 'val/num_eos_tokens': 0, 'lr': 4.4659480646741795e-05, 'episode': 876, 'epoch': 0.11}



                                                                       
 11%|█         | 220/2041 [19:10<2:36:01,  5.14s/it]

{'eps': 0, 'objective/kl': 42.431312561035156, 'objective/entropy': 16.106733322143555, 'objective/non_score_reward': -2.121565818786621, 'objective/rlhf_reward': 1.1891391277313232, 'objective/scores': 3.3107049465179443, 'policy/approxkl_avg': 0.00470017408952117, 'policy/clipfrac_avg': 0.04009433835744858, 'loss/policy_avg': -0.007498886436223984, 'loss/value_avg': 0.035049401223659515, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.37250810861587524, 'val/ratio': 0.9939009547233582, 'val/ratio_var': 4.557524880510755e-05, 'val/num_eos_tokens': 0, 'lr': 4.463498285154336e-05, 'episode': 880, 'epoch': 0.11}



                                                                       
 11%|█         | 221/2041 [19:15<2:36:38,  5.16s/it]

{'eps': 0, 'objective/kl': 47.5775146484375, 'objective/entropy': 17.43748664855957, 'objective/non_score_reward': -2.378875494003296, 'objective/rlhf_reward': 0.5416510105133057, 'objective/scores': 2.9205265045166016, 'policy/approxkl_avg': 0.052510663866996765, 'policy/clipfrac_avg': 0.12382075190544128, 'loss/policy_avg': 0.005016550421714783, 'loss/value_avg': 0.07706166803836823, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2602514624595642, 'val/ratio': 0.979343593120575, 'val/ratio_var': 0.00020893185865134, 'val/num_eos_tokens': 0, 'lr': 4.461048505634493e-05, 'episode': 884, 'epoch': 0.11}



                                                                       
 11%|█         | 222/2041 [19:20<2:36:19,  5.16s/it]

{'eps': 0, 'objective/kl': 37.278709411621094, 'objective/entropy': 8.75888442993164, 'objective/non_score_reward': -1.8639354705810547, 'objective/rlhf_reward': 1.3746037483215332, 'objective/scores': 3.238539218902588, 'policy/approxkl_avg': 0.0012634925078600645, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.003115586470812559, 'loss/value_avg': 0.020219020545482635, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.23307907581329346, 'val/ratio': 0.9992013573646545, 'val/ratio_var': 2.5511355943308445e-06, 'val/num_eos_tokens': 0, 'lr': 4.45859872611465e-05, 'episode': 888, 'epoch': 0.11}



                                                                       
 11%|█         | 223/2041 [19:25<2:36:06,  5.15s/it]

{'eps': 0, 'objective/kl': 41.869163513183594, 'objective/entropy': 15.28891372680664, 'objective/non_score_reward': -2.093458414077759, 'objective/rlhf_reward': 0.9912970066070557, 'objective/scores': 3.0847554206848145, 'policy/approxkl_avg': 0.009604757651686668, 'policy/clipfrac_avg': 0.05070754513144493, 'loss/policy_avg': -0.010265428572893143, 'loss/value_avg': 0.06789014488458633, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.21633519232273102, 'val/ratio': 0.9903740882873535, 'val/ratio_var': 6.364675209624693e-05, 'val/num_eos_tokens': 0, 'lr': 4.456148946594807e-05, 'episode': 892, 'epoch': 0.11}



                                                                       
 11%|█         | 224/2041 [19:30<2:36:19,  5.16s/it]

{'eps': 0, 'objective/kl': 38.701717376708984, 'objective/entropy': 11.840131759643555, 'objective/non_score_reward': -1.9350858926773071, 'objective/rlhf_reward': 0.8479593992233276, 'objective/scores': 2.7830452919006348, 'policy/approxkl_avg': 0.00487561197951436, 'policy/clipfrac_avg': 0.03773584961891174, 'loss/policy_avg': -0.00824984721839428, 'loss/value_avg': 0.058872513473033905, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.18520605564117432, 'val/ratio': 1.001866340637207, 'val/ratio_var': 2.7130906801176025e-06, 'val/num_eos_tokens': 0, 'lr': 4.4536991670749635e-05, 'episode': 896, 'epoch': 0.11}



                                                                       
 11%|█         | 225/2041 [19:36<2:37:11,  5.19s/it]

{'eps': 0, 'objective/kl': 37.68519592285156, 'objective/entropy': 9.981688499450684, 'objective/non_score_reward': -1.8842597007751465, 'objective/rlhf_reward': 1.455679178237915, 'objective/scores': 3.3399388790130615, 'policy/approxkl_avg': 0.00731606176123023, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.008684823289513588, 'loss/value_avg': 0.03220782428979874, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15824872255325317, 'val/ratio': 1.0109941959381104, 'val/ratio_var': 9.53481430769898e-05, 'val/num_eos_tokens': 0, 'lr': 4.4512493875551204e-05, 'episode': 900, 'epoch': 0.11}



                                                                       
 11%|█         | 226/2041 [19:41<2:37:12,  5.20s/it]

{'eps': 0, 'objective/kl': 36.04558563232422, 'objective/entropy': 8.562112808227539, 'objective/non_score_reward': -1.8022794723510742, 'objective/rlhf_reward': 1.4382009506225586, 'objective/scores': 3.240480422973633, 'policy/approxkl_avg': 0.004822366870939732, 'policy/clipfrac_avg': 0.02712264284491539, 'loss/policy_avg': -0.005766476504504681, 'loss/value_avg': 0.033263593912124634, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.19850996136665344, 'val/ratio': 1.0018281936645508, 'val/ratio_var': 1.7905193772094208e-06, 'val/num_eos_tokens': 0, 'lr': 4.448799608035277e-05, 'episode': 904, 'epoch': 0.11}



                                                                       
 11%|█         | 227/2041 [19:46<2:36:42,  5.18s/it]

{'eps': 0, 'objective/kl': 38.36800003051758, 'objective/entropy': 12.334138870239258, 'objective/non_score_reward': -1.9183999300003052, 'objective/rlhf_reward': 0.8781963586807251, 'objective/scores': 2.7965962886810303, 'policy/approxkl_avg': 0.017067808657884598, 'policy/clipfrac_avg': 0.06721697747707367, 'loss/policy_avg': -0.010984945110976696, 'loss/value_avg': 0.06165652722120285, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1926862895488739, 'val/ratio': 0.9875216484069824, 'val/ratio_var': 8.543281728634611e-05, 'val/num_eos_tokens': 0, 'lr': 4.446349828515434e-05, 'episode': 908, 'epoch': 0.11}



                                                                       
 11%|█         | 228/2041 [19:51<2:37:19,  5.21s/it]

{'eps': 0, 'objective/kl': 45.52561950683594, 'objective/entropy': 11.012520790100098, 'objective/non_score_reward': -2.2762808799743652, 'objective/rlhf_reward': 0.881505012512207, 'objective/scores': 3.1577858924865723, 'policy/approxkl_avg': 0.0034833704121410847, 'policy/clipfrac_avg': 0.021226415410637856, 'loss/policy_avg': -0.0050881849601864815, 'loss/value_avg': 0.060301799327135086, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2085016965866089, 'val/ratio': 1.0018986463546753, 'val/ratio_var': 1.3185511306801345e-05, 'val/num_eos_tokens': 0, 'lr': 4.443900048995591e-05, 'episode': 912, 'epoch': 0.11}



                                                                       
 11%|█         | 229/2041 [19:56<2:37:01,  5.20s/it]

{'eps': 0, 'objective/kl': 40.63441467285156, 'objective/entropy': 8.801637649536133, 'objective/non_score_reward': -2.0317208766937256, 'objective/rlhf_reward': 1.4068491458892822, 'objective/scores': 3.438570022583008, 'policy/approxkl_avg': 0.0016127066919580102, 'policy/clipfrac_avg': 0.024764152243733406, 'loss/policy_avg': -0.006211899686604738, 'loss/value_avg': 0.04342825710773468, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.19196730852127075, 'val/ratio': 1.0031840801239014, 'val/ratio_var': 5.506006800715113e-06, 'val/num_eos_tokens': 0, 'lr': 4.4414502694757476e-05, 'episode': 916, 'epoch': 0.11}



                                                                       
 11%|█▏        | 230/2041 [20:01<2:36:05,  5.17s/it]

{'eps': 0, 'objective/kl': 38.17137908935547, 'objective/entropy': 12.386674880981445, 'objective/non_score_reward': -1.9085688591003418, 'objective/rlhf_reward': 1.2045068740844727, 'objective/scores': 3.1130757331848145, 'policy/approxkl_avg': 0.010059996508061886, 'policy/clipfrac_avg': 0.03537736088037491, 'loss/policy_avg': -0.011034340597689152, 'loss/value_avg': 0.03613752871751785, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.17482838034629822, 'val/ratio': 0.9972789287567139, 'val/ratio_var': 8.739650183997583e-06, 'val/num_eos_tokens': 0, 'lr': 4.439000489955904e-05, 'episode': 920, 'epoch': 0.11}



                                                                       
 11%|█▏        | 231/2041 [20:07<2:36:12,  5.18s/it]

{'eps': 0, 'objective/kl': 38.03944396972656, 'objective/entropy': 6.152148246765137, 'objective/non_score_reward': -1.9019720554351807, 'objective/rlhf_reward': 1.3603646755218506, 'objective/scores': 3.2623367309570312, 'policy/approxkl_avg': 0.00725933164358139, 'policy/clipfrac_avg': 0.024764152243733406, 'loss/policy_avg': -0.006251919083297253, 'loss/value_avg': 0.029458902776241302, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0860251784324646, 'val/ratio': 1.003423810005188, 'val/ratio_var': 7.037256636976963e-06, 'val/num_eos_tokens': 0, 'lr': 4.436550710436061e-05, 'episode': 924, 'epoch': 0.11}



                                                                       
 11%|█▏        | 232/2041 [20:12<2:35:28,  5.16s/it]

{'eps': 0, 'objective/kl': 37.280799865722656, 'objective/entropy': 4.414309978485107, 'objective/non_score_reward': -1.8640398979187012, 'objective/rlhf_reward': 1.3909330368041992, 'objective/scores': 3.2549729347229004, 'policy/approxkl_avg': 0.0022579915821552277, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.0018541509052738547, 'loss/value_avg': 0.026156611740589142, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.05106829106807709, 'val/ratio': 1.0063114166259766, 'val/ratio_var': 2.2525091480929404e-05, 'val/num_eos_tokens': 0, 'lr': 4.434100930916218e-05, 'episode': 928, 'epoch': 0.11}



                                                                       
 11%|█▏        | 233/2041 [20:17<2:35:51,  5.17s/it]

{'eps': 0, 'objective/kl': 35.628211975097656, 'objective/entropy': 0.43252038955688477, 'objective/non_score_reward': -1.7814104557037354, 'objective/rlhf_reward': 1.2930941581726074, 'objective/scores': 3.0745046138763428, 'policy/approxkl_avg': 2.7983118343399838e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0017058993689715862, 'loss/value_avg': 0.005625331774353981, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.06368489563465118, 'val/ratio': 0.9956026673316956, 'val/ratio_var': 1.8980304957949556e-05, 'val/num_eos_tokens': 0, 'lr': 4.431651151396374e-05, 'episode': 932, 'epoch': 0.11}



                                                                       
 11%|█▏        | 234/2041 [20:22<2:36:49,  5.21s/it]

{'eps': 0, 'objective/kl': 37.90219497680664, 'objective/entropy': 5.616354942321777, 'objective/non_score_reward': -1.8951095342636108, 'objective/rlhf_reward': 1.3910781145095825, 'objective/scores': 3.2861876487731934, 'policy/approxkl_avg': 0.00880467426031828, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': 0.003655079286545515, 'loss/value_avg': 0.030013179406523705, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14763787388801575, 'val/ratio': 1.0017907619476318, 'val/ratio_var': 3.027141474376549e-06, 'val/num_eos_tokens': 0, 'lr': 4.4292013718765316e-05, 'episode': 936, 'epoch': 0.11}



                                                                       
 12%|█▏        | 235/2041 [20:28<2:37:27,  5.23s/it]

{'eps': 0, 'objective/kl': 37.52643585205078, 'objective/entropy': 5.10687255859375, 'objective/non_score_reward': -1.876321792602539, 'objective/rlhf_reward': 1.622518539428711, 'objective/scores': 3.49884033203125, 'policy/approxkl_avg': 0.0016159226652234793, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': 0.001739748753607273, 'loss/value_avg': 0.030724477022886276, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.10662494599819183, 'val/ratio': 1.0028433799743652, 'val/ratio_var': 4.948749847244471e-06, 'val/num_eos_tokens': 0, 'lr': 4.4267515923566884e-05, 'episode': 940, 'epoch': 0.12}



                                                                       
 12%|█▏        | 236/2041 [20:33<2:36:36,  5.21s/it]

{'eps': 0, 'objective/kl': 37.199378967285156, 'objective/entropy': 5.7628655433654785, 'objective/non_score_reward': -1.8599690198898315, 'objective/rlhf_reward': 1.4363881349563599, 'objective/scores': 3.2963571548461914, 'policy/approxkl_avg': 0.010219217278063297, 'policy/clipfrac_avg': 0.024764150381088257, 'loss/policy_avg': -0.007067790254950523, 'loss/value_avg': 0.030041120946407318, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1786259114742279, 'val/ratio': 0.9970276355743408, 'val/ratio_var': 1.3630036846734583e-05, 'val/num_eos_tokens': 0, 'lr': 4.4243018128368445e-05, 'episode': 944, 'epoch': 0.12}



                                                                       
 12%|█▏        | 237/2041 [20:38<2:36:04,  5.19s/it]

{'eps': 0, 'objective/kl': 36.05042266845703, 'objective/entropy': 8.685432434082031, 'objective/non_score_reward': -1.8025212287902832, 'objective/rlhf_reward': 1.3903822898864746, 'objective/scores': 3.192903518676758, 'policy/approxkl_avg': 0.0030739624053239822, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.007295409217476845, 'loss/value_avg': 0.030244950205087662, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.23089337348937988, 'val/ratio': 1.0034714937210083, 'val/ratio_var': 2.1928273781668395e-05, 'val/num_eos_tokens': 0, 'lr': 4.421852033317001e-05, 'episode': 948, 'epoch': 0.12}



                                                                       
 12%|█▏        | 238/2041 [20:43<2:35:05,  5.16s/it]

{'eps': 0, 'objective/kl': 40.15449523925781, 'objective/entropy': 14.888025283813477, 'objective/non_score_reward': -2.0077247619628906, 'objective/rlhf_reward': 1.4100172519683838, 'objective/scores': 3.4177420139312744, 'policy/approxkl_avg': 0.01822744682431221, 'policy/clipfrac_avg': 0.07311321049928665, 'loss/policy_avg': -0.005554873030632734, 'loss/value_avg': 0.04641208052635193, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.21233031153678894, 'val/ratio': 1.0072743892669678, 'val/ratio_var': 4.983328472007997e-05, 'val/num_eos_tokens': 0, 'lr': 4.419402253797159e-05, 'episode': 952, 'epoch': 0.12}



                                                                       
 12%|█▏        | 239/2041 [20:48<2:35:24,  5.17s/it]

{'eps': 0, 'objective/kl': 40.74954605102539, 'objective/entropy': 13.118633270263672, 'objective/non_score_reward': -2.037477493286133, 'objective/rlhf_reward': 1.3763580322265625, 'objective/scores': 3.4138355255126953, 'policy/approxkl_avg': 0.004056720994412899, 'policy/clipfrac_avg': 0.04363207519054413, 'loss/policy_avg': -0.0004305287729948759, 'loss/value_avg': 0.0345383957028389, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.253853440284729, 'val/ratio': 1.0007404088974, 'val/ratio_var': 3.4946751270581444e-07, 'val/num_eos_tokens': 0, 'lr': 4.4169524742773156e-05, 'episode': 956, 'epoch': 0.12}



                                                                       
 12%|█▏        | 240/2041 [20:53<2:35:47,  5.19s/it]

{'eps': 0, 'objective/kl': 41.490806579589844, 'objective/entropy': 11.840995788574219, 'objective/non_score_reward': -2.074540376663208, 'objective/rlhf_reward': 1.3639798164367676, 'objective/scores': 3.4385201930999756, 'policy/approxkl_avg': 0.007444307208061218, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.0033255591988563538, 'loss/value_avg': 0.03663882613182068, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.17398008704185486, 'val/ratio': 1.0065875053405762, 'val/ratio_var': 0.00010085802205139771, 'val/num_eos_tokens': 0, 'lr': 4.414502694757472e-05, 'episode': 960, 'epoch': 0.12}



                                                                       
 12%|█▏        | 241/2041 [20:59<2:35:18,  5.18s/it]

{'eps': 0, 'objective/kl': 39.26039123535156, 'objective/entropy': 10.287853240966797, 'objective/non_score_reward': -1.9630194902420044, 'objective/rlhf_reward': 1.4340800046920776, 'objective/scores': 3.397099494934082, 'policy/approxkl_avg': 0.0063306656666100025, 'policy/clipfrac_avg': 0.03891509398818016, 'loss/policy_avg': -0.006945744622498751, 'loss/value_avg': 0.03331676125526428, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2529430389404297, 'val/ratio': 0.9981126189231873, 'val/ratio_var': 8.504702236677986e-06, 'val/num_eos_tokens': 0, 'lr': 4.4120529152376285e-05, 'episode': 964, 'epoch': 0.12}



                                                                       
 12%|█▏        | 242/2041 [21:04<2:34:58,  5.17s/it]

{'eps': 0, 'objective/kl': 41.87887954711914, 'objective/entropy': 16.807355880737305, 'objective/non_score_reward': -2.0939440727233887, 'objective/rlhf_reward': 1.151991844177246, 'objective/scores': 3.2459359169006348, 'policy/approxkl_avg': 0.005213518626987934, 'policy/clipfrac_avg': 0.049528300762176514, 'loss/policy_avg': -0.006117427721619606, 'loss/value_avg': 0.04234849661588669, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.3074004054069519, 'val/ratio': 0.9960079193115234, 'val/ratio_var': 1.8636796085047536e-05, 'val/num_eos_tokens': 0, 'lr': 4.409603135717786e-05, 'episode': 968, 'epoch': 0.12}



                                                                       
 12%|█▏        | 243/2041 [21:09<2:34:27,  5.15s/it]

{'eps': 0, 'objective/kl': 41.16102600097656, 'objective/entropy': 16.913536071777344, 'objective/non_score_reward': -2.058051347732544, 'objective/rlhf_reward': 1.314206600189209, 'objective/scores': 3.372257947921753, 'policy/approxkl_avg': 0.004916724748909473, 'policy/clipfrac_avg': 0.04009433835744858, 'loss/policy_avg': -0.010088294744491577, 'loss/value_avg': 0.036739129573106766, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.3652397096157074, 'val/ratio': 0.9892050623893738, 'val/ratio_var': 7.033051952021196e-05, 'val/num_eos_tokens': 0, 'lr': 4.407153356197942e-05, 'episode': 972, 'epoch': 0.12}



                                                                       
 12%|█▏        | 244/2041 [21:14<2:34:54,  5.17s/it]

{'eps': 0, 'objective/kl': 45.50829315185547, 'objective/entropy': 21.618436813354492, 'objective/non_score_reward': -2.2754147052764893, 'objective/rlhf_reward': 1.0122184753417969, 'objective/scores': 3.287633180618286, 'policy/approxkl_avg': 0.011074248701334, 'policy/clipfrac_avg': 0.07193395495414734, 'loss/policy_avg': -0.009506641887128353, 'loss/value_avg': 0.05234619602560997, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.32592350244522095, 'val/ratio': 1.0155822038650513, 'val/ratio_var': 0.00011119421105831861, 'val/num_eos_tokens': 0, 'lr': 4.404703576678099e-05, 'episode': 976, 'epoch': 0.12}



                                                                       
 12%|█▏        | 245/2041 [21:19<2:34:08,  5.15s/it]

{'eps': 0, 'objective/kl': 42.15085983276367, 'objective/entropy': 19.8997802734375, 'objective/non_score_reward': -2.1075427532196045, 'objective/rlhf_reward': 1.2023584842681885, 'objective/scores': 3.309901237487793, 'policy/approxkl_avg': 0.02311934530735016, 'policy/clipfrac_avg': 0.12735849618911743, 'loss/policy_avg': -0.015607243403792381, 'loss/value_avg': 0.05908169969916344, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2946649193763733, 'val/ratio': 0.9942755699157715, 'val/ratio_var': 2.019660132646095e-05, 'val/num_eos_tokens': 0, 'lr': 4.4022537971582564e-05, 'episode': 980, 'epoch': 0.12}



                                                                       
 12%|█▏        | 246/2041 [21:24<2:34:45,  5.17s/it]

{'eps': 0, 'objective/kl': 41.04000473022461, 'objective/entropy': 13.776455879211426, 'objective/non_score_reward': -2.0520002841949463, 'objective/rlhf_reward': 1.102102518081665, 'objective/scores': 3.1541028022766113, 'policy/approxkl_avg': 0.0051919883117079735, 'policy/clipfrac_avg': 0.02712264098227024, 'loss/policy_avg': -0.008019623346626759, 'loss/value_avg': 0.030006730929017067, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.31284645199775696, 'val/ratio': 0.995612621307373, 'val/ratio_var': 1.194422657135874e-05, 'val/num_eos_tokens': 0, 'lr': 4.3998040176384126e-05, 'episode': 984, 'epoch': 0.12}



                                                                       
 12%|█▏        | 247/2041 [21:30<2:34:51,  5.18s/it]

{'eps': 0, 'objective/kl': 38.85826110839844, 'objective/entropy': 11.96580696105957, 'objective/non_score_reward': -1.9429131746292114, 'objective/rlhf_reward': 1.3758801221847534, 'objective/scores': 3.318793296813965, 'policy/approxkl_avg': 0.0032344346400350332, 'policy/clipfrac_avg': 0.022405661642551422, 'loss/policy_avg': -0.003243325976654887, 'loss/value_avg': 0.021143946796655655, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2326795607805252, 'val/ratio': 0.9968249797821045, 'val/ratio_var': 1.115908435167512e-05, 'val/num_eos_tokens': 0, 'lr': 4.3973542381185694e-05, 'episode': 988, 'epoch': 0.12}



                                                                       
 12%|█▏        | 248/2041 [21:35<2:34:47,  5.18s/it]

{'eps': 0, 'objective/kl': 36.59764862060547, 'objective/entropy': 11.605644226074219, 'objective/non_score_reward': -1.8298826217651367, 'objective/rlhf_reward': 1.1286191940307617, 'objective/scores': 2.9585018157958984, 'policy/approxkl_avg': 0.00957840122282505, 'policy/clipfrac_avg': 0.06132075563073158, 'loss/policy_avg': -0.009498769417405128, 'loss/value_avg': 0.05837833881378174, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.27555954456329346, 'val/ratio': 0.9810627102851868, 'val/ratio_var': 0.00021981890313327312, 'val/num_eos_tokens': 0, 'lr': 4.394904458598726e-05, 'episode': 992, 'epoch': 0.12}



                                                                       
 12%|█▏        | 249/2041 [21:40<2:34:16,  5.17s/it]

{'eps': 0, 'objective/kl': 43.298362731933594, 'objective/entropy': 14.94902229309082, 'objective/non_score_reward': -2.1649179458618164, 'objective/rlhf_reward': 0.48616456985473633, 'objective/scores': 2.6510825157165527, 'policy/approxkl_avg': 0.017006807029247284, 'policy/clipfrac_avg': 0.0766509473323822, 'loss/policy_avg': -0.012541687116026878, 'loss/value_avg': 0.059010930359363556, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.26537758111953735, 'val/ratio': 1.0014660358428955, 'val/ratio_var': 1.7018642211041879e-06, 'val/num_eos_tokens': 0, 'lr': 4.392454679078883e-05, 'episode': 996, 'epoch': 0.12}



                                                                       
 12%|█▏        | 250/2041 [21:45<2:34:54,  5.19s/it]

{'eps': 0, 'objective/kl': 36.0301399230957, 'objective/entropy': 19.120994567871094, 'objective/non_score_reward': -1.8015071153640747, 'objective/rlhf_reward': 1.1956294775009155, 'objective/scores': 2.9971365928649902, 'policy/approxkl_avg': 0.007526072673499584, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.01017055194824934, 'loss/value_avg': 0.02068597637116909, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.3096599280834198, 'val/ratio': 1.0159900188446045, 'val/ratio_var': 0.00013419291644822806, 'val/num_eos_tokens': 0, 'lr': 4.39000489955904e-05, 'episode': 1000, 'epoch': 0.12}



                                                                       
 12%|█▏        | 251/2041 [21:50<2:33:45,  5.15s/it]

{'eps': 0, 'objective/kl': 46.125938415527344, 'objective/entropy': 10.603660583496094, 'objective/non_score_reward': -2.3062970638275146, 'objective/rlhf_reward': 0.03134894371032715, 'objective/scores': 2.337646007537842, 'policy/approxkl_avg': 0.01461278647184372, 'policy/clipfrac_avg': 0.07429245114326477, 'loss/policy_avg': -0.01177499070763588, 'loss/value_avg': 0.10397379845380783, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.20267191529273987, 'val/ratio': 0.9967461824417114, 'val/ratio_var': 8.415713637077715e-06, 'val/num_eos_tokens': 0, 'lr': 4.3875551200391966e-05, 'episode': 1004, 'epoch': 0.12}



                                                                       
 12%|█▏        | 252/2041 [21:55<2:33:21,  5.14s/it]

{'eps': 0, 'objective/kl': 35.430076599121094, 'objective/entropy': 0.49737024307250977, 'objective/non_score_reward': -1.7715039253234863, 'objective/rlhf_reward': 1.176419973373413, 'objective/scores': 2.9479238986968994, 'policy/approxkl_avg': 4.578316179504327e-07, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 0.00022671863553114235, 'loss/value_avg': 0.004978869576007128, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.04958155006170273, 'val/ratio': 1.000608205795288, 'val/ratio_var': 2.0512163700914243e-07, 'val/num_eos_tokens': 0, 'lr': 4.385105340519354e-05, 'episode': 1008, 'epoch': 0.12}



                                                                       
 12%|█▏        | 253/2041 [22:00<2:33:35,  5.15s/it]

{'eps': 0, 'objective/kl': 38.73017883300781, 'objective/entropy': 6.161774158477783, 'objective/non_score_reward': -1.9365088939666748, 'objective/rlhf_reward': 0.8047676086425781, 'objective/scores': 2.741276502609253, 'policy/approxkl_avg': 0.0027152877300977707, 'policy/clipfrac_avg': 0.03066037781536579, 'loss/policy_avg': -0.010031705722212791, 'loss/value_avg': 0.027233945205807686, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13192901015281677, 'val/ratio': 0.994006872177124, 'val/ratio_var': 2.4100590962916613e-05, 'val/num_eos_tokens': 0, 'lr': 4.38265556099951e-05, 'episode': 1012, 'epoch': 0.12}



                                                                       
 12%|█▏        | 254/2041 [22:06<2:34:06,  5.17s/it]

{'eps': 0, 'objective/kl': 39.46459197998047, 'objective/entropy': 6.7913384437561035, 'objective/non_score_reward': -1.9732297658920288, 'objective/rlhf_reward': 0.6924983263015747, 'objective/scores': 2.6657280921936035, 'policy/approxkl_avg': 0.007085809484124184, 'policy/clipfrac_avg': 0.04245283082127571, 'loss/policy_avg': -0.013708974234759808, 'loss/value_avg': 0.03844914212822914, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09052343666553497, 'val/ratio': 0.9923652410507202, 'val/ratio_var': 4.337994323577732e-05, 'val/num_eos_tokens': 0, 'lr': 4.380205781479667e-05, 'episode': 1016, 'epoch': 0.12}



                                                                       
 12%|█▏        | 255/2041 [22:11<2:33:44,  5.16s/it]

{'eps': 0, 'objective/kl': 35.31202697753906, 'objective/entropy': 3.1521048545837402, 'objective/non_score_reward': -1.765601396560669, 'objective/rlhf_reward': 1.1828677654266357, 'objective/scores': 2.9484691619873047, 'policy/approxkl_avg': 0.0012235239846631885, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.002961699850857258, 'loss/value_avg': 0.005889089312404394, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.03677406907081604, 'val/ratio': 0.9968048930168152, 'val/ratio_var': 5.514886197488522e-06, 'val/num_eos_tokens': 0, 'lr': 4.377756001959824e-05, 'episode': 1020, 'epoch': 0.12}



                                                                       
 13%|█▎        | 256/2041 [22:16<2:33:14,  5.15s/it]

{'eps': 0, 'objective/kl': 35.377235412597656, 'objective/entropy': 2.159792423248291, 'objective/non_score_reward': -1.7688616514205933, 'objective/rlhf_reward': 1.1487947702407837, 'objective/scores': 2.917656421661377, 'policy/approxkl_avg': 0.0018283568788319826, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.0045878528617322445, 'loss/value_avg': 0.022610152140259743, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.06777622550725937, 'val/ratio': 1.0019580125808716, 'val/ratio_var': 2.796090711854049e-06, 'val/num_eos_tokens': 0, 'lr': 4.3753062224399806e-05, 'episode': 1024, 'epoch': 0.13}



                                                                       
 13%|█▎        | 257/2041 [22:21<2:33:32,  5.16s/it]

{'eps': 0, 'objective/kl': 36.11920928955078, 'objective/entropy': 12.050447463989258, 'objective/non_score_reward': -1.8059604167938232, 'objective/rlhf_reward': 1.086667776107788, 'objective/scores': 2.8926281929016113, 'policy/approxkl_avg': 0.004002509173005819, 'policy/clipfrac_avg': 0.03773584961891174, 'loss/policy_avg': -0.010619674809277058, 'loss/value_avg': 0.020464200526475906, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14093223214149475, 'val/ratio': 1.0026123523712158, 'val/ratio_var': 4.408106633491116e-06, 'val/num_eos_tokens': 0, 'lr': 4.3728564429201374e-05, 'episode': 1028, 'epoch': 0.13}



                                                                       
 13%|█▎        | 258/2041 [22:26<2:33:21,  5.16s/it]

{'eps': 0, 'objective/kl': 33.11474609375, 'objective/entropy': 2.295301914215088, 'objective/non_score_reward': -1.655737280845642, 'objective/rlhf_reward': 1.051988959312439, 'objective/scores': 2.707726240158081, 'policy/approxkl_avg': 0.0003947941295336932, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.0013612036127597094, 'loss/value_avg': 0.006239567883312702, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11350896209478378, 'val/ratio': 0.9945977330207825, 'val/ratio_var': 2.226847573183477e-05, 'val/num_eos_tokens': 0, 'lr': 4.370406663400294e-05, 'episode': 1032, 'epoch': 0.13}



                                                                       
 13%|█▎        | 259/2041 [22:32<2:33:43,  5.18s/it]

{'eps': 0, 'objective/kl': 35.81574630737305, 'objective/entropy': 11.772647857666016, 'objective/non_score_reward': -1.7907873392105103, 'objective/rlhf_reward': 1.326602816581726, 'objective/scores': 3.1173901557922363, 'policy/approxkl_avg': 0.0014966980088502169, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.005435992497950792, 'loss/value_avg': 0.021684208884835243, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.20455606281757355, 'val/ratio': 1.0007933378219604, 'val/ratio_var': 2.8875529096694663e-06, 'val/num_eos_tokens': 0, 'lr': 4.367956883880451e-05, 'episode': 1036, 'epoch': 0.13}



                                                                       
 13%|█▎        | 260/2041 [22:37<2:33:36,  5.17s/it]

{'eps': 0, 'objective/kl': 37.41325378417969, 'objective/entropy': 7.117922782897949, 'objective/non_score_reward': -1.870662808418274, 'objective/rlhf_reward': 1.1769713163375854, 'objective/scores': 3.0476341247558594, 'policy/approxkl_avg': 0.001012755325064063, 'policy/clipfrac_avg': 0.017688680440187454, 'loss/policy_avg': -0.004220035392791033, 'loss/value_avg': 0.021991882473230362, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15611857175827026, 'val/ratio': 1.0008094310760498, 'val/ratio_var': 3.430050981023669e-07, 'val/num_eos_tokens': 0, 'lr': 4.365507104360608e-05, 'episode': 1040, 'epoch': 0.13}



                                                                       
 13%|█▎        | 261/2041 [22:42<2:33:23,  5.17s/it]

{'eps': 0, 'objective/kl': 37.029685974121094, 'objective/entropy': 10.458780288696289, 'objective/non_score_reward': -1.8514842987060547, 'objective/rlhf_reward': 1.2671513557434082, 'objective/scores': 3.118635654449463, 'policy/approxkl_avg': 0.015861425548791885, 'policy/clipfrac_avg': 0.044811323285102844, 'loss/policy_avg': -0.006721868179738522, 'loss/value_avg': 0.027361512184143066, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.17200422286987305, 'val/ratio': 1.0328888893127441, 'val/ratio_var': 0.0005497232195921242, 'val/num_eos_tokens': 0, 'lr': 4.3630573248407646e-05, 'episode': 1044, 'epoch': 0.13}



                                                                       
 13%|█▎        | 262/2041 [22:47<2:33:18,  5.17s/it]

{'eps': 0, 'objective/kl': 40.167076110839844, 'objective/entropy': 9.447000503540039, 'objective/non_score_reward': -2.0083537101745605, 'objective/rlhf_reward': 1.3281655311584473, 'objective/scores': 3.336519241333008, 'policy/approxkl_avg': 0.0008184831240214407, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.0037393125239759684, 'loss/value_avg': 0.025800175964832306, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1486903578042984, 'val/ratio': 1.0006382465362549, 'val/ratio_var': 1.774935412868217e-06, 'val/num_eos_tokens': 0, 'lr': 4.360607545320921e-05, 'episode': 1048, 'epoch': 0.13}



                                                                       
 13%|█▎        | 263/2041 [22:52<2:33:29,  5.18s/it]

{'eps': 0, 'objective/kl': 37.39519500732422, 'objective/entropy': 9.009979248046875, 'objective/non_score_reward': -1.8697597980499268, 'objective/rlhf_reward': 1.144085168838501, 'objective/scores': 3.0138449668884277, 'policy/approxkl_avg': 0.0008028222946450114, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.0015198240289464593, 'loss/value_avg': 0.020558318123221397, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.20988334715366364, 'val/ratio': 1.0005812644958496, 'val/ratio_var': 2.3925574623717694e-06, 'val/num_eos_tokens': 0, 'lr': 4.358157765801078e-05, 'episode': 1052, 'epoch': 0.13}



                                                                       
 13%|█▎        | 264/2041 [22:57<2:33:12,  5.17s/it]

{'eps': 0, 'objective/kl': 44.23360824584961, 'objective/entropy': 11.691645622253418, 'objective/non_score_reward': -2.2116804122924805, 'objective/rlhf_reward': 0.8209457397460938, 'objective/scores': 3.032626152038574, 'policy/approxkl_avg': 0.0011800277279689908, 'policy/clipfrac_avg': 0.015330188907682896, 'loss/policy_avg': -0.006278116721659899, 'loss/value_avg': 0.038983818143606186, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1914435774087906, 'val/ratio': 1.001151442527771, 'val/ratio_var': 8.691028483553964e-07, 'val/num_eos_tokens': 0, 'lr': 4.355707986281235e-05, 'episode': 1056, 'epoch': 0.13}



                                                                       
 13%|█▎        | 265/2041 [23:03<2:32:48,  5.16s/it]

{'eps': 0, 'objective/kl': 37.84832000732422, 'objective/entropy': 9.70113468170166, 'objective/non_score_reward': -1.8924161195755005, 'objective/rlhf_reward': 1.1648684740066528, 'objective/scores': 3.0572845935821533, 'policy/approxkl_avg': 0.005212209653109312, 'policy/clipfrac_avg': 0.021226415410637856, 'loss/policy_avg': -0.004962777718901634, 'loss/value_avg': 0.015709061175584793, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15228703618049622, 'val/ratio': 0.9988735914230347, 'val/ratio_var': 1.5114213738343096e-06, 'val/num_eos_tokens': 0, 'lr': 4.353258206761392e-05, 'episode': 1060, 'epoch': 0.13}



                                                                       
 13%|█▎        | 266/2041 [23:08<2:32:27,  5.15s/it]

{'eps': 0, 'objective/kl': 38.73644256591797, 'objective/entropy': 6.264904975891113, 'objective/non_score_reward': -1.9368219375610352, 'objective/rlhf_reward': 1.1524724960327148, 'objective/scores': 3.08929443359375, 'policy/approxkl_avg': 0.0008461897959932685, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.0033593031112104654, 'loss/value_avg': 0.016620617359876633, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11902794241905212, 'val/ratio': 1.0018577575683594, 'val/ratio_var': 4.74547186968266e-06, 'val/num_eos_tokens': 0, 'lr': 4.3508084272415487e-05, 'episode': 1064, 'epoch': 0.13}



                                                                       
 13%|█▎        | 267/2041 [23:13<2:32:06,  5.14s/it]

{'eps': 0, 'objective/kl': 39.53581237792969, 'objective/entropy': 7.3799147605896, 'objective/non_score_reward': -1.9767906665802002, 'objective/rlhf_reward': 1.2428061962127686, 'objective/scores': 3.2195968627929688, 'policy/approxkl_avg': 0.0014876052737236023, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.00375546608120203, 'loss/value_avg': 0.01637781225144863, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1842394471168518, 'val/ratio': 0.9990123510360718, 'val/ratio_var': 1.0709017033150303e-06, 'val/num_eos_tokens': 0, 'lr': 4.3483586477217055e-05, 'episode': 1068, 'epoch': 0.13}



                                                                       
 13%|█▎        | 268/2041 [23:18<2:32:02,  5.15s/it]

{'eps': 0, 'objective/kl': 40.192142486572266, 'objective/entropy': 11.392515182495117, 'objective/non_score_reward': -2.0096073150634766, 'objective/rlhf_reward': 1.371903896331787, 'objective/scores': 3.3815112113952637, 'policy/approxkl_avg': 0.0035807075910270214, 'policy/clipfrac_avg': 0.03066037781536579, 'loss/policy_avg': -0.007413336541503668, 'loss/value_avg': 0.02534027397632599, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.22790458798408508, 'val/ratio': 1.0063811540603638, 'val/ratio_var': 2.2331807485898025e-05, 'val/num_eos_tokens': 0, 'lr': 4.345908868201862e-05, 'episode': 1072, 'epoch': 0.13}



                                                                       
 13%|█▎        | 269/2041 [23:23<2:32:54,  5.18s/it]

{'eps': 0, 'objective/kl': 51.37782287597656, 'objective/entropy': 16.963335037231445, 'objective/non_score_reward': -2.5688910484313965, 'objective/rlhf_reward': 0.4912147521972656, 'objective/scores': 3.060105800628662, 'policy/approxkl_avg': 0.0059580616652965546, 'policy/clipfrac_avg': 0.024764150381088257, 'loss/policy_avg': -0.0034973081201314926, 'loss/value_avg': 0.15314090251922607, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.23317956924438477, 'val/ratio': 0.9933822154998779, 'val/ratio_var': 2.45829487539595e-05, 'val/num_eos_tokens': 0, 'lr': 4.3434590886820184e-05, 'episode': 1076, 'epoch': 0.13}



                                                                       
 13%|█▎        | 270/2041 [23:28<2:32:48,  5.18s/it]

{'eps': 0, 'objective/kl': 36.80544662475586, 'objective/entropy': 7.316166400909424, 'objective/non_score_reward': -1.8402724266052246, 'objective/rlhf_reward': 1.0787296295166016, 'objective/scores': 2.919002056121826, 'policy/approxkl_avg': 0.001435932470485568, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.0025143364910036325, 'loss/value_avg': 0.01596181094646454, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.20253854990005493, 'val/ratio': 0.9999400973320007, 'val/ratio_var': 8.343126722820671e-08, 'val/num_eos_tokens': 0, 'lr': 4.341009309162176e-05, 'episode': 1080, 'epoch': 0.13}



                                                                       
 13%|█▎        | 271/2041 [23:34<2:32:32,  5.17s/it]

{'eps': 0, 'objective/kl': 40.443870544433594, 'objective/entropy': 12.519330024719238, 'objective/non_score_reward': -2.022193670272827, 'objective/rlhf_reward': 1.1764180660247803, 'objective/scores': 3.1986117362976074, 'policy/approxkl_avg': 0.001970646670088172, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.0034049192909151316, 'loss/value_avg': 0.017187297344207764, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2824058532714844, 'val/ratio': 1.0004023313522339, 'val/ratio_var': 4.054205930970056e-07, 'val/num_eos_tokens': 0, 'lr': 4.338559529642333e-05, 'episode': 1084, 'epoch': 0.13}



                                                                       
 13%|█▎        | 272/2041 [23:39<2:32:24,  5.17s/it]

{'eps': 0, 'objective/kl': 39.20223617553711, 'objective/entropy': 16.21457290649414, 'objective/non_score_reward': -1.9601120948791504, 'objective/rlhf_reward': 1.1983530521392822, 'objective/scores': 3.1584651470184326, 'policy/approxkl_avg': 0.0018233637092635036, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.004038335755467415, 'loss/value_avg': 0.026093626394867897, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.31087160110473633, 'val/ratio': 1.002742052078247, 'val/ratio_var': 5.608892024611123e-06, 'val/num_eos_tokens': 0, 'lr': 4.336109750122489e-05, 'episode': 1088, 'epoch': 0.13}



                                                                       
 13%|█▎        | 273/2041 [23:44<2:32:01,  5.16s/it]

{'eps': 0, 'objective/kl': 37.36286544799805, 'objective/entropy': 15.698322296142578, 'objective/non_score_reward': -1.8681433200836182, 'objective/rlhf_reward': 1.1322057247161865, 'objective/scores': 3.0003490447998047, 'policy/approxkl_avg': 0.006869838573038578, 'policy/clipfrac_avg': 0.04009433835744858, 'loss/policy_avg': -0.008985529653728008, 'loss/value_avg': 0.02923218533396721, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.3604240417480469, 'val/ratio': 0.9852077960968018, 'val/ratio_var': 0.00015700496442150325, 'val/num_eos_tokens': 0, 'lr': 4.333659970602646e-05, 'episode': 1092, 'epoch': 0.13}



                                                                       
 13%|█▎        | 274/2041 [23:49<2:31:03,  5.13s/it]

{'eps': 0, 'objective/kl': 39.11585235595703, 'objective/entropy': 17.075925827026367, 'objective/non_score_reward': -1.955792784690857, 'objective/rlhf_reward': 1.1547304391860962, 'objective/scores': 3.110523223876953, 'policy/approxkl_avg': 0.007409142330288887, 'policy/clipfrac_avg': 0.030660375952720642, 'loss/policy_avg': -0.009829126298427582, 'loss/value_avg': 0.022904470562934875, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.34777092933654785, 'val/ratio': 1.0074172019958496, 'val/ratio_var': 3.465857298579067e-05, 'val/num_eos_tokens': 0, 'lr': 4.331210191082803e-05, 'episode': 1096, 'epoch': 0.13}



                                                                       
 13%|█▎        | 275/2041 [23:54<2:31:04,  5.13s/it]

{'eps': 0, 'objective/kl': 36.880462646484375, 'objective/entropy': 12.37898063659668, 'objective/non_score_reward': -1.8440232276916504, 'objective/rlhf_reward': 0.9177663326263428, 'objective/scores': 2.761789560317993, 'policy/approxkl_avg': 0.008561988361179829, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.014449591748416424, 'loss/value_avg': 0.04109806567430496, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.20879590511322021, 'val/ratio': 1.0052430629730225, 'val/ratio_var': 2.1243840819806792e-05, 'val/num_eos_tokens': 0, 'lr': 4.328760411562959e-05, 'episode': 1100, 'epoch': 0.13}



                                                                       
 14%|█▎        | 276/2041 [23:59<2:31:42,  5.16s/it]

{'eps': 0, 'objective/kl': 37.26462173461914, 'objective/entropy': 16.592853546142578, 'objective/non_score_reward': -1.8632311820983887, 'objective/rlhf_reward': 1.1045684814453125, 'objective/scores': 2.967799663543701, 'policy/approxkl_avg': 0.0034320419654250145, 'policy/clipfrac_avg': 0.02712264098227024, 'loss/policy_avg': -0.003295311937108636, 'loss/value_avg': 0.03314358741044998, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1998245120048523, 'val/ratio': 0.9975183010101318, 'val/ratio_var': 4.577440449793357e-06, 'val/num_eos_tokens': 0, 'lr': 4.326310632043116e-05, 'episode': 1104, 'epoch': 0.14}



                                                                       
 14%|█▎        | 277/2041 [24:04<2:31:06,  5.14s/it]

{'eps': 0, 'objective/kl': 36.15657043457031, 'objective/entropy': 11.269983291625977, 'objective/non_score_reward': -1.807828426361084, 'objective/rlhf_reward': 1.062605857849121, 'objective/scores': 2.870434284210205, 'policy/approxkl_avg': 0.010907422751188278, 'policy/clipfrac_avg': 0.04363207519054413, 'loss/policy_avg': -0.011163334362208843, 'loss/value_avg': 0.030727703124284744, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15883958339691162, 'val/ratio': 0.9895048141479492, 'val/ratio_var': 6.75744959153235e-05, 'val/num_eos_tokens': 0, 'lr': 4.3238608525232735e-05, 'episode': 1108, 'epoch': 0.14}



                                                                       
 14%|█▎        | 278/2041 [24:09<2:30:07,  5.11s/it]

{'eps': 0, 'objective/kl': 34.198116302490234, 'objective/entropy': 9.60105037689209, 'objective/non_score_reward': -1.7099061012268066, 'objective/rlhf_reward': 1.4351587295532227, 'objective/scores': 3.1450648307800293, 'policy/approxkl_avg': 0.017353655770421028, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.0008610205841250718, 'loss/value_avg': 0.014436980709433556, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1376342922449112, 'val/ratio': 1.0385392904281616, 'val/ratio_var': 0.0006673629395663738, 'val/num_eos_tokens': 0, 'lr': 4.3214110730034296e-05, 'episode': 1112, 'epoch': 0.14}



                                                                       
 14%|█▎        | 279/2041 [24:14<2:29:23,  5.09s/it]

{'eps': 0, 'objective/kl': 45.93527603149414, 'objective/entropy': 17.871463775634766, 'objective/non_score_reward': -2.2967638969421387, 'objective/rlhf_reward': 0.6346817016601562, 'objective/scores': 2.931445598602295, 'policy/approxkl_avg': 0.00853944756090641, 'policy/clipfrac_avg': 0.06367924809455872, 'loss/policy_avg': -0.012246946804225445, 'loss/value_avg': 0.07240661978721619, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.20746982097625732, 'val/ratio': 0.9940818548202515, 'val/ratio_var': 1.725867514323909e-05, 'val/num_eos_tokens': 0, 'lr': 4.3189612934835864e-05, 'episode': 1116, 'epoch': 0.14}



                                                                       
 14%|█▎        | 280/2041 [24:19<2:29:16,  5.09s/it]

{'eps': 0, 'objective/kl': 32.24076843261719, 'objective/entropy': 4.155628204345703, 'objective/non_score_reward': -1.6120383739471436, 'objective/rlhf_reward': 1.2155983448028564, 'objective/scores': 2.82763671875, 'policy/approxkl_avg': 0.003831026377156377, 'policy/clipfrac_avg': 0.016509434208273888, 'loss/policy_avg': -0.005342666991055012, 'loss/value_avg': 0.015932565554976463, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.22037455439567566, 'val/ratio': 0.9847181439399719, 'val/ratio_var': 0.00021197843307163566, 'val/num_eos_tokens': 0, 'lr': 4.316511513963743e-05, 'episode': 1120, 'epoch': 0.14}



                                                                       
 14%|█▍        | 281/2041 [24:25<2:29:08,  5.08s/it]

{'eps': 0, 'objective/kl': 41.345191955566406, 'objective/entropy': 18.29029083251953, 'objective/non_score_reward': -2.0672597885131836, 'objective/rlhf_reward': 0.7990615367889404, 'objective/scores': 2.866321325302124, 'policy/approxkl_avg': 0.004822565242648125, 'policy/clipfrac_avg': 0.04009433835744858, 'loss/policy_avg': -0.00919243786484003, 'loss/value_avg': 0.036973390728235245, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.306293785572052, 'val/ratio': 1.0013391971588135, 'val/ratio_var': 4.315681962907547e-06, 'val/num_eos_tokens': 0, 'lr': 4.314061734443901e-05, 'episode': 1124, 'epoch': 0.14}



                                                                       
 14%|█▍        | 282/2041 [24:30<2:28:51,  5.08s/it]

{'eps': 0, 'objective/kl': 44.476375579833984, 'objective/entropy': 15.008377075195312, 'objective/non_score_reward': -2.223818778991699, 'objective/rlhf_reward': 0.7362620830535889, 'objective/scores': 2.960080862045288, 'policy/approxkl_avg': 0.011253487318754196, 'policy/clipfrac_avg': 0.07075472176074982, 'loss/policy_avg': -0.005686872638761997, 'loss/value_avg': 0.03915712982416153, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.22819775342941284, 'val/ratio': 0.9885903000831604, 'val/ratio_var': 6.304404814727604e-05, 'val/num_eos_tokens': 0, 'lr': 4.311611954924057e-05, 'episode': 1128, 'epoch': 0.14}



                                                                       
 14%|█▍        | 283/2041 [24:35<2:29:14,  5.09s/it]

{'eps': 0, 'objective/kl': 36.5306396484375, 'objective/entropy': 9.546808242797852, 'objective/non_score_reward': -1.8265321254730225, 'objective/rlhf_reward': 0.9970934391021729, 'objective/scores': 2.8236255645751953, 'policy/approxkl_avg': 0.0018826406449079514, 'policy/clipfrac_avg': 0.017688680440187454, 'loss/policy_avg': -0.00676765339449048, 'loss/value_avg': 0.033187974244356155, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.25781020522117615, 'val/ratio': 0.9985402822494507, 'val/ratio_var': 6.316995040833717e-06, 'val/num_eos_tokens': 0, 'lr': 4.3091621754042136e-05, 'episode': 1132, 'epoch': 0.14}



                                                                       
 14%|█▍        | 284/2041 [24:40<2:29:29,  5.11s/it]

{'eps': 0, 'objective/kl': 68.00672912597656, 'objective/entropy': 31.482582092285156, 'objective/non_score_reward': -3.400336503982544, 'objective/rlhf_reward': -0.6662485599517822, 'objective/scores': 2.7340879440307617, 'policy/approxkl_avg': 0.02195386029779911, 'policy/clipfrac_avg': 0.10731132328510284, 'loss/policy_avg': -0.02956969477236271, 'loss/value_avg': 0.8101913928985596, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.4710853397846222, 'val/ratio': 1.0625882148742676, 'val/ratio_var': 0.003062875708565116, 'val/num_eos_tokens': 0, 'lr': 4.306712395884371e-05, 'episode': 1136, 'epoch': 0.14}



                                                                       
 14%|█▍        | 285/2041 [24:45<2:29:45,  5.12s/it]

{'eps': 0, 'objective/kl': 36.65635681152344, 'objective/entropy': 14.462276458740234, 'objective/non_score_reward': -1.8328180313110352, 'objective/rlhf_reward': 0.8962337970733643, 'objective/scores': 2.7290518283843994, 'policy/approxkl_avg': 0.015406467020511627, 'policy/clipfrac_avg': 0.07547169923782349, 'loss/policy_avg': -0.007210546638816595, 'loss/value_avg': 0.03229133039712906, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.37396201491355896, 'val/ratio': 0.9989486932754517, 'val/ratio_var': 1.4112088138062973e-05, 'val/num_eos_tokens': 0, 'lr': 4.304262616364527e-05, 'episode': 1140, 'epoch': 0.14}



                                                                       
 14%|█▍        | 286/2041 [24:50<2:29:51,  5.12s/it]

{'eps': 0, 'objective/kl': 59.47235107421875, 'objective/entropy': 22.008840560913086, 'objective/non_score_reward': -2.9736175537109375, 'objective/rlhf_reward': -1.9474670886993408, 'objective/scores': 1.0261504650115967, 'policy/approxkl_avg': 0.014916314743459225, 'policy/clipfrac_avg': 0.06367924064397812, 'loss/policy_avg': -0.020319657400250435, 'loss/value_avg': 0.9405509829521179, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.4166841506958008, 'val/ratio': 1.0166820287704468, 'val/ratio_var': 0.00025788144557736814, 'val/num_eos_tokens': 0, 'lr': 4.301812836844684e-05, 'episode': 1144, 'epoch': 0.14}



                                                                       
 14%|█▍        | 287/2041 [24:55<2:30:13,  5.14s/it]

{'eps': 0, 'objective/kl': 47.273155212402344, 'objective/entropy': 21.68271827697754, 'objective/non_score_reward': -2.3636577129364014, 'objective/rlhf_reward': -0.5323206186294556, 'objective/scores': 1.8313370943069458, 'policy/approxkl_avg': 0.042850311845541, 'policy/clipfrac_avg': 0.08136793226003647, 'loss/policy_avg': -0.017580948770046234, 'loss/value_avg': 0.2698834538459778, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.3943749964237213, 'val/ratio': 1.272516131401062, 'val/ratio_var': 0.1341036856174469, 'val/num_eos_tokens': 0, 'lr': 4.299363057324841e-05, 'episode': 1148, 'epoch': 0.14}



                                                                       
 14%|█▍        | 288/2041 [25:01<2:30:44,  5.16s/it]

{'eps': 0, 'objective/kl': 41.796295166015625, 'objective/entropy': 12.942220687866211, 'objective/non_score_reward': -2.0898146629333496, 'objective/rlhf_reward': -0.8668415546417236, 'objective/scores': 1.222973108291626, 'policy/approxkl_avg': 0.004554376471787691, 'policy/clipfrac_avg': 0.02004716917872429, 'loss/policy_avg': -0.009460985660552979, 'loss/value_avg': 0.28197839856147766, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.280005544424057, 'val/ratio': 1.0092192888259888, 'val/ratio_var': 4.7262379666790366e-05, 'val/num_eos_tokens': 0, 'lr': 4.296913277804998e-05, 'episode': 1152, 'epoch': 0.14}



                                                                       
 14%|█▍        | 289/2041 [25:06<2:31:23,  5.18s/it]

{'eps': 0, 'objective/kl': 61.08294677734375, 'objective/entropy': 21.38732147216797, 'objective/non_score_reward': -3.054147243499756, 'objective/rlhf_reward': -2.2385241985321045, 'objective/scores': 0.8156230449676514, 'policy/approxkl_avg': 0.01901833713054657, 'policy/clipfrac_avg': 0.07075472176074982, 'loss/policy_avg': -0.019612882286310196, 'loss/value_avg': 0.6940014362335205, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.30398714542388916, 'val/ratio': 1.0218851566314697, 'val/ratio_var': 0.0005167932831682265, 'val/num_eos_tokens': 0, 'lr': 4.2944634982851545e-05, 'episode': 1156, 'epoch': 0.14}



                                                                       
 14%|█▍        | 290/2041 [25:11<2:30:57,  5.17s/it]

{'eps': 0, 'objective/kl': 38.995933532714844, 'objective/entropy': 10.69847297668457, 'objective/non_score_reward': -1.9497967958450317, 'objective/rlhf_reward': -0.8341799974441528, 'objective/scores': 1.115616798400879, 'policy/approxkl_avg': 0.005013291724026203, 'policy/clipfrac_avg': 0.03537736088037491, 'loss/policy_avg': -0.0054491725750267506, 'loss/value_avg': 0.09717927128076553, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.19707031548023224, 'val/ratio': 0.9933270812034607, 'val/ratio_var': 4.8478483222424984e-05, 'val/num_eos_tokens': 0, 'lr': 4.292013718765311e-05, 'episode': 1160, 'epoch': 0.14}



                                                                       
 14%|█▍        | 291/2041 [25:16<2:30:13,  5.15s/it]

{'eps': 0, 'objective/kl': 41.92096710205078, 'objective/entropy': 13.154680252075195, 'objective/non_score_reward': -2.096048355102539, 'objective/rlhf_reward': -0.7928440570831299, 'objective/scores': 1.3032042980194092, 'policy/approxkl_avg': 0.008706886321306229, 'policy/clipfrac_avg': 0.060141511261463165, 'loss/policy_avg': -0.010141083039343357, 'loss/value_avg': 0.21155229210853577, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1863556206226349, 'val/ratio': 1.011559247970581, 'val/ratio_var': 8.748486288823187e-05, 'val/num_eos_tokens': 0, 'lr': 4.289563939245468e-05, 'episode': 1164, 'epoch': 0.14}



                                                                       
 14%|█▍        | 292/2041 [25:21<2:29:45,  5.14s/it]

{'eps': 0, 'objective/kl': 39.762168884277344, 'objective/entropy': 11.936635971069336, 'objective/non_score_reward': -1.9881083965301514, 'objective/rlhf_reward': -0.2968931198120117, 'objective/scores': 1.6912152767181396, 'policy/approxkl_avg': 0.007071644067764282, 'policy/clipfrac_avg': 0.048349056392908096, 'loss/policy_avg': -0.002601112937554717, 'loss/value_avg': 0.04347088187932968, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.20601123571395874, 'val/ratio': 0.9971605539321899, 'val/ratio_var': 5.1265810725453775e-06, 'val/num_eos_tokens': 0, 'lr': 4.287114159725625e-05, 'episode': 1168, 'epoch': 0.14}



                                                                       
 14%|█▍        | 293/2041 [25:26<2:29:48,  5.14s/it]

{'eps': 0, 'objective/kl': 34.67437744140625, 'objective/entropy': 0.6082649230957031, 'objective/non_score_reward': -1.733718991279602, 'objective/rlhf_reward': -0.05959200859069824, 'objective/scores': 1.6741269826889038, 'policy/approxkl_avg': 3.048631242563715e-06, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 0.00027943222085013986, 'loss/value_avg': 0.010269470512866974, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.07850369065999985, 'val/ratio': 0.9981381893157959, 'val/ratio_var': 2.341887238799245e-06, 'val/num_eos_tokens': 0, 'lr': 4.284664380205782e-05, 'episode': 1172, 'epoch': 0.14}



                                                                       
 14%|█▍        | 294/2041 [25:31<2:30:13,  5.16s/it]

{'eps': 0, 'objective/kl': 34.39942169189453, 'objective/entropy': 4.0982513427734375, 'objective/non_score_reward': -1.719970941543579, 'objective/rlhf_reward': 0.0203549861907959, 'objective/scores': 1.740325927734375, 'policy/approxkl_avg': 7.582156831631437e-05, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.0009613942238502204, 'loss/value_avg': 0.01943351700901985, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1479179561138153, 'val/ratio': 0.9981476664543152, 'val/ratio_var': 2.4126713924488286e-06, 'val/num_eos_tokens': 0, 'lr': 4.2822146006859385e-05, 'episode': 1176, 'epoch': 0.14}



                                                                       
 14%|█▍        | 295/2041 [25:37<2:30:09,  5.16s/it]

{'eps': 0, 'objective/kl': 38.45170974731445, 'objective/entropy': 8.355085372924805, 'objective/non_score_reward': -1.9225856065750122, 'objective/rlhf_reward': -0.409443736076355, 'objective/scores': 1.5131418704986572, 'policy/approxkl_avg': 0.00180785171687603, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.007539994083344936, 'loss/value_avg': 0.040088869631290436, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.23106509447097778, 'val/ratio': 0.994905948638916, 'val/ratio_var': 2.8571004804689437e-05, 'val/num_eos_tokens': 0, 'lr': 4.279764821166095e-05, 'episode': 1180, 'epoch': 0.14}



                                                                       
 15%|█▍        | 296/2041 [25:42<2:30:14,  5.17s/it]

{'eps': 0, 'objective/kl': 52.90467071533203, 'objective/entropy': 18.180248260498047, 'objective/non_score_reward': -2.645233392715454, 'objective/rlhf_reward': -1.3324036598205566, 'objective/scores': 1.3128297328948975, 'policy/approxkl_avg': 0.012539840303361416, 'policy/clipfrac_avg': 0.06721697747707367, 'loss/policy_avg': -0.0259842649102211, 'loss/value_avg': 0.21756581962108612, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.3095957040786743, 'val/ratio': 1.0340712070465088, 'val/ratio_var': 0.0008859217632561922, 'val/num_eos_tokens': 0, 'lr': 4.277315041646252e-05, 'episode': 1184, 'epoch': 0.15}



                                                                       
 15%|█▍        | 297/2041 [25:47<2:29:59,  5.16s/it]

{'eps': 0, 'objective/kl': 69.01506805419922, 'objective/entropy': 18.96181869506836, 'objective/non_score_reward': -3.4507532119750977, 'objective/rlhf_reward': -2.812492847442627, 'objective/scores': 0.6382602453231812, 'policy/approxkl_avg': 0.26951247453689575, 'policy/clipfrac_avg': 0.06485848873853683, 'loss/policy_avg': -0.02048271894454956, 'loss/value_avg': 0.5100305080413818, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.27130353450775146, 'val/ratio': 0.9957836866378784, 'val/ratio_var': 8.098528269329108e-06, 'val/num_eos_tokens': 0, 'lr': 4.274865262126409e-05, 'episode': 1188, 'epoch': 0.15}



                                                                       
 15%|█▍        | 298/2041 [25:52<2:30:10,  5.17s/it]

{'eps': 0, 'objective/kl': 33.433677673339844, 'objective/entropy': 6.426629543304443, 'objective/non_score_reward': -1.6716837882995605, 'objective/rlhf_reward': -0.41286492347717285, 'objective/scores': 1.2588188648223877, 'policy/approxkl_avg': 0.0020396802574396133, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.009686043485999107, 'loss/value_avg': 0.04549945145845413, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.17485500872135162, 'val/ratio': 1.0009644031524658, 'val/ratio_var': 1.871368567663012e-06, 'val/num_eos_tokens': 0, 'lr': 4.272415482606566e-05, 'episode': 1192, 'epoch': 0.15}



                                                                       
 15%|█▍        | 299/2041 [25:57<2:30:16,  5.18s/it]

{'eps': 0, 'objective/kl': 67.68628692626953, 'objective/entropy': 17.490699768066406, 'objective/non_score_reward': -3.384314775466919, 'objective/rlhf_reward': -2.307544231414795, 'objective/scores': 1.0767706632614136, 'policy/approxkl_avg': 0.04736882820725441, 'policy/clipfrac_avg': 0.0625, 'loss/policy_avg': -0.02081368863582611, 'loss/value_avg': 0.3047221899032593, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2544213533401489, 'val/ratio': 0.9787589311599731, 'val/ratio_var': 0.00032356841256842017, 'val/num_eos_tokens': 0, 'lr': 4.2699657030867225e-05, 'episode': 1196, 'epoch': 0.15}



                                                                       
 15%|█▍        | 300/2041 [26:02<2:29:21,  5.15s/it]

{'eps': 0, 'objective/kl': 33.0701789855957, 'objective/entropy': 0.7250890731811523, 'objective/non_score_reward': -1.6535091400146484, 'objective/rlhf_reward': -0.26719915866851807, 'objective/scores': 1.3863099813461304, 'policy/approxkl_avg': 0.005006544757634401, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.012782635167241096, 'loss/value_avg': 0.012397296726703644, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.07815158367156982, 'val/ratio': 0.9951940178871155, 'val/ratio_var': 2.1963429389870726e-05, 'val/num_eos_tokens': 0, 'lr': 4.267515923566879e-05, 'episode': 1200, 'epoch': 0.15}



                                                                       
 15%|█▍        | 301/2041 [26:08<2:29:24,  5.15s/it]

{'eps': 0, 'objective/kl': 64.20407104492188, 'objective/entropy': 4.780108451843262, 'objective/non_score_reward': -3.2102036476135254, 'objective/rlhf_reward': -3.7175469398498535, 'objective/scores': -0.5073432326316833, 'policy/approxkl_avg': 0.00884866900742054, 'policy/clipfrac_avg': 0.029481131583452225, 'loss/policy_avg': -0.014218557626008987, 'loss/value_avg': 0.5433911085128784, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1387024223804474, 'val/ratio': 0.9893983602523804, 'val/ratio_var': 7.585391722386703e-05, 'val/num_eos_tokens': 0, 'lr': 4.2650661440470355e-05, 'episode': 1204, 'epoch': 0.15}



                                                                       
 15%|█▍        | 302/2041 [26:13<2:29:04,  5.14s/it]

{'eps': 0, 'objective/kl': 58.94358825683594, 'objective/entropy': 8.397037506103516, 'objective/non_score_reward': -2.9471793174743652, 'objective/rlhf_reward': -2.3186585903167725, 'objective/scores': 0.628520667552948, 'policy/approxkl_avg': 0.016880929470062256, 'policy/clipfrac_avg': 0.02712264098227024, 'loss/policy_avg': -0.014992574229836464, 'loss/value_avg': 0.23876148462295532, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1135312020778656, 'val/ratio': 0.9868004322052002, 'val/ratio_var': 9.637610492063686e-05, 'val/num_eos_tokens': 0, 'lr': 4.262616364527193e-05, 'episode': 1208, 'epoch': 0.15}



                                                                       
 15%|█▍        | 303/2041 [26:18<2:28:15,  5.12s/it]

{'eps': 0, 'objective/kl': 63.34092712402344, 'objective/entropy': 6.13202428817749, 'objective/non_score_reward': -3.167046308517456, 'objective/rlhf_reward': -2.037616491317749, 'objective/scores': 1.129429817199707, 'policy/approxkl_avg': 0.024144120514392853, 'policy/clipfrac_avg': 0.016509434208273888, 'loss/policy_avg': -0.005912770517170429, 'loss/value_avg': 0.05387771129608154, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.07832714915275574, 'val/ratio': 0.9914715886116028, 'val/ratio_var': 3.832394941127859e-05, 'val/num_eos_tokens': 0, 'lr': 4.26016658500735e-05, 'episode': 1212, 'epoch': 0.15}



                                                                       
 15%|█▍        | 304/2041 [26:23<2:28:55,  5.14s/it]

{'eps': 0, 'objective/kl': 93.97614288330078, 'objective/entropy': 5.612311363220215, 'objective/non_score_reward': -4.6988067626953125, 'objective/rlhf_reward': -4.04927921295166, 'objective/scores': 0.6495274901390076, 'policy/approxkl_avg': 0.11836802959442139, 'policy/clipfrac_avg': 0.028301887214183807, 'loss/policy_avg': -0.01407760102301836, 'loss/value_avg': 0.3121556043624878, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.05738992616534233, 'val/ratio': 0.9743736982345581, 'val/ratio_var': 0.00034339886042289436, 'val/num_eos_tokens': 0, 'lr': 4.257716805487506e-05, 'episode': 1216, 'epoch': 0.15}



                                                                       
 15%|█▍        | 305/2041 [26:28<2:28:53,  5.15s/it]

{'eps': 0, 'objective/kl': 58.39448547363281, 'objective/entropy': 0.16133737564086914, 'objective/non_score_reward': -2.919724464416504, 'objective/rlhf_reward': -2.066039562225342, 'objective/scores': 0.8536849021911621, 'policy/approxkl_avg': 3.874674439430237e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 0.0009281545062549412, 'loss/value_avg': 0.05580047518014908, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.01302194595336914, 'val/ratio': 1.0010199546813965, 'val/ratio_var': 5.597993890660291e-07, 'val/num_eos_tokens': 0, 'lr': 4.2552670259676633e-05, 'episode': 1220, 'epoch': 0.15}



                                                                       
 15%|█▍        | 306/2041 [26:33<2:29:04,  5.16s/it]

{'eps': 0, 'objective/kl': 61.58646011352539, 'objective/entropy': 1.0973668098449707, 'objective/non_score_reward': -3.0793228149414062, 'objective/rlhf_reward': -2.2343459129333496, 'objective/scores': 0.8449769020080566, 'policy/approxkl_avg': 0.0018881272990256548, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.0006778774550184608, 'loss/value_avg': 0.07545880228281021, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.009269941598176956, 'val/ratio': 0.9980686902999878, 'val/ratio_var': 1.995240154428757e-06, 'val/num_eos_tokens': 0, 'lr': 4.25281724644782e-05, 'episode': 1224, 'epoch': 0.15}



                                                                       
 15%|█▌        | 307/2041 [26:38<2:28:52,  5.15s/it]

{'eps': 0, 'objective/kl': 59.34607696533203, 'objective/entropy': 0.05403900146484375, 'objective/non_score_reward': -2.96730375289917, 'objective/rlhf_reward': -2.208625078201294, 'objective/scores': 0.758678674697876, 'policy/approxkl_avg': 5.833551952605376e-08, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 4.062112930114381e-05, 'loss/value_avg': 0.05494894087314606, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.008173614740371704, 'val/ratio': 1.0000602006912231, 'val/ratio_var': 2.2876871508259455e-09, 'val/num_eos_tokens': 0, 'lr': 4.250367466927977e-05, 'episode': 1228, 'epoch': 0.15}



                                                                       
 15%|█▌        | 308/2041 [26:44<2:29:17,  5.17s/it]

{'eps': 0, 'objective/kl': 60.6959342956543, 'objective/entropy': 0.04495382308959961, 'objective/non_score_reward': -3.034796714782715, 'objective/rlhf_reward': -2.269371509552002, 'objective/scores': 0.7654250860214233, 'policy/approxkl_avg': 2.4625470551598028e-09, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 4.160516255069524e-06, 'loss/value_avg': 0.0659511387348175, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.007162780500948429, 'val/ratio': 1.0000238418579102, 'val/ratio_var': 3.7880690251235194e-10, 'val/num_eos_tokens': 0, 'lr': 4.247917687408133e-05, 'episode': 1232, 'epoch': 0.15}



                                                                       
 15%|█▌        | 309/2041 [26:49<2:28:54,  5.16s/it]

{'eps': 0, 'objective/kl': 60.10365295410156, 'objective/entropy': 0.037456512451171875, 'objective/non_score_reward': -3.0051827430725098, 'objective/rlhf_reward': -2.3580493927001953, 'objective/scores': 0.6471332907676697, 'policy/approxkl_avg': 3.93776899976217e-10, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -7.9499104685965e-06, 'loss/value_avg': 0.055431053042411804, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.006138891912996769, 'val/ratio': 1.000010371208191, 'val/ratio_var': 7.247535904753022e-11, 'val/num_eos_tokens': 0, 'lr': 4.2454679078882906e-05, 'episode': 1236, 'epoch': 0.15}



                                                                       
 15%|█▌        | 310/2041 [26:54<2:28:31,  5.15s/it]

{'eps': 0, 'objective/kl': 59.96915054321289, 'objective/entropy': 1.8957276344299316, 'objective/non_score_reward': -2.998457431793213, 'objective/rlhf_reward': -2.2615909576416016, 'objective/scores': 0.7368664741516113, 'policy/approxkl_avg': 0.0001010213018162176, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.0007377789588645101, 'loss/value_avg': 0.06353496015071869, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.009125221520662308, 'val/ratio': 1.0007346868515015, 'val/ratio_var': 5.659055091200571e-07, 'val/num_eos_tokens': 0, 'lr': 4.2430181283684474e-05, 'episode': 1240, 'epoch': 0.15}



                                                                       
 15%|█▌        | 311/2041 [26:59<2:29:13,  5.18s/it]

{'eps': 0, 'objective/kl': 60.018463134765625, 'objective/entropy': 0.05755805969238281, 'objective/non_score_reward': -3.0009231567382812, 'objective/rlhf_reward': -2.3858392238616943, 'objective/scores': 0.6150838732719421, 'policy/approxkl_avg': 1.7541310626256745e-08, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -6.156709787319414e-06, 'loss/value_avg': 0.04908691346645355, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.009936697781085968, 'val/ratio': 0.9998769760131836, 'val/ratio_var': 1.1011514366998654e-08, 'val/num_eos_tokens': 0, 'lr': 4.2405683488486035e-05, 'episode': 1244, 'epoch': 0.15}



                                                                       
 15%|█▌        | 312/2041 [27:04<2:29:13,  5.18s/it]

{'eps': 0, 'objective/kl': 61.79713439941406, 'objective/entropy': 0.07014656066894531, 'objective/non_score_reward': -3.0898566246032715, 'objective/rlhf_reward': -2.3951480388641357, 'objective/scores': 0.6947085857391357, 'policy/approxkl_avg': 1.9111915605662944e-08, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -3.548894528648816e-05, 'loss/value_avg': 0.06163499504327774, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.01146049052476883, 'val/ratio': 0.9998977184295654, 'val/ratio_var': 7.585112093977386e-09, 'val/num_eos_tokens': 0, 'lr': 4.23811856932876e-05, 'episode': 1248, 'epoch': 0.15}



                                                                       
 15%|█▌        | 313/2041 [27:09<2:28:10,  5.14s/it]

{'eps': 0, 'objective/kl': 59.69579315185547, 'objective/entropy': 0.0881047248840332, 'objective/non_score_reward': -2.9847896099090576, 'objective/rlhf_reward': -2.420290946960449, 'objective/scores': 0.5644986629486084, 'policy/approxkl_avg': 4.359937477715903e-08, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -7.865001680329442e-05, 'loss/value_avg': 0.04474804550409317, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.013701597228646278, 'val/ratio': 0.9999061822891235, 'val/ratio_var': 6.5269500915121625e-09, 'val/num_eos_tokens': 0, 'lr': 4.235668789808918e-05, 'episode': 1252, 'epoch': 0.15}



                                                                       
 15%|█▌        | 314/2041 [27:15<2:27:45,  5.13s/it]

{'eps': 0, 'objective/kl': 60.12400436401367, 'objective/entropy': 0.11438703536987305, 'objective/non_score_reward': -3.0062003135681152, 'objective/rlhf_reward': -2.3414759635925293, 'objective/scores': 0.6647243499755859, 'policy/approxkl_avg': 4.4382639430295967e-07, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.00019619542581494898, 'loss/value_avg': 0.045422956347465515, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.017243362963199615, 'val/ratio': 0.9998422861099243, 'val/ratio_var': 2.0963923930139572e-08, 'val/num_eos_tokens': 0, 'lr': 4.233219010289074e-05, 'episode': 1256, 'epoch': 0.15}



                                                                       
 15%|█▌        | 315/2041 [27:20<2:27:50,  5.14s/it]

{'eps': 0, 'objective/kl': 60.221824645996094, 'objective/entropy': 0.1321086883544922, 'objective/non_score_reward': -3.0110912322998047, 'objective/rlhf_reward': -2.3976383209228516, 'objective/scores': 0.6134527921676636, 'policy/approxkl_avg': 8.6540412667091e-06, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0012835359666496515, 'loss/value_avg': 0.05004328489303589, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.02019568718969822, 'val/ratio': 0.9994887113571167, 'val/ratio_var': 2.900702895658469e-07, 'val/num_eos_tokens': 0, 'lr': 4.230769230769231e-05, 'episode': 1260, 'epoch': 0.15}



                                                                       
 15%|█▌        | 316/2041 [27:25<2:27:23,  5.13s/it]

{'eps': 0, 'objective/kl': 71.4256362915039, 'objective/entropy': 4.8273115158081055, 'objective/non_score_reward': -3.571281909942627, 'objective/rlhf_reward': -3.0062367916107178, 'objective/scores': 0.5650451183319092, 'policy/approxkl_avg': 0.10239063948392868, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.014979925006628036, 'loss/value_avg': 0.14244717359542847, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.03793063759803772, 'val/ratio': 0.9868646860122681, 'val/ratio_var': 0.00013088349078316242, 'val/num_eos_tokens': 0, 'lr': 4.228319451249388e-05, 'episode': 1264, 'epoch': 0.15}



                                                                       
 16%|█▌        | 317/2041 [27:30<2:27:46,  5.14s/it]

{'eps': 0, 'objective/kl': 33.257118225097656, 'objective/entropy': 1.2389025688171387, 'objective/non_score_reward': -1.6628559827804565, 'objective/rlhf_reward': -0.7144662737846375, 'objective/scores': 0.9483897089958191, 'policy/approxkl_avg': 0.00047063559759408236, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.001307032653130591, 'loss/value_avg': 0.008279955014586449, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.028723450377583504, 'val/ratio': 1.0024664402008057, 'val/ratio_var': 4.533933861239348e-06, 'val/num_eos_tokens': 0, 'lr': 4.225869671729544e-05, 'episode': 1268, 'epoch': 0.16}



                                                                       
 16%|█▌        | 318/2041 [27:35<2:28:46,  5.18s/it]

{'eps': 0, 'objective/kl': 34.62370300292969, 'objective/entropy': 2.772519588470459, 'objective/non_score_reward': -1.7311854362487793, 'objective/rlhf_reward': -0.8979460597038269, 'objective/scores': 0.8332393765449524, 'policy/approxkl_avg': 0.0011035356437787414, 'policy/clipfrac_avg': 0.008254717104136944, 'loss/policy_avg': -0.0036386377178132534, 'loss/value_avg': 0.012582232244312763, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.03758285939693451, 'val/ratio': 1.0011299848556519, 'val/ratio_var': 1.376112663820095e-06, 'val/num_eos_tokens': 0, 'lr': 4.223419892209701e-05, 'episode': 1272, 'epoch': 0.16}



                                                                       
 16%|█▌        | 319/2041 [27:40<2:28:03,  5.16s/it]

{'eps': 0, 'objective/kl': 32.8104133605957, 'objective/entropy': 2.7634530067443848, 'objective/non_score_reward': -1.6405205726623535, 'objective/rlhf_reward': -0.6406029462814331, 'objective/scores': 0.9999176263809204, 'policy/approxkl_avg': 0.002834264189004898, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.005704023409634829, 'loss/value_avg': 0.01074089016765356, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.08244321495294571, 'val/ratio': 0.9949514865875244, 'val/ratio_var': 1.6081043213489465e-05, 'val/num_eos_tokens': 0, 'lr': 4.220970112689858e-05, 'episode': 1276, 'epoch': 0.16}



                                                                       
 16%|█▌        | 320/2041 [27:45<2:27:41,  5.15s/it]

{'eps': 0, 'objective/kl': 34.75116729736328, 'objective/entropy': 4.265421390533447, 'objective/non_score_reward': -1.737558364868164, 'objective/rlhf_reward': -0.7269365787506104, 'objective/scores': 1.0106217861175537, 'policy/approxkl_avg': 0.009314474649727345, 'policy/clipfrac_avg': 0.02004716917872429, 'loss/policy_avg': -0.007993909530341625, 'loss/value_avg': 0.014929486438632011, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14149153232574463, 'val/ratio': 0.9936212301254272, 'val/ratio_var': 2.624473381729331e-05, 'val/num_eos_tokens': 0, 'lr': 4.2185203331700154e-05, 'episode': 1280, 'epoch': 0.16}



                                                                       
 16%|█▌        | 321/2041 [27:51<2:27:24,  5.14s/it]

{'eps': 0, 'objective/kl': 58.05783462524414, 'objective/entropy': 15.278493881225586, 'objective/non_score_reward': -2.9028921127319336, 'objective/rlhf_reward': -2.0429201126098633, 'objective/scores': 0.8599720597267151, 'policy/approxkl_avg': 4.3340163230896, 'policy/clipfrac_avg': 0.08844339847564697, 'loss/policy_avg': 0.0939026027917862, 'loss/value_avg': 0.2737402319908142, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6123050451278687, 'val/ratio': 0.9596096277236938, 'val/ratio_var': 0.002759300172328949, 'val/num_eos_tokens': 0, 'lr': 4.2160705536501715e-05, 'episode': 1284, 'epoch': 0.16}



                                                                       
 16%|█▌        | 322/2041 [27:56<2:27:41,  5.16s/it]

{'eps': 0, 'objective/kl': 44.68977737426758, 'objective/entropy': 13.605148315429688, 'objective/non_score_reward': -2.2344889640808105, 'objective/rlhf_reward': -1.7903075218200684, 'objective/scores': 0.4441814422607422, 'policy/approxkl_avg': 0.7116634845733643, 'policy/clipfrac_avg': 0.033018868416547775, 'loss/policy_avg': -0.007600352633744478, 'loss/value_avg': 0.2660443186759949, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1992129236459732, 'val/ratio': 0.9979385137557983, 'val/ratio_var': 6.086382654757472e-06, 'val/num_eos_tokens': 0, 'lr': 4.2136207741303283e-05, 'episode': 1288, 'epoch': 0.16}



                                                                       
 16%|█▌        | 323/2041 [28:01<2:26:26,  5.11s/it]

{'eps': 0, 'objective/kl': 34.03144836425781, 'objective/entropy': 0.7486257553100586, 'objective/non_score_reward': -1.7015724182128906, 'objective/rlhf_reward': -0.8419559001922607, 'objective/scores': 0.8596165180206299, 'policy/approxkl_avg': 0.01091733667999506, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': 0.0015416486421599984, 'loss/value_avg': 0.00834318995475769, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.07033628225326538, 'val/ratio': 0.989811897277832, 'val/ratio_var': 0.00027170893736183643, 'val/num_eos_tokens': 0, 'lr': 4.211170994610485e-05, 'episode': 1292, 'epoch': 0.16}



                                                                       
 16%|█▌        | 324/2041 [28:06<2:26:33,  5.12s/it]

{'eps': 0, 'objective/kl': 35.47069549560547, 'objective/entropy': 3.024383544921875, 'objective/non_score_reward': -1.7735345363616943, 'objective/rlhf_reward': -1.0589205026626587, 'objective/scores': 0.7146140336990356, 'policy/approxkl_avg': 0.004125835373997688, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.003800172358751297, 'loss/value_avg': 0.011452527716755867, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.044405028223991394, 'val/ratio': 0.9974108934402466, 'val/ratio_var': 3.3789699500630377e-06, 'val/num_eos_tokens': 0, 'lr': 4.208721215090642e-05, 'episode': 1296, 'epoch': 0.16}



                                                                       
 16%|█▌        | 325/2041 [28:11<2:27:00,  5.14s/it]

{'eps': 0, 'objective/kl': 34.241615295410156, 'objective/entropy': 0.3072471618652344, 'objective/non_score_reward': -1.712080717086792, 'objective/rlhf_reward': -0.9259711503982544, 'objective/scores': 0.7861095666885376, 'policy/approxkl_avg': 4.772301167577098e-07, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 4.481032010517083e-05, 'loss/value_avg': 0.014551290310919285, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.032660167664289474, 'val/ratio': 1.000619649887085, 'val/ratio_var': 2.521468616123457e-07, 'val/num_eos_tokens': 0, 'lr': 4.206271435570799e-05, 'episode': 1300, 'epoch': 0.16}



                                                                       
 16%|█▌        | 326/2041 [28:16<2:26:26,  5.12s/it]

{'eps': 0, 'objective/kl': 33.44978332519531, 'objective/entropy': 0.1983957290649414, 'objective/non_score_reward': -1.6724891662597656, 'objective/rlhf_reward': -0.9177285432815552, 'objective/scores': 0.7547606229782104, 'policy/approxkl_avg': 1.8574559135231539e-07, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -5.055429937783629e-05, 'loss/value_avg': 0.009932214394211769, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.02307376265525818, 'val/ratio': 1.0002245903015137, 'val/ratio_var': 3.213847676875048e-08, 'val/num_eos_tokens': 0, 'lr': 4.2038216560509556e-05, 'episode': 1304, 'epoch': 0.16}



                                                                       
 16%|█▌        | 327/2041 [28:21<2:26:32,  5.13s/it]

{'eps': 0, 'objective/kl': 34.324241638183594, 'objective/entropy': 2.435837745666504, 'objective/non_score_reward': -1.716212272644043, 'objective/rlhf_reward': -1.0496158599853516, 'objective/scores': 0.6665964126586914, 'policy/approxkl_avg': 0.0007714095991104841, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.0032091757748275995, 'loss/value_avg': 0.015495412051677704, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.032681506127119064, 'val/ratio': 0.9990235567092896, 'val/ratio_var': 4.77539572329988e-07, 'val/num_eos_tokens': 0, 'lr': 4.2013718765311124e-05, 'episode': 1308, 'epoch': 0.16}



                                                                       
 16%|█▌        | 328/2041 [28:27<2:27:04,  5.15s/it]

{'eps': 0, 'objective/kl': 33.22630310058594, 'objective/entropy': 4.834573268890381, 'objective/non_score_reward': -1.6613152027130127, 'objective/rlhf_reward': -0.8828986287117004, 'objective/scores': 0.7784165740013123, 'policy/approxkl_avg': 0.0018889799248427153, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.004583587870001793, 'loss/value_avg': 0.011370792984962463, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09698960185050964, 'val/ratio': 0.9989591240882874, 'val/ratio_var': 1.1137906312796986e-06, 'val/num_eos_tokens': 0, 'lr': 4.198922097011269e-05, 'episode': 1312, 'epoch': 0.16}



                                                                       
 16%|█▌        | 329/2041 [28:32<2:27:36,  5.17s/it]

{'eps': 0, 'objective/kl': 32.47320556640625, 'objective/entropy': 1.7159004211425781, 'objective/non_score_reward': -1.6236604452133179, 'objective/rlhf_reward': -1.000739336013794, 'objective/scores': 0.6229211688041687, 'policy/approxkl_avg': 4.813497980649117e-06, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -9.457570558879524e-05, 'loss/value_avg': 0.005784435663372278, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.05978346988558769, 'val/ratio': 0.9981622695922852, 'val/ratio_var': 2.6375748802820453e-06, 'val/num_eos_tokens': 0, 'lr': 4.196472317491426e-05, 'episode': 1316, 'epoch': 0.16}



                                                                       
 16%|█▌        | 330/2041 [28:37<2:26:49,  5.15s/it]

{'eps': 0, 'objective/kl': 33.540008544921875, 'objective/entropy': 4.810724258422852, 'objective/non_score_reward': -1.6770005226135254, 'objective/rlhf_reward': -0.993449330329895, 'objective/scores': 0.6835511922836304, 'policy/approxkl_avg': 0.01747734658420086, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.004999066237360239, 'loss/value_avg': 0.022055260837078094, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.07376302033662796, 'val/ratio': 1.096534252166748, 'val/ratio_var': 0.015131697058677673, 'val/num_eos_tokens': 0, 'lr': 4.194022537971583e-05, 'episode': 1320, 'epoch': 0.16}



                                                                       
 16%|█▌        | 331/2041 [28:42<2:26:21,  5.14s/it]

{'eps': 0, 'objective/kl': 33.84593200683594, 'objective/entropy': 2.1464638710021973, 'objective/non_score_reward': -1.6922967433929443, 'objective/rlhf_reward': -0.9546544551849365, 'objective/scores': 0.7376422882080078, 'policy/approxkl_avg': 0.0001883799268398434, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.0004094119358342141, 'loss/value_avg': 0.009375211782753468, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09337100386619568, 'val/ratio': 0.9984116554260254, 'val/ratio_var': 1.772780819919717e-06, 'val/num_eos_tokens': 0, 'lr': 4.1915727584517396e-05, 'episode': 1324, 'epoch': 0.16}



                                                                       
 16%|█▋        | 332/2041 [28:47<2:26:08,  5.13s/it]

{'eps': 0, 'objective/kl': 34.79981231689453, 'objective/entropy': 3.1364684104919434, 'objective/non_score_reward': -1.7399909496307373, 'objective/rlhf_reward': -1.2906677722930908, 'objective/scores': 0.4493231177330017, 'policy/approxkl_avg': 0.0017611144576221704, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.006702758371829987, 'loss/value_avg': 0.032504551112651825, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.08261196315288544, 'val/ratio': 0.9984709024429321, 'val/ratio_var': 1.3312769624462817e-06, 'val/num_eos_tokens': 0, 'lr': 4.1891229789318964e-05, 'episode': 1328, 'epoch': 0.16}



                                                                       
 16%|█▋        | 333/2041 [28:52<2:25:32,  5.11s/it]

{'eps': 0, 'objective/kl': 34.68394470214844, 'objective/entropy': 3.3663439750671387, 'objective/non_score_reward': -1.7341973781585693, 'objective/rlhf_reward': -1.27305006980896, 'objective/scores': 0.4611473083496094, 'policy/approxkl_avg': 0.0018548858352005482, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.005371713545173407, 'loss/value_avg': 0.029704324901103973, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.04535277560353279, 'val/ratio': 0.9972752928733826, 'val/ratio_var': 4.5100705392542295e-06, 'val/num_eos_tokens': 0, 'lr': 4.186673199412053e-05, 'episode': 1332, 'epoch': 0.16}



                                                                       
 16%|█▋        | 334/2041 [28:58<2:27:38,  5.19s/it]

{'eps': 0, 'objective/kl': 35.52794647216797, 'objective/entropy': 4.243354320526123, 'objective/non_score_reward': -1.7763973474502563, 'objective/rlhf_reward': -0.7470502853393555, 'objective/scores': 1.0293470621109009, 'policy/approxkl_avg': 0.0020501657854765654, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.004116215743124485, 'loss/value_avg': 0.01649249717593193, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09200407564640045, 'val/ratio': 1.0023643970489502, 'val/ratio_var': 7.1594004111830145e-06, 'val/num_eos_tokens': 0, 'lr': 4.18422341989221e-05, 'episode': 1336, 'epoch': 0.16}



                                                                       
 16%|█▋        | 335/2041 [29:03<2:26:29,  5.15s/it]

{'eps': 0, 'objective/kl': 35.43596649169922, 'objective/entropy': 9.909704208374023, 'objective/non_score_reward': -1.7717981338500977, 'objective/rlhf_reward': -1.4322212934494019, 'objective/scores': 0.3395768105983734, 'policy/approxkl_avg': 0.006836642511188984, 'policy/clipfrac_avg': 0.029481131583452225, 'loss/policy_avg': -0.010434404015541077, 'loss/value_avg': 0.057578280568122864, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1244674026966095, 'val/ratio': 0.9894915819168091, 'val/ratio_var': 9.381560812471434e-05, 'val/num_eos_tokens': 0, 'lr': 4.181773640372367e-05, 'episode': 1340, 'epoch': 0.16}



                                                                       
 16%|█▋        | 336/2041 [29:08<2:26:13,  5.15s/it]

{'eps': 0, 'objective/kl': 34.348045349121094, 'objective/entropy': 1.8655614852905273, 'objective/non_score_reward': -1.7174021005630493, 'objective/rlhf_reward': -1.0522253513336182, 'objective/scores': 0.6651767492294312, 'policy/approxkl_avg': 0.0003399963607080281, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.004257215652614832, 'loss/value_avg': 0.011618164367973804, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09743865579366684, 'val/ratio': 0.9969773292541504, 'val/ratio_var': 8.932906894187909e-06, 'val/num_eos_tokens': 0, 'lr': 4.1793238608525236e-05, 'episode': 1344, 'epoch': 0.16}



                                                                       
 17%|█▋        | 337/2041 [29:13<2:25:44,  5.13s/it]

{'eps': 0, 'objective/kl': 36.152687072753906, 'objective/entropy': 6.921114444732666, 'objective/non_score_reward': -1.8076343536376953, 'objective/rlhf_reward': -1.0115529298782349, 'objective/scores': 0.7960814237594604, 'policy/approxkl_avg': 0.0013562876265496016, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.00430959602817893, 'loss/value_avg': 0.021704718470573425, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13419875502586365, 'val/ratio': 0.9975792169570923, 'val/ratio_var': 4.0972904571390245e-06, 'val/num_eos_tokens': 0, 'lr': 4.1768740813326804e-05, 'episode': 1348, 'epoch': 0.17}



                                                                       
 17%|█▋        | 338/2041 [29:18<2:25:15,  5.12s/it]

{'eps': 0, 'objective/kl': 36.33639907836914, 'objective/entropy': 8.94517993927002, 'objective/non_score_reward': -1.8168201446533203, 'objective/rlhf_reward': -0.9952579736709595, 'objective/scores': 0.8215621709823608, 'policy/approxkl_avg': 0.0017392787849530578, 'policy/clipfrac_avg': 0.017688680440187454, 'loss/policy_avg': -0.004526818636804819, 'loss/value_avg': 0.018544252961874008, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1410796046257019, 'val/ratio': 0.9981881380081177, 'val/ratio_var': 1.6719978930268553e-06, 'val/num_eos_tokens': 0, 'lr': 4.174424301812837e-05, 'episode': 1352, 'epoch': 0.17}



                                                                       
 17%|█▋        | 339/2041 [29:23<2:25:24,  5.13s/it]

{'eps': 0, 'objective/kl': 35.76621627807617, 'objective/entropy': 7.948739528656006, 'objective/non_score_reward': -1.7883108854293823, 'objective/rlhf_reward': -0.9488463401794434, 'objective/scores': 0.839464545249939, 'policy/approxkl_avg': 0.001423742389306426, 'policy/clipfrac_avg': 0.02004716917872429, 'loss/policy_avg': -0.0036428384482860565, 'loss/value_avg': 0.01869802549481392, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15365718305110931, 'val/ratio': 1.000891923904419, 'val/ratio_var': 9.611177347323974e-07, 'val/num_eos_tokens': 0, 'lr': 4.171974522292994e-05, 'episode': 1356, 'epoch': 0.17}



                                                                       
 17%|█▋        | 340/2041 [29:28<2:25:39,  5.14s/it]

{'eps': 0, 'objective/kl': 36.050926208496094, 'objective/entropy': 8.452766418457031, 'objective/non_score_reward': -1.802546501159668, 'objective/rlhf_reward': -0.8028144240379333, 'objective/scores': 0.9997320771217346, 'policy/approxkl_avg': 0.001968666911125183, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.009169417433440685, 'loss/value_avg': 0.020526789128780365, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1835351437330246, 'val/ratio': 0.9979755282402039, 'val/ratio_var': 5.315017915563658e-06, 'val/num_eos_tokens': 0, 'lr': 4.16952474277315e-05, 'episode': 1360, 'epoch': 0.17}



                                                                       
 17%|█▋        | 341/2041 [29:33<2:25:47,  5.15s/it]

{'eps': 0, 'objective/kl': 37.28620147705078, 'objective/entropy': 12.124799728393555, 'objective/non_score_reward': -1.8643100261688232, 'objective/rlhf_reward': -0.9853085279464722, 'objective/scores': 0.8790014982223511, 'policy/approxkl_avg': 0.0026162192225456238, 'policy/clipfrac_avg': 0.033018868416547775, 'loss/policy_avg': -0.0020128432661294937, 'loss/value_avg': 0.01882149465382099, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.22926083207130432, 'val/ratio': 1.0023646354675293, 'val/ratio_var': 3.1371612294606166e-06, 'val/num_eos_tokens': 0, 'lr': 4.1670749632533076e-05, 'episode': 1364, 'epoch': 0.17}



                                                                       
 17%|█▋        | 342/2041 [29:39<2:25:40,  5.14s/it]

{'eps': 0, 'objective/kl': 34.7302360534668, 'objective/entropy': 11.154878616333008, 'objective/non_score_reward': -1.7365118265151978, 'objective/rlhf_reward': -0.8359074592590332, 'objective/scores': 0.9006043672561646, 'policy/approxkl_avg': 0.0027224973309785128, 'policy/clipfrac_avg': 0.024764150381088257, 'loss/policy_avg': -0.004586347844451666, 'loss/value_avg': 0.01727735623717308, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.20918264985084534, 'val/ratio': 1.0051051378250122, 'val/ratio_var': 1.3388546904025134e-05, 'val/num_eos_tokens': 0, 'lr': 4.1646251837334644e-05, 'episode': 1368, 'epoch': 0.17}



                                                                       
 17%|█▋        | 343/2041 [29:44<2:25:47,  5.15s/it]

{'eps': 0, 'objective/kl': 33.999351501464844, 'objective/entropy': 10.214542388916016, 'objective/non_score_reward': -1.6999677419662476, 'objective/rlhf_reward': -0.8316011428833008, 'objective/scores': 0.8683665990829468, 'policy/approxkl_avg': 0.0008467635489068925, 'policy/clipfrac_avg': 0.008254717104136944, 'loss/policy_avg': -0.006416721269488335, 'loss/value_avg': 0.015760108828544617, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2184998095035553, 'val/ratio': 1.0006368160247803, 'val/ratio_var': 5.990991098769882e-07, 'val/num_eos_tokens': 0, 'lr': 4.1621754042136206e-05, 'episode': 1372, 'epoch': 0.17}



                                                                       
 17%|█▋        | 344/2041 [29:49<2:25:21,  5.14s/it]

{'eps': 0, 'objective/kl': 35.01873016357422, 'objective/entropy': 9.128864288330078, 'objective/non_score_reward': -1.750936508178711, 'objective/rlhf_reward': -0.8602055311203003, 'objective/scores': 0.8907309770584106, 'policy/approxkl_avg': 0.002029751194640994, 'policy/clipfrac_avg': 0.025943398475646973, 'loss/policy_avg': -0.004309894982725382, 'loss/value_avg': 0.018316371366381645, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2170732617378235, 'val/ratio': 1.0000510215759277, 'val/ratio_var': 2.878182669974194e-07, 'val/num_eos_tokens': 0, 'lr': 4.1597256246937774e-05, 'episode': 1376, 'epoch': 0.17}



                                                                       
 17%|█▋        | 345/2041 [29:54<2:24:42,  5.12s/it]

{'eps': 0, 'objective/kl': 37.563560485839844, 'objective/entropy': 12.699195861816406, 'objective/non_score_reward': -1.8781780004501343, 'objective/rlhf_reward': -0.9205222725868225, 'objective/scores': 0.9576557278633118, 'policy/approxkl_avg': 0.004081615246832371, 'policy/clipfrac_avg': 0.024764150381088257, 'loss/policy_avg': -0.00886157900094986, 'loss/value_avg': 0.022870786488056183, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.24906277656555176, 'val/ratio': 0.9964292645454407, 'val/ratio_var': 9.568568202666938e-06, 'val/num_eos_tokens': 0, 'lr': 4.157275845173935e-05, 'episode': 1380, 'epoch': 0.17}



                                                                       
 17%|█▋        | 346/2041 [29:59<2:24:38,  5.12s/it]

{'eps': 0, 'objective/kl': 38.50022506713867, 'objective/entropy': 12.004013061523438, 'objective/non_score_reward': -1.925011396408081, 'objective/rlhf_reward': -1.1110901832580566, 'objective/scores': 0.8139212131500244, 'policy/approxkl_avg': 0.011597341857850552, 'policy/clipfrac_avg': 0.04716981202363968, 'loss/policy_avg': -0.009390593506395817, 'loss/value_avg': 0.017100777477025986, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.21135090291500092, 'val/ratio': 0.9970227479934692, 'val/ratio_var': 1.125388007494621e-05, 'val/num_eos_tokens': 0, 'lr': 4.1548260656540916e-05, 'episode': 1384, 'epoch': 0.17}



                                                                       
 17%|█▋        | 347/2041 [30:04<2:24:32,  5.12s/it]

{'eps': 0, 'objective/kl': 44.006202697753906, 'objective/entropy': 10.913402557373047, 'objective/non_score_reward': -2.2003097534179688, 'objective/rlhf_reward': -1.3344085216522217, 'objective/scores': 0.8659012317657471, 'policy/approxkl_avg': 0.043088383972644806, 'policy/clipfrac_avg': 0.028301887214183807, 'loss/policy_avg': -0.009745488874614239, 'loss/value_avg': 0.038790263235569, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.21409918367862701, 'val/ratio': 0.9892623424530029, 'val/ratio_var': 6.038270294084214e-05, 'val/num_eos_tokens': 0, 'lr': 4.152376286134248e-05, 'episode': 1388, 'epoch': 0.17}



                                                                       
 17%|█▋        | 348/2041 [30:09<2:24:10,  5.11s/it]

{'eps': 0, 'objective/kl': 46.305419921875, 'objective/entropy': 11.135164260864258, 'objective/non_score_reward': -2.3152713775634766, 'objective/rlhf_reward': -1.4442315101623535, 'objective/scores': 0.8710398077964783, 'policy/approxkl_avg': 0.02796097658574581, 'policy/clipfrac_avg': 0.025943396613001823, 'loss/policy_avg': -0.010314205661416054, 'loss/value_avg': 0.055160705000162125, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.17882917821407318, 'val/ratio': 0.991152286529541, 'val/ratio_var': 4.01651268475689e-05, 'val/num_eos_tokens': 0, 'lr': 4.149926506614405e-05, 'episode': 1392, 'epoch': 0.17}



                                                                       
 17%|█▋        | 349/2041 [30:14<2:23:53,  5.10s/it]

{'eps': 0, 'objective/kl': 37.72270202636719, 'objective/entropy': 10.217945098876953, 'objective/non_score_reward': -1.886135220527649, 'objective/rlhf_reward': -1.010709524154663, 'objective/scores': 0.8754257559776306, 'policy/approxkl_avg': 0.0018789897440001369, 'policy/clipfrac_avg': 0.022405661642551422, 'loss/policy_avg': -0.0016184335108846426, 'loss/value_avg': 0.02879836969077587, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.17701902985572815, 'val/ratio': 1.005603313446045, 'val/ratio_var': 2.757507718342822e-05, 'val/num_eos_tokens': 0, 'lr': 4.147476727094562e-05, 'episode': 1396, 'epoch': 0.17}



                                                                       
 17%|█▋        | 350/2041 [30:19<2:24:04,  5.11s/it]

{'eps': 0, 'objective/kl': 34.04493713378906, 'objective/entropy': 10.232765197753906, 'objective/non_score_reward': -1.702246904373169, 'objective/rlhf_reward': -0.7645362615585327, 'objective/scores': 0.9377106428146362, 'policy/approxkl_avg': 0.002985060680657625, 'policy/clipfrac_avg': 0.021226415410637856, 'loss/policy_avg': -0.0031547907274216413, 'loss/value_avg': 0.017269060015678406, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.21202152967453003, 'val/ratio': 1.0022640228271484, 'val/ratio_var': 3.11719486489892e-06, 'val/num_eos_tokens': 0, 'lr': 4.145026947574718e-05, 'episode': 1400, 'epoch': 0.17}



                                                                       
 17%|█▋        | 351/2041 [30:25<2:24:43,  5.14s/it]

{'eps': 0, 'objective/kl': 36.81946563720703, 'objective/entropy': 7.136253833770752, 'objective/non_score_reward': -1.8409733772277832, 'objective/rlhf_reward': -0.8249540328979492, 'objective/scores': 1.016019344329834, 'policy/approxkl_avg': 0.0031933195423334837, 'policy/clipfrac_avg': 0.02712264284491539, 'loss/policy_avg': -0.003320372197777033, 'loss/value_avg': 0.01395491138100624, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.19775646924972534, 'val/ratio': 1.0076020956039429, 'val/ratio_var': 3.875432230415754e-05, 'val/num_eos_tokens': 0, 'lr': 4.142577168054875e-05, 'episode': 1404, 'epoch': 0.17}



                                                                       
 17%|█▋        | 352/2041 [30:30<2:23:55,  5.11s/it]

{'eps': 0, 'objective/kl': 39.8826904296875, 'objective/entropy': 13.202248573303223, 'objective/non_score_reward': -1.9941344261169434, 'objective/rlhf_reward': -0.9002389907836914, 'objective/scores': 1.093895435333252, 'policy/approxkl_avg': 0.001999990548938513, 'policy/clipfrac_avg': 0.02004716917872429, 'loss/policy_avg': -0.003223082982003689, 'loss/value_avg': 0.01867073029279709, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.17748306691646576, 'val/ratio': 0.9970371127128601, 'val/ratio_var': 7.696256034250837e-06, 'val/num_eos_tokens': 0, 'lr': 4.1401273885350325e-05, 'episode': 1408, 'epoch': 0.17}



                                                                       
 17%|█▋        | 353/2041 [30:35<2:23:35,  5.10s/it]

{'eps': 0, 'objective/kl': 37.030330657958984, 'objective/entropy': 8.724973678588867, 'objective/non_score_reward': -1.8515164852142334, 'objective/rlhf_reward': -0.7275210618972778, 'objective/scores': 1.1239954233169556, 'policy/approxkl_avg': 0.00047275994438678026, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.0012575375149026513, 'loss/value_avg': 0.011375278234481812, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2067560851573944, 'val/ratio': 0.9991412162780762, 'val/ratio_var': 4.73731006422895e-07, 'val/num_eos_tokens': 0, 'lr': 4.1376776090151886e-05, 'episode': 1412, 'epoch': 0.17}



                                                                       
 17%|█▋        | 354/2041 [30:40<2:24:06,  5.13s/it]

{'eps': 0, 'objective/kl': 43.59558868408203, 'objective/entropy': 13.5299072265625, 'objective/non_score_reward': -2.1797797679901123, 'objective/rlhf_reward': -1.376307487487793, 'objective/scores': 0.8034722805023193, 'policy/approxkl_avg': 0.0012554279528558254, 'policy/clipfrac_avg': 0.021226413547992706, 'loss/policy_avg': -0.007647999096661806, 'loss/value_avg': 0.04449736326932907, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.21822944283485413, 'val/ratio': 1.0056735277175903, 'val/ratio_var': 3.522861879901029e-05, 'val/num_eos_tokens': 0, 'lr': 4.1352278294953454e-05, 'episode': 1416, 'epoch': 0.17}



                                                                       
 17%|█▋        | 355/2041 [30:45<2:23:25,  5.10s/it]

{'eps': 0, 'objective/kl': 42.916500091552734, 'objective/entropy': 14.844377517700195, 'objective/non_score_reward': -2.145825147628784, 'objective/rlhf_reward': -1.2577862739562988, 'objective/scores': 0.8880388736724854, 'policy/approxkl_avg': 0.0035050194710493088, 'policy/clipfrac_avg': 0.02712264284491539, 'loss/policy_avg': -0.0059562562964856625, 'loss/value_avg': 0.022816333919763565, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.23375973105430603, 'val/ratio': 0.9987186193466187, 'val/ratio_var': 1.6495238241986954e-06, 'val/num_eos_tokens': 0, 'lr': 4.132778049975503e-05, 'episode': 1420, 'epoch': 0.17}



                                                                       
 17%|█▋        | 356/2041 [30:50<2:23:32,  5.11s/it]

{'eps': 0, 'objective/kl': 37.95008087158203, 'objective/entropy': 10.174860000610352, 'objective/non_score_reward': -1.8975040912628174, 'objective/rlhf_reward': -1.0210652351379395, 'objective/scores': 0.8764388561248779, 'policy/approxkl_avg': 0.0009936289861798286, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.0026146303862333298, 'loss/value_avg': 0.0336717888712883, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2041507512331009, 'val/ratio': 1.0034854412078857, 'val/ratio_var': 6.8908398134226445e-06, 'val/num_eos_tokens': 0, 'lr': 4.130328270455659e-05, 'episode': 1424, 'epoch': 0.17}



                                                                       
 17%|█▋        | 357/2041 [30:55<2:23:39,  5.12s/it]

{'eps': 0, 'objective/kl': 35.794898986816406, 'objective/entropy': 11.997647285461426, 'objective/non_score_reward': -1.7897448539733887, 'objective/rlhf_reward': -0.7275797128677368, 'objective/scores': 1.0621651411056519, 'policy/approxkl_avg': 0.002036536578088999, 'policy/clipfrac_avg': 0.016509434208273888, 'loss/policy_avg': -0.005187025293707848, 'loss/value_avg': 0.0130802346393466, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.23045723140239716, 'val/ratio': 1.0080457925796509, 'val/ratio_var': 6.62093298160471e-05, 'val/num_eos_tokens': 0, 'lr': 4.127878490935816e-05, 'episode': 1428, 'epoch': 0.17}



                                                                       
 18%|█▊        | 358/2041 [31:00<2:23:44,  5.12s/it]

{'eps': 0, 'objective/kl': 44.42962646484375, 'objective/entropy': 17.515119552612305, 'objective/non_score_reward': -2.2214815616607666, 'objective/rlhf_reward': -1.3694345951080322, 'objective/scores': 0.8520469665527344, 'policy/approxkl_avg': 0.009181280620396137, 'policy/clipfrac_avg': 0.044811323285102844, 'loss/policy_avg': -0.005964962765574455, 'loss/value_avg': 0.05311044305562973, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2098933756351471, 'val/ratio': 0.9898709058761597, 'val/ratio_var': 7.318196730921045e-05, 'val/num_eos_tokens': 0, 'lr': 4.1254287114159726e-05, 'episode': 1432, 'epoch': 0.18}



                                                                       
 18%|█▊        | 359/2041 [31:05<2:23:26,  5.12s/it]

{'eps': 0, 'objective/kl': 36.40520477294922, 'objective/entropy': 10.089447021484375, 'objective/non_score_reward': -1.8202604055404663, 'objective/rlhf_reward': -0.8008871078491211, 'objective/scores': 1.0193732976913452, 'policy/approxkl_avg': 0.005623562261462212, 'policy/clipfrac_avg': 0.04245283082127571, 'loss/policy_avg': -0.005994302220642567, 'loss/value_avg': 0.016060762107372284, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.20564277470111847, 'val/ratio': 1.0127379894256592, 'val/ratio_var': 9.391872299602255e-05, 'val/num_eos_tokens': 0, 'lr': 4.1229789318961294e-05, 'episode': 1436, 'epoch': 0.18}



                                                                       
 18%|█▊        | 360/2041 [31:11<2:22:50,  5.10s/it]

{'eps': 0, 'objective/kl': 36.939903259277344, 'objective/entropy': 11.061809539794922, 'objective/non_score_reward': -1.846995234489441, 'objective/rlhf_reward': -0.9517389535903931, 'objective/scores': 0.8952562808990479, 'policy/approxkl_avg': 0.0028419396840035915, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.0013569683069363236, 'loss/value_avg': 0.018810303881764412, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.224389910697937, 'val/ratio': 1.00648033618927, 'val/ratio_var': 2.7855925509356894e-05, 'val/num_eos_tokens': 0, 'lr': 4.120529152376286e-05, 'episode': 1440, 'epoch': 0.18}



                                                                       
 18%|█▊        | 361/2041 [31:16<2:23:57,  5.14s/it]

{'eps': 0, 'objective/kl': 36.68287658691406, 'objective/entropy': 8.707748413085938, 'objective/non_score_reward': -1.834143877029419, 'objective/rlhf_reward': -1.0243332386016846, 'objective/scores': 0.8098105788230896, 'policy/approxkl_avg': 0.0036467318423092365, 'policy/clipfrac_avg': 0.028301887214183807, 'loss/policy_avg': -0.002764000091701746, 'loss/value_avg': 0.022965073585510254, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.18135809898376465, 'val/ratio': 1.007875680923462, 'val/ratio_var': 5.299828990246169e-05, 'val/num_eos_tokens': 0, 'lr': 4.118079372856443e-05, 'episode': 1444, 'epoch': 0.18}



                                                                       
 18%|█▊        | 362/2041 [31:21<2:23:40,  5.13s/it]

{'eps': 0, 'objective/kl': 54.52867889404297, 'objective/entropy': 18.207744598388672, 'objective/non_score_reward': -2.7264342308044434, 'objective/rlhf_reward': -1.8579449653625488, 'objective/scores': 0.8684892654418945, 'policy/approxkl_avg': 0.011608373373746872, 'policy/clipfrac_avg': 0.0695754736661911, 'loss/policy_avg': -0.015161572955548763, 'loss/value_avg': 0.08079937100410461, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.28104567527770996, 'val/ratio': 1.004807472229004, 'val/ratio_var': 1.6813910406199284e-05, 'val/num_eos_tokens': 0, 'lr': 4.1156295933366e-05, 'episode': 1448, 'epoch': 0.18}



                                                                       
 18%|█▊        | 363/2041 [31:26<2:24:20,  5.16s/it]

{'eps': 0, 'objective/kl': 36.6086540222168, 'objective/entropy': 10.792686462402344, 'objective/non_score_reward': -1.830432653427124, 'objective/rlhf_reward': -0.8670792579650879, 'objective/scores': 0.9633533954620361, 'policy/approxkl_avg': 0.0007552983588539064, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.005164033733308315, 'loss/value_avg': 0.010762800462543964, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2202429622411728, 'val/ratio': 1.0031393766403198, 'val/ratio_var': 8.08158529252978e-06, 'val/num_eos_tokens': 0, 'lr': 4.1131798138167566e-05, 'episode': 1452, 'epoch': 0.18}



                                                                       
 18%|█▊        | 364/2041 [31:31<2:24:52,  5.18s/it]

{'eps': 0, 'objective/kl': 36.87436294555664, 'objective/entropy': 11.577598571777344, 'objective/non_score_reward': -1.84371817111969, 'objective/rlhf_reward': -0.9113448858261108, 'objective/scores': 0.9323732852935791, 'policy/approxkl_avg': 0.004802077077329159, 'policy/clipfrac_avg': 0.017688680440187454, 'loss/policy_avg': -0.005437660962343216, 'loss/value_avg': 0.015280451625585556, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2397223860025406, 'val/ratio': 1.0055265426635742, 'val/ratio_var': 2.3082060579326935e-05, 'val/num_eos_tokens': 0, 'lr': 4.1107300342969134e-05, 'episode': 1456, 'epoch': 0.18}



                                                                       
 18%|█▊        | 365/2041 [31:36<2:23:59,  5.15s/it]

{'eps': 0, 'objective/kl': 35.268348693847656, 'objective/entropy': 9.37564754486084, 'objective/non_score_reward': -1.763417363166809, 'objective/rlhf_reward': -0.9207755327224731, 'objective/scores': 0.8426418304443359, 'policy/approxkl_avg': 0.004906320478767157, 'policy/clipfrac_avg': 0.015330188907682896, 'loss/policy_avg': -0.010223491117358208, 'loss/value_avg': 0.018789248540997505, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.26466891169548035, 'val/ratio': 1.0093058347702026, 'val/ratio_var': 7.47149097151123e-05, 'val/num_eos_tokens': 0, 'lr': 4.10828025477707e-05, 'episode': 1460, 'epoch': 0.18}



                                                                       
 18%|█▊        | 366/2041 [31:42<2:24:11,  5.17s/it]

{'eps': 0, 'objective/kl': 46.68408966064453, 'objective/entropy': 14.75088882446289, 'objective/non_score_reward': -2.33420467376709, 'objective/rlhf_reward': -1.599892020225525, 'objective/scores': 0.7343126535415649, 'policy/approxkl_avg': 0.014095618389546871, 'policy/clipfrac_avg': 0.041273586452007294, 'loss/policy_avg': -0.01066929567605257, 'loss/value_avg': 0.05445704609155655, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16318275034427643, 'val/ratio': 0.9952551126480103, 'val/ratio_var': 1.4237521099857986e-05, 'val/num_eos_tokens': 0, 'lr': 4.105830475257227e-05, 'episode': 1464, 'epoch': 0.18}



                                                                       
 18%|█▊        | 367/2041 [31:47<2:24:47,  5.19s/it]

{'eps': 0, 'objective/kl': 34.68275451660156, 'objective/entropy': 2.606926441192627, 'objective/non_score_reward': -1.734137773513794, 'objective/rlhf_reward': -1.1642534732818604, 'objective/scores': 0.5698842406272888, 'policy/approxkl_avg': 0.000251724268309772, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 0.0020353849977254868, 'loss/value_avg': 0.011440249159932137, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.06552087515592575, 'val/ratio': 1.0011705160140991, 'val/ratio_var': 7.691942869314516e-07, 'val/num_eos_tokens': 0, 'lr': 4.103380695737384e-05, 'episode': 1468, 'epoch': 0.18}



                                                                       
 18%|█▊        | 368/2041 [31:52<2:23:44,  5.15s/it]

{'eps': 0, 'objective/kl': 34.064964294433594, 'objective/entropy': 1.9154644012451172, 'objective/non_score_reward': -1.703248143196106, 'objective/rlhf_reward': -1.1277978420257568, 'objective/scores': 0.5754503011703491, 'policy/approxkl_avg': 7.027776882750914e-05, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.0009862296283245087, 'loss/value_avg': 0.006178409792482853, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.07100628316402435, 'val/ratio': 0.9997859597206116, 'val/ratio_var': 7.75213564452315e-08, 'val/num_eos_tokens': 0, 'lr': 4.100930916217541e-05, 'episode': 1472, 'epoch': 0.18}



                                                                       
 18%|█▊        | 369/2041 [31:57<2:23:18,  5.14s/it]

{'eps': 0, 'objective/kl': 34.759220123291016, 'objective/entropy': 6.5669474601745605, 'objective/non_score_reward': -1.7379611730575562, 'objective/rlhf_reward': -1.1072746515274048, 'objective/scores': 0.6306865215301514, 'policy/approxkl_avg': 0.0013890970731154084, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.0016579488292336464, 'loss/value_avg': 0.012488208711147308, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11770687997341156, 'val/ratio': 1.003300666809082, 'val/ratio_var': 8.394466021854896e-06, 'val/num_eos_tokens': 0, 'lr': 4.0984811366976975e-05, 'episode': 1476, 'epoch': 0.18}



                                                                       
 18%|█▊        | 370/2041 [32:02<2:23:06,  5.14s/it]

{'eps': 0, 'objective/kl': 35.00005340576172, 'objective/entropy': 4.460241317749023, 'objective/non_score_reward': -1.7500027418136597, 'objective/rlhf_reward': -1.118111491203308, 'objective/scores': 0.6318912506103516, 'policy/approxkl_avg': 2.8665477657341398e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0006248405552469194, 'loss/value_avg': 0.007975861430168152, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12163137644529343, 'val/ratio': 0.9998370409011841, 'val/ratio_var': 1.6327788898706785e-08, 'val/num_eos_tokens': 0, 'lr': 4.096031357177854e-05, 'episode': 1480, 'epoch': 0.18}



                                                                       
 18%|█▊        | 371/2041 [32:07<2:22:56,  5.14s/it]

{'eps': 0, 'objective/kl': 44.65507888793945, 'objective/entropy': 12.138925552368164, 'objective/non_score_reward': -2.2327539920806885, 'objective/rlhf_reward': -1.6479743719100952, 'objective/scores': 0.5847796201705933, 'policy/approxkl_avg': 0.0035195162054151297, 'policy/clipfrac_avg': 0.022405661642551422, 'loss/policy_avg': -0.005647524259984493, 'loss/value_avg': 0.03840932622551918, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15802547335624695, 'val/ratio': 0.9961788654327393, 'val/ratio_var': 1.443079690943705e-05, 'val/num_eos_tokens': 0, 'lr': 4.093581577658011e-05, 'episode': 1484, 'epoch': 0.18}



                                                                       
 18%|█▊        | 372/2041 [32:13<2:23:11,  5.15s/it]

{'eps': 0, 'objective/kl': 33.288543701171875, 'objective/entropy': 2.6928858757019043, 'objective/non_score_reward': -1.6644275188446045, 'objective/rlhf_reward': -1.1910362243652344, 'objective/scores': 0.4733912944793701, 'policy/approxkl_avg': 1.1222919965803158e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.00016704144945833832, 'loss/value_avg': 0.0066859107464551926, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.07532542198896408, 'val/ratio': 0.9999788999557495, 'val/ratio_var': 1.5798585772941465e-09, 'val/num_eos_tokens': 0, 'lr': 4.091131798138167e-05, 'episode': 1488, 'epoch': 0.18}



                                                                       
 18%|█▊        | 373/2041 [32:18<2:23:33,  5.16s/it]

{'eps': 0, 'objective/kl': 35.72030258178711, 'objective/entropy': 4.722696304321289, 'objective/non_score_reward': -1.786015272140503, 'objective/rlhf_reward': -1.036839485168457, 'objective/scores': 0.7491758465766907, 'policy/approxkl_avg': 0.0011694136774167418, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.0011060349643230438, 'loss/value_avg': 0.012359987944364548, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09979203343391418, 'val/ratio': 1.0019948482513428, 'val/ratio_var': 3.699653689182014e-06, 'val/num_eos_tokens': 0, 'lr': 4.088682018618325e-05, 'episode': 1492, 'epoch': 0.18}



                                                                       
 18%|█▊        | 374/2041 [32:23<2:23:37,  5.17s/it]

{'eps': 0, 'objective/kl': 37.738487243652344, 'objective/entropy': 7.126598834991455, 'objective/non_score_reward': -1.8869242668151855, 'objective/rlhf_reward': -1.4226771593093872, 'objective/scores': 0.46424710750579834, 'policy/approxkl_avg': 0.00015450846694875509, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.0021134300623089075, 'loss/value_avg': 0.052651047706604004, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13657790422439575, 'val/ratio': 0.9995120167732239, 'val/ratio_var': 3.2349882417292974e-07, 'val/num_eos_tokens': 0, 'lr': 4.0862322390984815e-05, 'episode': 1496, 'epoch': 0.18}



                                                                       
 18%|█▊        | 375/2041 [32:28<2:23:46,  5.18s/it]

{'eps': 0, 'objective/kl': 36.15440368652344, 'objective/entropy': 8.433027267456055, 'objective/non_score_reward': -1.8077203035354614, 'objective/rlhf_reward': -1.2402420043945312, 'objective/scores': 0.5674782991409302, 'policy/approxkl_avg': 0.0021801611874252558, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.003855710383504629, 'loss/value_avg': 0.013328198343515396, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14225375652313232, 'val/ratio': 1.0042827129364014, 'val/ratio_var': 1.0826360266946722e-05, 'val/num_eos_tokens': 0, 'lr': 4.083782459578638e-05, 'episode': 1500, 'epoch': 0.18}



                                                                       
 18%|█▊        | 376/2041 [32:33<2:24:05,  5.19s/it]

{'eps': 0, 'objective/kl': 35.84455871582031, 'objective/entropy': 9.348934173583984, 'objective/non_score_reward': -1.792228102684021, 'objective/rlhf_reward': -1.143198013305664, 'objective/scores': 0.6490301489830017, 'policy/approxkl_avg': 0.0045987507328391075, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.00405243132263422, 'loss/value_avg': 0.014596543274819851, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1558263599872589, 'val/ratio': 1.0118637084960938, 'val/ratio_var': 0.00014441921666730195, 'val/num_eos_tokens': 0, 'lr': 4.081332680058795e-05, 'episode': 1504, 'epoch': 0.18}



                                                                       
 18%|█▊        | 377/2041 [32:39<2:24:12,  5.20s/it]

{'eps': 0, 'objective/kl': 37.82237243652344, 'objective/entropy': 12.825616836547852, 'objective/non_score_reward': -1.8911187648773193, 'objective/rlhf_reward': -1.376741647720337, 'objective/scores': 0.5143771171569824, 'policy/approxkl_avg': 0.001075626234523952, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.006816735956817865, 'loss/value_avg': 0.024570971727371216, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.20892417430877686, 'val/ratio': 0.9964978694915771, 'val/ratio_var': 1.1591499060159549e-05, 'val/num_eos_tokens': 0, 'lr': 4.078882900538952e-05, 'episode': 1508, 'epoch': 0.18}



                                                                       
 19%|█▊        | 378/2041 [32:44<2:23:55,  5.19s/it]

{'eps': 0, 'objective/kl': 35.43581008911133, 'objective/entropy': 6.5013508796691895, 'objective/non_score_reward': -1.7717903852462769, 'objective/rlhf_reward': -1.0754477977752686, 'objective/scores': 0.6963425874710083, 'policy/approxkl_avg': 0.0007346911588683724, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.0026442252565175295, 'loss/value_avg': 0.018425986170768738, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.19050584733486176, 'val/ratio': 0.9986687898635864, 'val/ratio_var': 4.2448987187526654e-06, 'val/num_eos_tokens': 0, 'lr': 4.076433121019109e-05, 'episode': 1512, 'epoch': 0.19}



                                                                       
 19%|█▊        | 379/2041 [32:49<2:24:09,  5.20s/it]

{'eps': 0, 'objective/kl': 51.234169006347656, 'objective/entropy': 9.541499137878418, 'objective/non_score_reward': -2.561708450317383, 'objective/rlhf_reward': -2.3422470092773438, 'objective/scores': 0.21946141123771667, 'policy/approxkl_avg': 0.3146284520626068, 'policy/clipfrac_avg': 0.028301887214183807, 'loss/policy_avg': -0.009454927407205105, 'loss/value_avg': 0.06964471936225891, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.18348059058189392, 'val/ratio': 1.0047515630722046, 'val/ratio_var': 1.172522024717182e-05, 'val/num_eos_tokens': 0, 'lr': 4.073983341499265e-05, 'episode': 1516, 'epoch': 0.19}



                                                                       
 19%|█▊        | 380/2041 [32:54<2:24:37,  5.22s/it]

{'eps': 0, 'objective/kl': 32.19513702392578, 'objective/entropy': 6.875992774963379, 'objective/non_score_reward': -1.6097567081451416, 'objective/rlhf_reward': -0.8893642425537109, 'objective/scores': 0.7203924655914307, 'policy/approxkl_avg': 0.0003032534441445023, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.00363240297883749, 'loss/value_avg': 0.008627157658338547, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14680451154708862, 'val/ratio': 0.9999974370002747, 'val/ratio_var': 3.1089442131815304e-08, 'val/num_eos_tokens': 0, 'lr': 4.071533561979422e-05, 'episode': 1520, 'epoch': 0.19}



                                                                       
 19%|█▊        | 381/2041 [32:59<2:23:50,  5.20s/it]

{'eps': 0, 'objective/kl': 35.16972351074219, 'objective/entropy': 5.1504740715026855, 'objective/non_score_reward': -1.758486270904541, 'objective/rlhf_reward': -0.9859008193016052, 'objective/scores': 0.7725854516029358, 'policy/approxkl_avg': 0.0013454672880470753, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.0018317362992092967, 'loss/value_avg': 0.010323273949325085, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15757828950881958, 'val/ratio': 0.9993083477020264, 'val/ratio_var': 9.032129923980392e-07, 'val/num_eos_tokens': 0, 'lr': 4.069083782459579e-05, 'episode': 1524, 'epoch': 0.19}



                                                                       
 19%|█▊        | 382/2041 [33:05<2:23:48,  5.20s/it]

{'eps': 0, 'objective/kl': 43.4124755859375, 'objective/entropy': 10.824097633361816, 'objective/non_score_reward': -2.170623779296875, 'objective/rlhf_reward': -1.4819185733795166, 'objective/scores': 0.6887052655220032, 'policy/approxkl_avg': 0.0020904813427478075, 'policy/clipfrac_avg': 0.016509434208273888, 'loss/policy_avg': -0.005711057223379612, 'loss/value_avg': 0.02966470643877983, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.21647103130817413, 'val/ratio': 0.9970601797103882, 'val/ratio_var': 6.392266641341848e-06, 'val/num_eos_tokens': 0, 'lr': 4.066634002939735e-05, 'episode': 1528, 'epoch': 0.19}



                                                                       
 19%|█▉        | 383/2041 [33:10<2:23:41,  5.20s/it]

{'eps': 0, 'objective/kl': 47.62359619140625, 'objective/entropy': 8.98891830444336, 'objective/non_score_reward': -2.3811798095703125, 'objective/rlhf_reward': -1.7163817882537842, 'objective/scores': 0.6647980213165283, 'policy/approxkl_avg': 0.0032008520793169737, 'policy/clipfrac_avg': 0.017688680440187454, 'loss/policy_avg': -0.007631221320480108, 'loss/value_avg': 0.024756543338298798, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15873347222805023, 'val/ratio': 0.9969103336334229, 'val/ratio_var': 6.48194327368401e-06, 'val/num_eos_tokens': 0, 'lr': 4.064184223419892e-05, 'episode': 1532, 'epoch': 0.19}



                                                                       
 19%|█▉        | 384/2041 [33:15<2:23:19,  5.19s/it]

{'eps': 0, 'objective/kl': 39.891109466552734, 'objective/entropy': 6.819033145904541, 'objective/non_score_reward': -1.9945554733276367, 'objective/rlhf_reward': -1.322815179824829, 'objective/scores': 0.6717403531074524, 'policy/approxkl_avg': 0.002923032734543085, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.0037530050612986088, 'loss/value_avg': 0.016927445307374, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15912345051765442, 'val/ratio': 0.9980963468551636, 'val/ratio_var': 2.8137080789747415e-06, 'val/num_eos_tokens': 0, 'lr': 4.0617344439000495e-05, 'episode': 1536, 'epoch': 0.19}



                                                                       
 19%|█▉        | 385/2041 [33:20<2:23:35,  5.20s/it]

{'eps': 0, 'objective/kl': 39.331233978271484, 'objective/entropy': 9.63569164276123, 'objective/non_score_reward': -1.9665617942810059, 'objective/rlhf_reward': -1.1818602085113525, 'objective/scores': 0.7847015261650085, 'policy/approxkl_avg': 0.0038326075300574303, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.003962453920394182, 'loss/value_avg': 0.011261790990829468, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16848911345005035, 'val/ratio': 0.9993732571601868, 'val/ratio_var': 2.0304368320012145e-07, 'val/num_eos_tokens': 0, 'lr': 4.059284664380206e-05, 'episode': 1540, 'epoch': 0.19}



                                                                       
 19%|█▉        | 386/2041 [33:25<2:23:06,  5.19s/it]

{'eps': 0, 'objective/kl': 33.6449089050293, 'objective/entropy': 6.223471164703369, 'objective/non_score_reward': -1.6822454929351807, 'objective/rlhf_reward': -0.9237887859344482, 'objective/scores': 0.7584567070007324, 'policy/approxkl_avg': 0.00026223459281027317, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0003020527074113488, 'loss/value_avg': 0.012677428312599659, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12293456494808197, 'val/ratio': 0.9991830587387085, 'val/ratio_var': 3.285798868546408e-07, 'val/num_eos_tokens': 0, 'lr': 4.0568348848603625e-05, 'episode': 1544, 'epoch': 0.19}



                                                                       
 19%|█▉        | 387/2041 [33:31<2:23:09,  5.19s/it]

{'eps': 0, 'objective/kl': 34.127037048339844, 'objective/entropy': 11.750219345092773, 'objective/non_score_reward': -1.7063519954681396, 'objective/rlhf_reward': -1.0201148986816406, 'objective/scores': 0.686237096786499, 'policy/approxkl_avg': 0.0013147031422704458, 'policy/clipfrac_avg': 0.02004716917872429, 'loss/policy_avg': -0.002669780980795622, 'loss/value_avg': 0.013974440284073353, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14405816793441772, 'val/ratio': 1.0034332275390625, 'val/ratio_var': 7.738047315797303e-06, 'val/num_eos_tokens': 0, 'lr': 4.05438510534052e-05, 'episode': 1548, 'epoch': 0.19}



                                                                       
 19%|█▉        | 388/2041 [33:36<2:23:04,  5.19s/it]

{'eps': 0, 'objective/kl': 36.42578887939453, 'objective/entropy': 7.597787857055664, 'objective/non_score_reward': -1.8212894201278687, 'objective/rlhf_reward': -1.1372936964035034, 'objective/scores': 0.6839957237243652, 'policy/approxkl_avg': 0.0006142951315268874, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.00292753498069942, 'loss/value_avg': 0.009303102269768715, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1503281593322754, 'val/ratio': 1.001512050628662, 'val/ratio_var': 1.1079062005592277e-06, 'val/num_eos_tokens': 0, 'lr': 4.051935325820677e-05, 'episode': 1552, 'epoch': 0.19}



                                                                       
 19%|█▉        | 389/2041 [33:41<2:23:14,  5.20s/it]

{'eps': 0, 'objective/kl': 38.298667907714844, 'objective/entropy': 9.757364273071289, 'objective/non_score_reward': -1.914933443069458, 'objective/rlhf_reward': -1.0333243608474731, 'objective/scores': 0.8816090822219849, 'policy/approxkl_avg': 0.0026610037311911583, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.0035133755300194025, 'loss/value_avg': 0.015336031094193459, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1795131266117096, 'val/ratio': 1.0013864040374756, 'val/ratio_var': 1.578230353516119e-06, 'val/num_eos_tokens': 0, 'lr': 4.049485546300833e-05, 'episode': 1556, 'epoch': 0.19}



                                                                       
 19%|█▉        | 390/2041 [33:46<2:23:43,  5.22s/it]

{'eps': 0, 'objective/kl': 35.70006561279297, 'objective/entropy': 10.92800521850586, 'objective/non_score_reward': -1.7850033044815063, 'objective/rlhf_reward': -1.1277731657028198, 'objective/scores': 0.6572301387786865, 'policy/approxkl_avg': 0.00046573596773669124, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.005097347777336836, 'loss/value_avg': 0.01239905133843422, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.21844089031219482, 'val/ratio': 1.0016992092132568, 'val/ratio_var': 1.4825471907897736e-06, 'val/num_eos_tokens': 0, 'lr': 4.04703576678099e-05, 'episode': 1560, 'epoch': 0.19}



                                                                       
 19%|█▉        | 391/2041 [33:51<2:23:46,  5.23s/it]

{'eps': 0, 'objective/kl': 34.309181213378906, 'objective/entropy': 11.739703178405762, 'objective/non_score_reward': -1.715458869934082, 'objective/rlhf_reward': -1.3687896728515625, 'objective/scores': 0.3466692566871643, 'policy/approxkl_avg': 0.017475895583629608, 'policy/clipfrac_avg': 0.04599056765437126, 'loss/policy_avg': -0.014881757088005543, 'loss/value_avg': 0.02439921535551548, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16678591072559357, 'val/ratio': 0.9934511780738831, 'val/ratio_var': 2.2841724785394035e-05, 'val/num_eos_tokens': 0, 'lr': 4.044585987261147e-05, 'episode': 1564, 'epoch': 0.19}



                                                                       
 19%|█▉        | 392/2041 [33:57<2:23:10,  5.21s/it]

{'eps': 0, 'objective/kl': 36.603084564208984, 'objective/entropy': 6.194986820220947, 'objective/non_score_reward': -1.8301544189453125, 'objective/rlhf_reward': -1.0906624794006348, 'objective/scores': 0.7394919395446777, 'policy/approxkl_avg': 0.0011265912326052785, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.0018246525432914495, 'loss/value_avg': 0.009865724481642246, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12193873524665833, 'val/ratio': 1.0001311302185059, 'val/ratio_var': 4.6863501523830564e-08, 'val/num_eos_tokens': 0, 'lr': 4.042136207741303e-05, 'episode': 1568, 'epoch': 0.19}



                                                                       
 19%|█▉        | 393/2041 [34:02<2:22:54,  5.20s/it]

{'eps': 0, 'objective/kl': 36.16484069824219, 'objective/entropy': 6.8340654373168945, 'objective/non_score_reward': -1.8082423210144043, 'objective/rlhf_reward': -1.0342090129852295, 'objective/scores': 0.7740333080291748, 'policy/approxkl_avg': 0.000829398981295526, 'policy/clipfrac_avg': 0.008254717104136944, 'loss/policy_avg': -0.0009231645963154733, 'loss/value_avg': 0.013308651745319366, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12087906897068024, 'val/ratio': 0.9999489784240723, 'val/ratio_var': 1.0409723927296e-07, 'val/num_eos_tokens': 0, 'lr': 4.03968642822146e-05, 'episode': 1572, 'epoch': 0.19}



                                                                       
 19%|█▉        | 394/2041 [34:07<2:21:37,  5.16s/it]

{'eps': 0, 'objective/kl': 36.218990325927734, 'objective/entropy': 6.0094218254089355, 'objective/non_score_reward': -1.810949444770813, 'objective/rlhf_reward': -1.2316943407058716, 'objective/scores': 0.5792551040649414, 'policy/approxkl_avg': 0.0037302023265510798, 'policy/clipfrac_avg': 0.016509434208273888, 'loss/policy_avg': -0.0033334637992084026, 'loss/value_avg': 0.0136804124340415, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11090105026960373, 'val/ratio': 1.0016071796417236, 'val/ratio_var': 1.4591438457500772e-06, 'val/num_eos_tokens': 0, 'lr': 4.037236648701617e-05, 'episode': 1576, 'epoch': 0.19}



                                                                       
 19%|█▉        | 395/2041 [34:12<2:21:37,  5.16s/it]

{'eps': 0, 'objective/kl': 37.8260498046875, 'objective/entropy': 4.4877448081970215, 'objective/non_score_reward': -1.8913025856018066, 'objective/rlhf_reward': -1.074903130531311, 'objective/scores': 0.8163994550704956, 'policy/approxkl_avg': 0.0031125943642109632, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.00086513243149966, 'loss/value_avg': 0.003806122113019228, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09512151032686234, 'val/ratio': 0.9977681040763855, 'val/ratio_var': 6.982443210290512e-06, 'val/num_eos_tokens': 0, 'lr': 4.034786869181774e-05, 'episode': 1580, 'epoch': 0.19}



                                                                       
 19%|█▉        | 396/2041 [34:17<2:21:08,  5.15s/it]

{'eps': 0, 'objective/kl': 38.678367614746094, 'objective/entropy': 4.396188259124756, 'objective/non_score_reward': -1.9339184761047363, 'objective/rlhf_reward': -1.1948473453521729, 'objective/scores': 0.7390710711479187, 'policy/approxkl_avg': 0.0015766132855787873, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.00018435379024595022, 'loss/value_avg': 0.006904290523380041, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1113973930478096, 'val/ratio': 1.0036754608154297, 'val/ratio_var': 9.284754924010485e-06, 'val/num_eos_tokens': 0, 'lr': 4.0323370896619305e-05, 'episode': 1584, 'epoch': 0.19}



                                                                       
 19%|█▉        | 397/2041 [34:22<2:21:03,  5.15s/it]

{'eps': 0, 'objective/kl': 34.35294723510742, 'objective/entropy': 6.7428879737854, 'objective/non_score_reward': -1.7176474332809448, 'objective/rlhf_reward': -1.0607590675354004, 'objective/scores': 0.6568883657455444, 'policy/approxkl_avg': 0.002351087285205722, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.004981041885912418, 'loss/value_avg': 0.014702031388878822, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1009662076830864, 'val/ratio': 0.9980179667472839, 'val/ratio_var': 3.6040510167367756e-06, 'val/num_eos_tokens': 0, 'lr': 4.029887310142087e-05, 'episode': 1588, 'epoch': 0.19}



                                                                       
 20%|█▉        | 398/2041 [34:27<2:20:39,  5.14s/it]

{'eps': 0, 'objective/kl': 37.56462860107422, 'objective/entropy': 4.128228187561035, 'objective/non_score_reward': -1.878231406211853, 'objective/rlhf_reward': -1.1384263038635254, 'objective/scores': 0.7398051023483276, 'policy/approxkl_avg': 0.00018960512534249574, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.0015285953413695097, 'loss/value_avg': 0.005962371360510588, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09471847116947174, 'val/ratio': 0.9981602430343628, 'val/ratio_var': 2.6685836473916424e-06, 'val/num_eos_tokens': 0, 'lr': 4.027437530622244e-05, 'episode': 1592, 'epoch': 0.2}



                                                                       
 20%|█▉        | 399/2041 [34:33<2:21:15,  5.16s/it]

{'eps': 0, 'objective/kl': 35.25609588623047, 'objective/entropy': 3.6006202697753906, 'objective/non_score_reward': -1.7628049850463867, 'objective/rlhf_reward': -1.1638647317886353, 'objective/scores': 0.5989402532577515, 'policy/approxkl_avg': 0.0005165794864296913, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.001151682110503316, 'loss/value_avg': 0.005492861848324537, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1248289942741394, 'val/ratio': 0.9980698823928833, 'val/ratio_var': 3.2497998745384393e-06, 'val/num_eos_tokens': 0, 'lr': 4.024987751102401e-05, 'episode': 1596, 'epoch': 0.2}



                                                                       
 20%|█▉        | 400/2041 [34:38<2:20:50,  5.15s/it]

{'eps': 0, 'objective/kl': 37.12438201904297, 'objective/entropy': 8.887928009033203, 'objective/non_score_reward': -1.8562194108963013, 'objective/rlhf_reward': -1.0792672634124756, 'objective/scores': 0.7769522070884705, 'policy/approxkl_avg': 0.004325297195464373, 'policy/clipfrac_avg': 0.024764150381088257, 'loss/policy_avg': -0.004242470487952232, 'loss/value_avg': 0.014169822447001934, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1451987773180008, 'val/ratio': 1.0048085451126099, 'val/ratio_var': 1.8254942915518768e-05, 'val/num_eos_tokens': 0, 'lr': 4.022537971582558e-05, 'episode': 1600, 'epoch': 0.2}



                                                                       
 20%|█▉        | 401/2041 [34:43<2:20:32,  5.14s/it]

{'eps': 0, 'objective/kl': 35.1373176574707, 'objective/entropy': 6.262405872344971, 'objective/non_score_reward': -1.7568659782409668, 'objective/rlhf_reward': -0.9382970333099365, 'objective/scores': 0.8185689449310303, 'policy/approxkl_avg': 0.0015198015607893467, 'policy/clipfrac_avg': 0.016509434208273888, 'loss/policy_avg': -0.0013865691144019365, 'loss/value_avg': 0.008290303871035576, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1292712390422821, 'val/ratio': 1.0027192831039429, 'val/ratio_var': 4.808553512702929e-06, 'val/num_eos_tokens': 0, 'lr': 4.0200881920627145e-05, 'episode': 1604, 'epoch': 0.2}



                                                                       
 20%|█▉        | 402/2041 [34:48<2:20:18,  5.14s/it]

{'eps': 0, 'objective/kl': 37.89111328125, 'objective/entropy': 7.729986190795898, 'objective/non_score_reward': -1.894555687904358, 'objective/rlhf_reward': -1.1159601211547852, 'objective/scores': 0.7785956263542175, 'policy/approxkl_avg': 0.0014697478618472815, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.0050992961041629314, 'loss/value_avg': 0.010085668414831161, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15137866139411926, 'val/ratio': 0.9983783960342407, 'val/ratio_var': 1.453010440854996e-06, 'val/num_eos_tokens': 0, 'lr': 4.017638412542871e-05, 'episode': 1608, 'epoch': 0.2}



                                                                       
 20%|█▉        | 403/2041 [34:53<2:19:53,  5.12s/it]

{'eps': 0, 'objective/kl': 35.64198303222656, 'objective/entropy': 5.934588432312012, 'objective/non_score_reward': -1.7820992469787598, 'objective/rlhf_reward': -0.9456971287727356, 'objective/scores': 0.8364021182060242, 'policy/approxkl_avg': 0.0027024494484066963, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.0021617563907057047, 'loss/value_avg': 0.012137416750192642, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16444918513298035, 'val/ratio': 0.9950700402259827, 'val/ratio_var': 1.4608357560064178e-05, 'val/num_eos_tokens': 0, 'lr': 4.015188633023028e-05, 'episode': 1612, 'epoch': 0.2}



                                                                       
 20%|█▉        | 404/2041 [34:58<2:19:28,  5.11s/it]

{'eps': 0, 'objective/kl': 34.5184326171875, 'objective/entropy': 10.511092185974121, 'objective/non_score_reward': -1.725921630859375, 'objective/rlhf_reward': -0.8441765904426575, 'objective/scores': 0.8817450404167175, 'policy/approxkl_avg': 0.0009836506797000766, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.002433664398267865, 'loss/value_avg': 0.0130692757666111, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16942554712295532, 'val/ratio': 1.003097414970398, 'val/ratio_var': 5.80434470975888e-06, 'val/num_eos_tokens': 0, 'lr': 4.012738853503185e-05, 'episode': 1616, 'epoch': 0.2}



                                                                       
 20%|█▉        | 405/2041 [35:03<2:19:53,  5.13s/it]

{'eps': 0, 'objective/kl': 39.827857971191406, 'objective/entropy': 9.807905197143555, 'objective/non_score_reward': -1.991392970085144, 'objective/rlhf_reward': -1.146580696105957, 'objective/scores': 0.844812273979187, 'policy/approxkl_avg': 0.003026404185220599, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.005493246950209141, 'loss/value_avg': 0.011143945157527924, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1867438554763794, 'val/ratio': 0.999424159526825, 'val/ratio_var': 1.6370192668091477e-07, 'val/num_eos_tokens': 0, 'lr': 4.010289073983342e-05, 'episode': 1620, 'epoch': 0.2}



                                                                       
 20%|█▉        | 406/2041 [35:08<2:19:56,  5.14s/it]

{'eps': 0, 'objective/kl': 37.22791290283203, 'objective/entropy': 12.76474380493164, 'objective/non_score_reward': -1.8613955974578857, 'objective/rlhf_reward': -1.0821303129196167, 'objective/scores': 0.779265284538269, 'policy/approxkl_avg': 0.009026825428009033, 'policy/clipfrac_avg': 0.044811319559812546, 'loss/policy_avg': -0.007664293050765991, 'loss/value_avg': 0.018457679077982903, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16859084367752075, 'val/ratio': 1.0028793811798096, 'val/ratio_var': 4.076740879099816e-06, 'val/num_eos_tokens': 0, 'lr': 4.0078392944634986e-05, 'episode': 1624, 'epoch': 0.2}



                                                                       
 20%|█▉        | 407/2041 [35:14<2:20:04,  5.14s/it]

{'eps': 0, 'objective/kl': 35.34416961669922, 'objective/entropy': 7.126609802246094, 'objective/non_score_reward': -1.7672085762023926, 'objective/rlhf_reward': -0.9955863952636719, 'objective/scores': 0.7716221809387207, 'policy/approxkl_avg': 0.0005033689667470753, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.0008077880484052002, 'loss/value_avg': 0.012614395469427109, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1768607199192047, 'val/ratio': 1.0014564990997314, 'val/ratio_var': 1.1588053894229233e-06, 'val/num_eos_tokens': 0, 'lr': 4.0053895149436554e-05, 'episode': 1628, 'epoch': 0.2}



                                                                       
 20%|█▉        | 408/2041 [35:19<2:20:14,  5.15s/it]

{'eps': 0, 'objective/kl': 38.157711029052734, 'objective/entropy': 7.512561798095703, 'objective/non_score_reward': -1.9078855514526367, 'objective/rlhf_reward': -1.0536468029022217, 'objective/scores': 0.854238748550415, 'policy/approxkl_avg': 0.0006637030746787786, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.0014661967288702726, 'loss/value_avg': 0.007802359294146299, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1268041878938675, 'val/ratio': 0.9986399412155151, 'val/ratio_var': 1.6460202232337906e-06, 'val/num_eos_tokens': 0, 'lr': 4.002939735423812e-05, 'episode': 1632, 'epoch': 0.2}



                                                                       
 20%|██        | 409/2041 [35:24<2:21:05,  5.19s/it]

{'eps': 0, 'objective/kl': 37.914039611816406, 'objective/entropy': 5.195699691772461, 'objective/non_score_reward': -1.8957021236419678, 'objective/rlhf_reward': -1.1180415153503418, 'objective/scores': 0.777660608291626, 'policy/approxkl_avg': 0.0015055446419864893, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.004870166536420584, 'loss/value_avg': 0.016772687435150146, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11871539801359177, 'val/ratio': 0.998270571231842, 'val/ratio_var': 2.6461548259248957e-06, 'val/num_eos_tokens': 0, 'lr': 4.000489955903969e-05, 'episode': 1636, 'epoch': 0.2}



                                                                       
 20%|██        | 410/2041 [35:32<2:46:18,  6.12s/it]

{'eps': 0, 'objective/kl': 32.22378158569336, 'objective/entropy': 8.678116798400879, 'objective/non_score_reward': -1.6111888885498047, 'objective/rlhf_reward': -0.7836877107620239, 'objective/scores': 0.8275011777877808, 'policy/approxkl_avg': 0.0005981263238936663, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.0022490080446004868, 'loss/value_avg': 0.010655473917722702, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.10546710342168808, 'val/ratio': 0.9988018870353699, 'val/ratio_var': 8.39534550323151e-07, 'val/num_eos_tokens': 0, 'lr': 3.998040176384126e-05, 'episode': 1640, 'epoch': 0.2}



                                                                       
 20%|██        | 411/2041 [35:38<2:38:08,  5.82s/it]

{'eps': 0, 'objective/kl': 35.03813552856445, 'objective/entropy': 5.545526504516602, 'objective/non_score_reward': -1.7519068717956543, 'objective/rlhf_reward': -0.8656061887741089, 'objective/scores': 0.8863006830215454, 'policy/approxkl_avg': 0.0007131327292881906, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.0017878490034490824, 'loss/value_avg': 0.006984882056713104, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14424045383930206, 'val/ratio': 1.0026679039001465, 'val/ratio_var': 8.90595492819557e-06, 'val/num_eos_tokens': 0, 'lr': 3.995590396864282e-05, 'episode': 1644, 'epoch': 0.2}



                                                                       
 20%|██        | 412/2041 [35:43<2:32:32,  5.62s/it]

{'eps': 0, 'objective/kl': 33.19884490966797, 'objective/entropy': 7.449207305908203, 'objective/non_score_reward': -1.6599421501159668, 'objective/rlhf_reward': -0.9082176089286804, 'objective/scores': 0.7517245411872864, 'policy/approxkl_avg': 0.0015267590060830116, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.002642907202243805, 'loss/value_avg': 0.011483084410429, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1522657871246338, 'val/ratio': 0.9942383766174316, 'val/ratio_var': 3.0508794225170277e-05, 'val/num_eos_tokens': 0, 'lr': 3.9931406173444394e-05, 'episode': 1648, 'epoch': 0.2}



                                                                       
 20%|██        | 413/2041 [35:48<2:28:06,  5.46s/it]

{'eps': 0, 'objective/kl': 37.07850646972656, 'objective/entropy': 9.225316047668457, 'objective/non_score_reward': -1.8539254665374756, 'objective/rlhf_reward': -0.86537104845047, 'objective/scores': 0.9885544180870056, 'policy/approxkl_avg': 0.0005913387285545468, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.002116357907652855, 'loss/value_avg': 0.010423867031931877, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14394143223762512, 'val/ratio': 1.00143563747406, 'val/ratio_var': 2.781998773571104e-06, 'val/num_eos_tokens': 0, 'lr': 3.990690837824596e-05, 'episode': 1652, 'epoch': 0.2}



                                                                       
 20%|██        | 414/2041 [35:53<2:25:43,  5.37s/it]

{'eps': 0, 'objective/kl': 34.872154235839844, 'objective/entropy': 9.717339515686035, 'objective/non_score_reward': -1.743607997894287, 'objective/rlhf_reward': -0.9436205625534058, 'objective/scores': 0.7999874353408813, 'policy/approxkl_avg': 0.0032492049504071474, 'policy/clipfrac_avg': 0.02004716917872429, 'loss/policy_avg': -0.0029499486554414034, 'loss/value_avg': 0.010242652148008347, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12123227119445801, 'val/ratio': 0.9977610111236572, 'val/ratio_var': 3.551439476723317e-06, 'val/num_eos_tokens': 0, 'lr': 3.988241058304753e-05, 'episode': 1656, 'epoch': 0.2}



                                                                       
 20%|██        | 415/2041 [35:58<2:24:12,  5.32s/it]

{'eps': 0, 'objective/kl': 34.80602264404297, 'objective/entropy': 3.8368520736694336, 'objective/non_score_reward': -1.7403013706207275, 'objective/rlhf_reward': -0.8459784388542175, 'objective/scores': 0.89432293176651, 'policy/approxkl_avg': 0.0007626812439411879, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.0017110772896558046, 'loss/value_avg': 0.009586158208549023, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13734565675258636, 'val/ratio': 0.9986390471458435, 'val/ratio_var': 4.1938988033507485e-06, 'val/num_eos_tokens': 0, 'lr': 3.985791278784909e-05, 'episode': 1660, 'epoch': 0.2}



                                                                       
 20%|██        | 416/2041 [36:03<2:21:13,  5.21s/it]

{'eps': 0, 'objective/kl': 35.52989959716797, 'objective/entropy': 6.873213768005371, 'objective/non_score_reward': -1.7764952182769775, 'objective/rlhf_reward': -0.7942895889282227, 'objective/scores': 0.9822056293487549, 'policy/approxkl_avg': 0.0018552071414887905, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.0014806012623012066, 'loss/value_avg': 0.010996920987963676, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14184653759002686, 'val/ratio': 1.0030057430267334, 'val/ratio_var': 4.604706646205159e-06, 'val/num_eos_tokens': 0, 'lr': 3.9833414992650666e-05, 'episode': 1664, 'epoch': 0.2}



                                                                       
 20%|██        | 417/2041 [36:08<2:20:06,  5.18s/it]

{'eps': 0, 'objective/kl': 35.672706604003906, 'objective/entropy': 6.012468338012695, 'objective/non_score_reward': -1.7836353778839111, 'objective/rlhf_reward': -0.759225606918335, 'objective/scores': 1.0244097709655762, 'policy/approxkl_avg': 0.003218767000362277, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.0018411485943943262, 'loss/value_avg': 0.009866220876574516, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.17608922719955444, 'val/ratio': 0.9969295859336853, 'val/ratio_var': 1.2033985512971412e-05, 'val/num_eos_tokens': 0, 'lr': 3.9808917197452234e-05, 'episode': 1668, 'epoch': 0.2}



                                                                       
 20%|██        | 418/2041 [36:13<2:19:08,  5.14s/it]

{'eps': 0, 'objective/kl': 38.6773567199707, 'objective/entropy': 12.241933822631836, 'objective/non_score_reward': -1.9338679313659668, 'objective/rlhf_reward': -0.9274967908859253, 'objective/scores': 1.0063711404800415, 'policy/approxkl_avg': 0.00861174426972866, 'policy/clipfrac_avg': 0.022405661642551422, 'loss/policy_avg': -0.009075761772692204, 'loss/value_avg': 0.018625780940055847, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1589270532131195, 'val/ratio': 1.0164598226547241, 'val/ratio_var': 0.00024457808467559516, 'val/num_eos_tokens': 0, 'lr': 3.9784419402253795e-05, 'episode': 1672, 'epoch': 0.2}



                                                                       
 21%|██        | 419/2041 [36:18<2:18:09,  5.11s/it]

{'eps': 0, 'objective/kl': 36.94792938232422, 'objective/entropy': 9.201622009277344, 'objective/non_score_reward': -1.8473963737487793, 'objective/rlhf_reward': -0.9621944427490234, 'objective/scores': 0.8852019309997559, 'policy/approxkl_avg': 0.0016414046986028552, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.0022268323227763176, 'loss/value_avg': 0.012777344323694706, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14870011806488037, 'val/ratio': 1.0027239322662354, 'val/ratio_var': 1.0613897757139057e-05, 'val/num_eos_tokens': 0, 'lr': 3.975992160705537e-05, 'episode': 1676, 'epoch': 0.21}



                                                                       
 21%|██        | 420/2041 [36:23<2:18:07,  5.11s/it]

{'eps': 0, 'objective/kl': 35.97958755493164, 'objective/entropy': 8.835750579833984, 'objective/non_score_reward': -1.7989792823791504, 'objective/rlhf_reward': -0.738395094871521, 'objective/scores': 1.0605841875076294, 'policy/approxkl_avg': 0.001444133697077632, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.001988102914765477, 'loss/value_avg': 0.009997506625950336, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1488877683877945, 'val/ratio': 1.001142978668213, 'val/ratio_var': 1.5323445268222713e-06, 'val/num_eos_tokens': 0, 'lr': 3.973542381185694e-05, 'episode': 1680, 'epoch': 0.21}



                                                                       
 21%|██        | 421/2041 [36:29<2:18:07,  5.12s/it]

{'eps': 0, 'objective/kl': 35.324859619140625, 'objective/entropy': 7.751314640045166, 'objective/non_score_reward': -1.7662431001663208, 'objective/rlhf_reward': -0.6535898447036743, 'objective/scores': 1.1126532554626465, 'policy/approxkl_avg': 0.003760876599699259, 'policy/clipfrac_avg': 0.028301887214183807, 'loss/policy_avg': -0.0010318857384845614, 'loss/value_avg': 0.01343490555882454, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1455453336238861, 'val/ratio': 1.0090420246124268, 'val/ratio_var': 4.3032709072576836e-05, 'val/num_eos_tokens': 0, 'lr': 3.97109260166585e-05, 'episode': 1684, 'epoch': 0.21}



                                                                       
 21%|██        | 422/2041 [36:34<2:17:44,  5.10s/it]

{'eps': 0, 'objective/kl': 35.913787841796875, 'objective/entropy': 6.246572971343994, 'objective/non_score_reward': -1.795689344406128, 'objective/rlhf_reward': -0.7133661508560181, 'objective/scores': 1.0823231935501099, 'policy/approxkl_avg': 0.0006738752126693726, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.0007846332737244666, 'loss/value_avg': 0.009289909154176712, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15846030414104462, 'val/ratio': 0.9992470741271973, 'val/ratio_var': 3.3889708106471517e-07, 'val/num_eos_tokens': 0, 'lr': 3.968642822146007e-05, 'episode': 1688, 'epoch': 0.21}



                                                                       
 21%|██        | 423/2041 [36:39<2:18:26,  5.13s/it]

{'eps': 0, 'objective/kl': 35.45128631591797, 'objective/entropy': 7.346288204193115, 'objective/non_score_reward': -1.77256441116333, 'objective/rlhf_reward': -0.6180617809295654, 'objective/scores': 1.1545026302337646, 'policy/approxkl_avg': 0.000918059900868684, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.0008400431252084672, 'loss/value_avg': 0.011561032384634018, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12137439846992493, 'val/ratio': 1.0009132623672485, 'val/ratio_var': 1.4889475323798251e-06, 'val/num_eos_tokens': 0, 'lr': 3.966193042626164e-05, 'episode': 1692, 'epoch': 0.21}



                                                                       
 21%|██        | 424/2041 [36:44<2:18:31,  5.14s/it]

{'eps': 0, 'objective/kl': 36.22053909301758, 'objective/entropy': 6.0454559326171875, 'objective/non_score_reward': -1.8110270500183105, 'objective/rlhf_reward': -0.7132599353790283, 'objective/scores': 1.0977671146392822, 'policy/approxkl_avg': 0.00031873679836280644, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.0019601359963417053, 'loss/value_avg': 0.008466492407023907, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1624387502670288, 'val/ratio': 1.000138759613037, 'val/ratio_var': 1.1828892532150803e-08, 'val/num_eos_tokens': 0, 'lr': 3.9637432631063204e-05, 'episode': 1696, 'epoch': 0.21}



                                                                       
 21%|██        | 425/2041 [36:49<2:18:35,  5.15s/it]

{'eps': 0, 'objective/kl': 36.96540069580078, 'objective/entropy': 9.390724182128906, 'objective/non_score_reward': -1.848270058631897, 'objective/rlhf_reward': -0.7664400339126587, 'objective/scores': 1.0818300247192383, 'policy/approxkl_avg': 0.003414356615394354, 'policy/clipfrac_avg': 0.025943394750356674, 'loss/policy_avg': -0.0030042168218642473, 'loss/value_avg': 0.010970225557684898, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1291191577911377, 'val/ratio': 1.0004355907440186, 'val/ratio_var': 5.086833994027984e-07, 'val/num_eos_tokens': 0, 'lr': 3.961293483586477e-05, 'episode': 1700, 'epoch': 0.21}



                                                                       
 21%|██        | 426/2041 [36:54<2:18:29,  5.15s/it]

{'eps': 0, 'objective/kl': 35.85527038574219, 'objective/entropy': 8.59716796875, 'objective/non_score_reward': -1.7927634716033936, 'objective/rlhf_reward': -0.7738132476806641, 'objective/scores': 1.0189502239227295, 'policy/approxkl_avg': 0.004642731510102749, 'policy/clipfrac_avg': 0.022405661642551422, 'loss/policy_avg': -0.0033908102195709944, 'loss/value_avg': 0.012367824092507362, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1252320408821106, 'val/ratio': 1.0070340633392334, 'val/ratio_var': 2.8411521270754747e-05, 'val/num_eos_tokens': 0, 'lr': 3.958843704066634e-05, 'episode': 1704, 'epoch': 0.21}



                                                                       
 21%|██        | 427/2041 [36:59<2:18:18,  5.14s/it]

{'eps': 0, 'objective/kl': 36.785675048828125, 'objective/entropy': 7.250953197479248, 'objective/non_score_reward': -1.8392839431762695, 'objective/rlhf_reward': -0.8028470277786255, 'objective/scores': 1.036436915397644, 'policy/approxkl_avg': 0.0012645203387364745, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.0035607218742370605, 'loss/value_avg': 0.008493531495332718, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13139760494232178, 'val/ratio': 1.0033750534057617, 'val/ratio_var': 1.3427143130684271e-05, 'val/num_eos_tokens': 0, 'lr': 3.9563939245467914e-05, 'episode': 1708, 'epoch': 0.21}



                                                                       
 21%|██        | 428/2041 [37:05<2:18:38,  5.16s/it]

{'eps': 0, 'objective/kl': 36.36825942993164, 'objective/entropy': 5.647487640380859, 'objective/non_score_reward': -1.8184130191802979, 'objective/rlhf_reward': -0.7556370496749878, 'objective/scores': 1.06277596950531, 'policy/approxkl_avg': 0.0019316233228892088, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.0040257154032588005, 'loss/value_avg': 0.00933288037776947, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15643635392189026, 'val/ratio': 0.9964209198951721, 'val/ratio_var': 1.4187166925694328e-05, 'val/num_eos_tokens': 0, 'lr': 3.9539441450269476e-05, 'episode': 1712, 'epoch': 0.21}



                                                                       
 21%|██        | 429/2041 [37:10<2:18:17,  5.15s/it]

{'eps': 0, 'objective/kl': 37.00859832763672, 'objective/entropy': 8.918764114379883, 'objective/non_score_reward': -1.8504300117492676, 'objective/rlhf_reward': -0.8107852935791016, 'objective/scores': 1.039644718170166, 'policy/approxkl_avg': 0.005580013617873192, 'policy/clipfrac_avg': 0.028301887214183807, 'loss/policy_avg': -0.00150823756121099, 'loss/value_avg': 0.0161565113812685, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1560496985912323, 'val/ratio': 1.0007333755493164, 'val/ratio_var': 2.790725091017521e-07, 'val/num_eos_tokens': 0, 'lr': 3.9514943655071044e-05, 'episode': 1716, 'epoch': 0.21}



                                                                       
 21%|██        | 430/2041 [37:15<2:17:18,  5.11s/it]

{'eps': 0, 'objective/kl': 36.72679138183594, 'objective/entropy': 10.18045425415039, 'objective/non_score_reward': -1.8363394737243652, 'objective/rlhf_reward': -0.65619957447052, 'objective/scores': 1.1801398992538452, 'policy/approxkl_avg': 0.003540180390700698, 'policy/clipfrac_avg': 0.022405661642551422, 'loss/policy_avg': -0.005754268728196621, 'loss/value_avg': 0.009156323969364166, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1446051299571991, 'val/ratio': 1.002992868423462, 'val/ratio_var': 8.13587485026801e-06, 'val/num_eos_tokens': 0, 'lr': 3.949044585987262e-05, 'episode': 1720, 'epoch': 0.21}



                                                                       
 21%|██        | 431/2041 [37:20<2:16:52,  5.10s/it]

{'eps': 0, 'objective/kl': 36.3935546875, 'objective/entropy': 7.647392749786377, 'objective/non_score_reward': -1.8196778297424316, 'objective/rlhf_reward': -0.8767080307006836, 'objective/scores': 0.942969799041748, 'policy/approxkl_avg': 0.000989320920780301, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.001172565040178597, 'loss/value_avg': 0.015407516621053219, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13257768750190735, 'val/ratio': 1.0003089904785156, 'val/ratio_var': 9.190896577138119e-08, 'val/num_eos_tokens': 0, 'lr': 3.946594806467418e-05, 'episode': 1724, 'epoch': 0.21}



                                                                       
 21%|██        | 432/2041 [37:25<2:17:20,  5.12s/it]

{'eps': 0, 'objective/kl': 36.78181457519531, 'objective/entropy': 5.00874662399292, 'objective/non_score_reward': -1.8390908241271973, 'objective/rlhf_reward': -0.7768378257751465, 'objective/scores': 1.0622529983520508, 'policy/approxkl_avg': 0.0014529203763231635, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.0031392578966915607, 'loss/value_avg': 0.009898686781525612, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09651586413383484, 'val/ratio': 0.9963797330856323, 'val/ratio_var': 1.2443510058801621e-05, 'val/num_eos_tokens': 0, 'lr': 3.944145026947575e-05, 'episode': 1728, 'epoch': 0.21}



                                                                       
 21%|██        | 433/2041 [37:30<2:17:18,  5.12s/it]

{'eps': 0, 'objective/kl': 35.10672378540039, 'objective/entropy': 4.9860520362854, 'objective/non_score_reward': -1.755336046218872, 'objective/rlhf_reward': -0.5856976509094238, 'objective/scores': 1.1696383953094482, 'policy/approxkl_avg': 0.0013807108625769615, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.0006951582618057728, 'loss/value_avg': 0.007396368309855461, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11134202033281326, 'val/ratio': 1.0006612539291382, 'val/ratio_var': 4.5257152692101954e-07, 'val/num_eos_tokens': 0, 'lr': 3.9416952474277316e-05, 'episode': 1732, 'epoch': 0.21}



                                                                       
 21%|██▏       | 434/2041 [37:35<2:18:02,  5.15s/it]

{'eps': 0, 'objective/kl': 35.95698547363281, 'objective/entropy': 5.181053161621094, 'objective/non_score_reward': -1.797849416732788, 'objective/rlhf_reward': -0.6429586410522461, 'objective/scores': 1.154890775680542, 'policy/approxkl_avg': 9.08914880710654e-05, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.0009881730657070875, 'loss/value_avg': 0.010373424738645554, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.10364130884408951, 'val/ratio': 0.9998934268951416, 'val/ratio_var': 3.6985824181101634e-08, 'val/num_eos_tokens': 0, 'lr': 3.9392454679078884e-05, 'episode': 1736, 'epoch': 0.21}



                                                                       
 21%|██▏       | 435/2041 [37:41<2:18:10,  5.16s/it]

{'eps': 0, 'objective/kl': 34.65716552734375, 'objective/entropy': 5.040310859680176, 'objective/non_score_reward': -1.7328581809997559, 'objective/rlhf_reward': -0.6818900108337402, 'objective/scores': 1.0509681701660156, 'policy/approxkl_avg': 0.00015221131616272032, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.0003614632587414235, 'loss/value_avg': 0.006538149900734425, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12750937044620514, 'val/ratio': 1.0005961656570435, 'val/ratio_var': 2.7748396291826793e-07, 'val/num_eos_tokens': 0, 'lr': 3.936795688388045e-05, 'episode': 1740, 'epoch': 0.21}



                                                                       
 21%|██▏       | 436/2041 [37:46<2:17:30,  5.14s/it]

{'eps': 0, 'objective/kl': 34.838623046875, 'objective/entropy': 6.509653568267822, 'objective/non_score_reward': -1.7419312000274658, 'objective/rlhf_reward': -0.7158517837524414, 'objective/scores': 1.0260794162750244, 'policy/approxkl_avg': 0.0021385974250733852, 'policy/clipfrac_avg': 0.008254717104136944, 'loss/policy_avg': -0.007661105133593082, 'loss/value_avg': 0.012464850209653378, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12381790578365326, 'val/ratio': 1.0007717609405518, 'val/ratio_var': 5.137659968568187e-07, 'val/num_eos_tokens': 0, 'lr': 3.934345908868202e-05, 'episode': 1744, 'epoch': 0.21}



                                                                       
 21%|██▏       | 437/2041 [37:51<2:18:25,  5.18s/it]

{'eps': 0, 'objective/kl': 38.67755126953125, 'objective/entropy': 5.272769927978516, 'objective/non_score_reward': -1.9338772296905518, 'objective/rlhf_reward': -0.9557029008865356, 'objective/scores': 0.9781743288040161, 'policy/approxkl_avg': 0.0027134977281093597, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.003984374459832907, 'loss/value_avg': 0.010694246739149094, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1051044762134552, 'val/ratio': 0.9972521662712097, 'val/ratio_var': 5.0978469516849145e-06, 'val/num_eos_tokens': 0, 'lr': 3.931896129348359e-05, 'episode': 1748, 'epoch': 0.21}



                                                                       
 21%|██▏       | 438/2041 [37:56<2:17:51,  5.16s/it]

{'eps': 0, 'objective/kl': 35.49275207519531, 'objective/entropy': 5.092913627624512, 'objective/non_score_reward': -1.7746376991271973, 'objective/rlhf_reward': -0.7139343023300171, 'objective/scores': 1.0607033967971802, 'policy/approxkl_avg': 0.001256373361684382, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': 0.00025322585133835673, 'loss/value_avg': 0.010712089948356152, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12934550642967224, 'val/ratio': 0.998382568359375, 'val/ratio_var': 1.6932968947003246e-06, 'val/num_eos_tokens': 0, 'lr': 3.9294463498285156e-05, 'episode': 1752, 'epoch': 0.21}



                                                                       
 22%|██▏       | 439/2041 [38:01<2:17:35,  5.15s/it]

{'eps': 0, 'objective/kl': 34.86846923828125, 'objective/entropy': 6.217060565948486, 'objective/non_score_reward': -1.7434234619140625, 'objective/rlhf_reward': -0.6252622604370117, 'objective/scores': 1.1181612014770508, 'policy/approxkl_avg': 0.0001263010926777497, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.0012894300743937492, 'loss/value_avg': 0.008490649051964283, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12971848249435425, 'val/ratio': 0.999417781829834, 'val/ratio_var': 2.0231455266639387e-07, 'val/num_eos_tokens': 0, 'lr': 3.9269965703086724e-05, 'episode': 1756, 'epoch': 0.22}



                                                                       
 22%|██▏       | 440/2041 [38:06<2:16:53,  5.13s/it]

{'eps': 0, 'objective/kl': 34.617034912109375, 'objective/entropy': 4.789279937744141, 'objective/non_score_reward': -1.730851650238037, 'objective/rlhf_reward': -0.5717403888702393, 'objective/scores': 1.1591112613677979, 'policy/approxkl_avg': 6.217147165443748e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0005505242152139544, 'loss/value_avg': 0.010329599492251873, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12869638204574585, 'val/ratio': 0.9998897314071655, 'val/ratio_var': 1.393177484487751e-07, 'val/num_eos_tokens': 0, 'lr': 3.924546790788829e-05, 'episode': 1760, 'epoch': 0.22}



                                                                       
 22%|██▏       | 441/2041 [38:11<2:16:32,  5.12s/it]

{'eps': 0, 'objective/kl': 36.137088775634766, 'objective/entropy': 6.232849597930908, 'objective/non_score_reward': -1.806854486465454, 'objective/rlhf_reward': -0.5725045204162598, 'objective/scores': 1.2343499660491943, 'policy/approxkl_avg': 7.224199362099171e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 1.1310265108477324e-05, 'loss/value_avg': 0.01201072707772255, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13021284341812134, 'val/ratio': 1.0003187656402588, 'val/ratio_var': 2.8035597665621026e-07, 'val/num_eos_tokens': 0, 'lr': 3.922097011268986e-05, 'episode': 1764, 'epoch': 0.22}



                                                                       
 22%|██▏       | 442/2041 [38:16<2:16:34,  5.12s/it]

{'eps': 0, 'objective/kl': 34.85038757324219, 'objective/entropy': 6.383182525634766, 'objective/non_score_reward': -1.7425192594528198, 'objective/rlhf_reward': -0.509832501411438, 'objective/scores': 1.2326867580413818, 'policy/approxkl_avg': 0.0001251016219612211, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.00029969552997499704, 'loss/value_avg': 0.01080649346113205, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09976986795663834, 'val/ratio': 0.9996429085731506, 'val/ratio_var': 6.239365291094146e-08, 'val/num_eos_tokens': 0, 'lr': 3.919647231749143e-05, 'episode': 1768, 'epoch': 0.22}



                                                                       
 22%|██▏       | 443/2041 [38:22<2:16:39,  5.13s/it]

{'eps': 0, 'objective/kl': 35.538299560546875, 'objective/entropy': 5.400496006011963, 'objective/non_score_reward': -1.7769148349761963, 'objective/rlhf_reward': -0.4528226852416992, 'objective/scores': 1.324092149734497, 'policy/approxkl_avg': 0.001939658890478313, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.0015975611750036478, 'loss/value_avg': 0.012015147134661674, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14073196053504944, 'val/ratio': 1.0012156963348389, 'val/ratio_var': 1.1754378874684335e-06, 'val/num_eos_tokens': 0, 'lr': 3.9171974522292996e-05, 'episode': 1772, 'epoch': 0.22}



                                                                       
 22%|██▏       | 444/2041 [38:27<2:17:06,  5.15s/it]

{'eps': 0, 'objective/kl': 37.035972595214844, 'objective/entropy': 9.636624336242676, 'objective/non_score_reward': -1.8517985343933105, 'objective/rlhf_reward': -0.5780681371688843, 'objective/scores': 1.2737303972244263, 'policy/approxkl_avg': 0.012397892773151398, 'policy/clipfrac_avg': 0.024764152243733406, 'loss/policy_avg': -0.0028588001150637865, 'loss/value_avg': 0.013582568615674973, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1238914430141449, 'val/ratio': 1.028464913368225, 'val/ratio_var': 0.000360223202733323, 'val/num_eos_tokens': 0, 'lr': 3.9147476727094564e-05, 'episode': 1776, 'epoch': 0.22}



                                                                       
 22%|██▏       | 445/2041 [38:32<2:17:15,  5.16s/it]

{'eps': 0, 'objective/kl': 36.657318115234375, 'objective/entropy': 7.082225799560547, 'objective/non_score_reward': -1.8328659534454346, 'objective/rlhf_reward': -0.5725216865539551, 'objective/scores': 1.2603442668914795, 'policy/approxkl_avg': 0.00023174460511654615, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.0011142712319269776, 'loss/value_avg': 0.011902077123522758, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11690689623355865, 'val/ratio': 1.0015716552734375, 'val/ratio_var': 2.8314591418165946e-06, 'val/num_eos_tokens': 0, 'lr': 3.912297893189613e-05, 'episode': 1780, 'epoch': 0.22}



                                                                       
 22%|██▏       | 446/2041 [38:37<2:17:11,  5.16s/it]

{'eps': 0, 'objective/kl': 36.97918701171875, 'objective/entropy': 7.906773567199707, 'objective/non_score_reward': -1.8489594459533691, 'objective/rlhf_reward': -0.5678918361663818, 'objective/scores': 1.2810676097869873, 'policy/approxkl_avg': 0.0003164416993968189, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.0007617704104632139, 'loss/value_avg': 0.010211487300693989, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12276852130889893, 'val/ratio': 0.9995870590209961, 'val/ratio_var': 6.90240540279774e-07, 'val/num_eos_tokens': 0, 'lr': 3.90984811366977e-05, 'episode': 1784, 'epoch': 0.22}



                                                                       
 22%|██▏       | 447/2041 [38:42<2:16:53,  5.15s/it]

{'eps': 0, 'objective/kl': 34.751304626464844, 'objective/entropy': 7.957601070404053, 'objective/non_score_reward': -1.737565040588379, 'objective/rlhf_reward': -0.4760187864303589, 'objective/scores': 1.26154625415802, 'policy/approxkl_avg': 0.00011493818601593375, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.0006533844280056655, 'loss/value_avg': 0.011530566029250622, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14694738388061523, 'val/ratio': 0.9996029138565063, 'val/ratio_var': 4.141154477110831e-07, 'val/num_eos_tokens': 0, 'lr': 3.907398334149926e-05, 'episode': 1788, 'epoch': 0.22}



                                                                       
 22%|██▏       | 448/2041 [38:47<2:17:02,  5.16s/it]

{'eps': 0, 'objective/kl': 35.39935302734375, 'objective/entropy': 10.341744422912598, 'objective/non_score_reward': -1.7699675559997559, 'objective/rlhf_reward': -0.4186781644821167, 'objective/scores': 1.3512893915176392, 'policy/approxkl_avg': 0.010458757169544697, 'policy/clipfrac_avg': 0.025943394750356674, 'loss/policy_avg': -0.001272787805646658, 'loss/value_avg': 0.013887099921703339, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15454912185668945, 'val/ratio': 1.0321582555770874, 'val/ratio_var': 0.0008113920339383185, 'val/num_eos_tokens': 0, 'lr': 3.9049485546300837e-05, 'episode': 1792, 'epoch': 0.22}



                                                                       
 22%|██▏       | 449/2041 [38:53<2:17:29,  5.18s/it]

{'eps': 0, 'objective/kl': 34.69835662841797, 'objective/entropy': 8.410581588745117, 'objective/non_score_reward': -1.7349179983139038, 'objective/rlhf_reward': -0.407021164894104, 'objective/scores': 1.3278968334197998, 'policy/approxkl_avg': 0.005896762479096651, 'policy/clipfrac_avg': 0.03655660152435303, 'loss/policy_avg': -0.00685035390779376, 'loss/value_avg': 0.010982871055603027, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16398069262504578, 'val/ratio': 1.0102651119232178, 'val/ratio_var': 7.882680074544623e-05, 'val/num_eos_tokens': 0, 'lr': 3.9024987751102405e-05, 'episode': 1796, 'epoch': 0.22}



                                                                       
 22%|██▏       | 450/2041 [38:58<2:17:28,  5.18s/it]

{'eps': 0, 'objective/kl': 35.85706329345703, 'objective/entropy': 5.7324748039245605, 'objective/non_score_reward': -1.7928531169891357, 'objective/rlhf_reward': -0.44283151626586914, 'objective/scores': 1.3500216007232666, 'policy/approxkl_avg': 0.0016018346650525928, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': 0.0005940241389907897, 'loss/value_avg': 0.011750604957342148, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1126486212015152, 'val/ratio': 1.003314733505249, 'val/ratio_var': 5.670335667673498e-06, 'val/num_eos_tokens': 0, 'lr': 3.9000489955903966e-05, 'episode': 1800, 'epoch': 0.22}



                                                                       
 22%|██▏       | 451/2041 [39:03<2:16:40,  5.16s/it]

{'eps': 0, 'objective/kl': 34.77862548828125, 'objective/entropy': 4.7967848777771, 'objective/non_score_reward': -1.7389312982559204, 'objective/rlhf_reward': -0.43133366107940674, 'objective/scores': 1.3075976371765137, 'policy/approxkl_avg': 0.00037712359335273504, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.0009711179882287979, 'loss/value_avg': 0.008666587993502617, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09844208508729935, 'val/ratio': 1.0003658533096313, 'val/ratio_var': 1.7855664680155314e-07, 'val/num_eos_tokens': 0, 'lr': 3.897599216070554e-05, 'episode': 1804, 'epoch': 0.22}



                                                                       
 22%|██▏       | 452/2041 [39:08<2:15:59,  5.13s/it]

{'eps': 0, 'objective/kl': 35.42816162109375, 'objective/entropy': 4.122137546539307, 'objective/non_score_reward': -1.771408200263977, 'objective/rlhf_reward': -0.4478977918624878, 'objective/scores': 1.3235104084014893, 'policy/approxkl_avg': 0.00016047579993028194, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -9.834485535975546e-05, 'loss/value_avg': 0.009644496254622936, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13845966756343842, 'val/ratio': 0.9975552558898926, 'val/ratio_var': 3.965364612668054e-06, 'val/num_eos_tokens': 0, 'lr': 3.895149436550711e-05, 'episode': 1808, 'epoch': 0.22}



                                                                       
 22%|██▏       | 453/2041 [39:13<2:16:26,  5.15s/it]

{'eps': 0, 'objective/kl': 36.120391845703125, 'objective/entropy': 8.388731002807617, 'objective/non_score_reward': -1.8060197830200195, 'objective/rlhf_reward': -0.4988747835159302, 'objective/scores': 1.3071449995040894, 'policy/approxkl_avg': 0.00047642760910093784, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.0037337439134716988, 'loss/value_avg': 0.01023862510919571, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14120858907699585, 'val/ratio': 0.9981759190559387, 'val/ratio_var': 2.205118335041334e-06, 'val/num_eos_tokens': 0, 'lr': 3.892699657030867e-05, 'episode': 1812, 'epoch': 0.22}



                                                                       
 22%|██▏       | 454/2041 [39:18<2:15:50,  5.14s/it]

{'eps': 0, 'objective/kl': 36.63264846801758, 'objective/entropy': 9.478489875793457, 'objective/non_score_reward': -1.8316324949264526, 'objective/rlhf_reward': -0.3393927812576294, 'objective/scores': 1.4922397136688232, 'policy/approxkl_avg': 0.00487565528601408, 'policy/clipfrac_avg': 0.029481131583452225, 'loss/policy_avg': -0.0039108749479055405, 'loss/value_avg': 0.011351635679602623, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.10563719272613525, 'val/ratio': 0.9938746094703674, 'val/ratio_var': 2.798395144054666e-05, 'val/num_eos_tokens': 0, 'lr': 3.890249877511024e-05, 'episode': 1816, 'epoch': 0.22}



                                                                       
 22%|██▏       | 455/2041 [39:23<2:15:41,  5.13s/it]

{'eps': 0, 'objective/kl': 33.47261047363281, 'objective/entropy': 2.4065189361572266, 'objective/non_score_reward': -1.6736305952072144, 'objective/rlhf_reward': -0.3740187883377075, 'objective/scores': 1.2996118068695068, 'policy/approxkl_avg': 0.0001368813100270927, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': 0.0005169999785721302, 'loss/value_avg': 0.007443075068295002, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.08235906809568405, 'val/ratio': 1.0014865398406982, 'val/ratio_var': 1.2326785281402408e-06, 'val/num_eos_tokens': 0, 'lr': 3.887800097991181e-05, 'episode': 1820, 'epoch': 0.22}



                                                                       
 22%|██▏       | 456/2041 [39:29<2:15:38,  5.13s/it]

{'eps': 0, 'objective/kl': 36.43603515625, 'objective/entropy': 5.106734275817871, 'objective/non_score_reward': -1.8218015432357788, 'objective/rlhf_reward': -0.5056295394897461, 'objective/scores': 1.3161720037460327, 'policy/approxkl_avg': 0.00012609550321940333, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.0015103965997695923, 'loss/value_avg': 0.011970378458499908, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.08355741202831268, 'val/ratio': 1.0002318620681763, 'val/ratio_var': 3.5328948655433123e-08, 'val/num_eos_tokens': 0, 'lr': 3.885350318471338e-05, 'episode': 1824, 'epoch': 0.22}



                                                                       
 22%|██▏       | 457/2041 [39:34<2:16:01,  5.15s/it]

{'eps': 0, 'objective/kl': 34.625389099121094, 'objective/entropy': 4.787115573883057, 'objective/non_score_reward': -1.7312694787979126, 'objective/rlhf_reward': -0.37922823429107666, 'objective/scores': 1.352041244506836, 'policy/approxkl_avg': 0.0017189763020724058, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.0029515463393181562, 'loss/value_avg': 0.009450297802686691, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1041240394115448, 'val/ratio': 1.0023874044418335, 'val/ratio_var': 4.378474841360003e-06, 'val/num_eos_tokens': 0, 'lr': 3.882900538951494e-05, 'episode': 1828, 'epoch': 0.22}



                                                                       
 22%|██▏       | 458/2041 [39:39<2:15:41,  5.14s/it]

{'eps': 0, 'objective/kl': 34.497222900390625, 'objective/entropy': 6.132298469543457, 'objective/non_score_reward': -1.7248611450195312, 'objective/rlhf_reward': -0.471571683883667, 'objective/scores': 1.2532894611358643, 'policy/approxkl_avg': 0.0004010579432360828, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': 4.886792885372415e-05, 'loss/value_avg': 0.010250738821923733, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16083064675331116, 'val/ratio': 0.9984591603279114, 'val/ratio_var': 1.1771488743761438e-06, 'val/num_eos_tokens': 0, 'lr': 3.880450759431652e-05, 'episode': 1832, 'epoch': 0.22}



                                                                       
 22%|██▏       | 459/2041 [39:44<2:15:45,  5.15s/it]

{'eps': 0, 'objective/kl': 36.94560241699219, 'objective/entropy': 6.605567932128906, 'objective/non_score_reward': -1.8472800254821777, 'objective/rlhf_reward': -0.5133689641952515, 'objective/scores': 1.3339110612869263, 'policy/approxkl_avg': 0.001736143371090293, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.00236495235003531, 'loss/value_avg': 0.010836523957550526, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1255555897951126, 'val/ratio': 1.0022144317626953, 'val/ratio_var': 2.9990658276801696e-06, 'val/num_eos_tokens': 0, 'lr': 3.8780009799118085e-05, 'episode': 1836, 'epoch': 0.22}



                                                                       
 23%|██▎       | 460/2041 [39:49<2:15:37,  5.15s/it]

{'eps': 0, 'objective/kl': 33.95610046386719, 'objective/entropy': 3.0709638595581055, 'objective/non_score_reward': -1.6978049278259277, 'objective/rlhf_reward': -0.4105958938598633, 'objective/scores': 1.2872090339660645, 'policy/approxkl_avg': 2.1135196220711805e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -7.598636875627562e-05, 'loss/value_avg': 0.006874286103993654, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12509897351264954, 'val/ratio': 1.0012011528015137, 'val/ratio_var': 1.074710780812893e-06, 'val/num_eos_tokens': 0, 'lr': 3.8755512003919646e-05, 'episode': 1840, 'epoch': 0.23}



                                                                       
 23%|██▎       | 461/2041 [39:54<2:15:16,  5.14s/it]

{'eps': 0, 'objective/kl': 33.89778137207031, 'objective/entropy': 5.9226908683776855, 'objective/non_score_reward': -1.6948888301849365, 'objective/rlhf_reward': -0.28489160537719727, 'objective/scores': 1.4099972248077393, 'policy/approxkl_avg': 0.0042875986546278, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.0010304790921509266, 'loss/value_avg': 0.008672455325722694, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.10058677196502686, 'val/ratio': 1.0134005546569824, 'val/ratio_var': 0.00013916326861362904, 'val/num_eos_tokens': 0, 'lr': 3.8731014208721214e-05, 'episode': 1844, 'epoch': 0.23}



                                                                       
 23%|██▎       | 462/2041 [39:59<2:14:28,  5.11s/it]

{'eps': 0, 'objective/kl': 34.333740234375, 'objective/entropy': 5.880904674530029, 'objective/non_score_reward': -1.7166869640350342, 'objective/rlhf_reward': -0.3474644422531128, 'objective/scores': 1.3692225217819214, 'policy/approxkl_avg': 0.0032894036266952753, 'policy/clipfrac_avg': 0.022405661642551422, 'loss/policy_avg': -0.004928647540509701, 'loss/value_avg': 0.009226512163877487, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13916704058647156, 'val/ratio': 0.9990109205245972, 'val/ratio_var': 7.974927598297654e-07, 'val/num_eos_tokens': 0, 'lr': 3.870651641352279e-05, 'episode': 1848, 'epoch': 0.23}



                                                                       
 23%|██▎       | 463/2041 [40:05<2:15:26,  5.15s/it]

{'eps': 0, 'objective/kl': 36.79588317871094, 'objective/entropy': 9.465005874633789, 'objective/non_score_reward': -1.8397942781448364, 'objective/rlhf_reward': -0.35110557079315186, 'objective/scores': 1.4886887073516846, 'policy/approxkl_avg': 0.0015420508570969105, 'policy/clipfrac_avg': 0.016509434208273888, 'loss/policy_avg': -2.3729957320028916e-05, 'loss/value_avg': 0.01200881414115429, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16568630933761597, 'val/ratio': 1.002474308013916, 'val/ratio_var': 2.9603836537717143e-06, 'val/num_eos_tokens': 0, 'lr': 3.868201861832435e-05, 'episode': 1852, 'epoch': 0.23}



                                                                       
 23%|██▎       | 464/2041 [40:10<2:15:48,  5.17s/it]

{'eps': 0, 'objective/kl': 35.94561767578125, 'objective/entropy': 7.419835090637207, 'objective/non_score_reward': -1.79728102684021, 'objective/rlhf_reward': -0.3648562431335449, 'objective/scores': 1.432424783706665, 'policy/approxkl_avg': 0.0013540564104914665, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.0007232509087771177, 'loss/value_avg': 0.013419822789728642, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15427902340888977, 'val/ratio': 1.0005629062652588, 'val/ratio_var': 1.6689033088823635e-07, 'val/num_eos_tokens': 0, 'lr': 3.865752082312592e-05, 'episode': 1856, 'epoch': 0.23}



                                                                       
 23%|██▎       | 465/2041 [40:15<2:14:49,  5.13s/it]

{'eps': 0, 'objective/kl': 37.15657043457031, 'objective/entropy': 11.559953689575195, 'objective/non_score_reward': -1.8578283786773682, 'objective/rlhf_reward': -0.3810077905654907, 'objective/scores': 1.4768205881118774, 'policy/approxkl_avg': 0.0013076835311949253, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.002405709121376276, 'loss/value_avg': 0.010598327964544296, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1611708402633667, 'val/ratio': 1.0073919296264648, 'val/ratio_var': 4.043889566673897e-05, 'val/num_eos_tokens': 0, 'lr': 3.8633023027927487e-05, 'episode': 1860, 'epoch': 0.23}



                                                                       
 23%|██▎       | 466/2041 [40:20<2:15:21,  5.16s/it]

{'eps': 0, 'objective/kl': 34.80417251586914, 'objective/entropy': 7.024231433868408, 'objective/non_score_reward': -1.7402085065841675, 'objective/rlhf_reward': -0.23897027969360352, 'objective/scores': 1.501238226890564, 'policy/approxkl_avg': 0.001044180360622704, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': 0.00023765383230056614, 'loss/value_avg': 0.013415519148111343, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13190877437591553, 'val/ratio': 0.9995740652084351, 'val/ratio_var': 2.18417184782993e-07, 'val/num_eos_tokens': 0, 'lr': 3.8608525232729055e-05, 'episode': 1864, 'epoch': 0.23}



                                                                       
 23%|██▎       | 467/2041 [40:25<2:15:20,  5.16s/it]

{'eps': 0, 'objective/kl': 34.81312561035156, 'objective/entropy': 6.1764421463012695, 'objective/non_score_reward': -1.7406563758850098, 'objective/rlhf_reward': -0.34918344020843506, 'objective/scores': 1.3914729356765747, 'policy/approxkl_avg': 5.337467882782221e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.00037326395977288485, 'loss/value_avg': 0.008099433034658432, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16320839524269104, 'val/ratio': 0.9994978904724121, 'val/ratio_var': 4.612701047790324e-07, 'val/num_eos_tokens': 0, 'lr': 3.858402743753062e-05, 'episode': 1868, 'epoch': 0.23}



                                                                       
 23%|██▎       | 468/2041 [40:30<2:15:27,  5.17s/it]

{'eps': 0, 'objective/kl': 35.43080520629883, 'objective/entropy': 6.697603702545166, 'objective/non_score_reward': -1.7715402841567993, 'objective/rlhf_reward': -0.32465291023254395, 'objective/scores': 1.4468873739242554, 'policy/approxkl_avg': 0.0010385760106146336, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.002044905908405781, 'loss/value_avg': 0.010190901346504688, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1413971483707428, 'val/ratio': 0.9980481863021851, 'val/ratio_var': 2.1455605292430846e-06, 'val/num_eos_tokens': 0, 'lr': 3.855952964233219e-05, 'episode': 1872, 'epoch': 0.23}



                                                                       
 23%|██▎       | 469/2041 [40:36<2:15:12,  5.16s/it]

{'eps': 0, 'objective/kl': 37.07958984375, 'objective/entropy': 6.039985179901123, 'objective/non_score_reward': -1.8539793491363525, 'objective/rlhf_reward': -0.36056482791900635, 'objective/scores': 1.4934145212173462, 'policy/approxkl_avg': 0.0004673945077229291, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.0006076570134609938, 'loss/value_avg': 0.008243294432759285, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1189933568239212, 'val/ratio': 1.0002446174621582, 'val/ratio_var': 3.9937702922543394e-08, 'val/num_eos_tokens': 0, 'lr': 3.8535031847133766e-05, 'episode': 1876, 'epoch': 0.23}



                                                                       
 23%|██▎       | 470/2041 [40:41<2:15:15,  5.17s/it]

{'eps': 0, 'objective/kl': 36.42182159423828, 'objective/entropy': 6.465118408203125, 'objective/non_score_reward': -1.8210911750793457, 'objective/rlhf_reward': -0.32337188720703125, 'objective/scores': 1.4977192878723145, 'policy/approxkl_avg': 0.0013604828855022788, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.0013426976511254907, 'loss/value_avg': 0.011339357122778893, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.08435668051242828, 'val/ratio': 0.9961725473403931, 'val/ratio_var': 1.0979812032019254e-05, 'val/num_eos_tokens': 0, 'lr': 3.851053405193533e-05, 'episode': 1880, 'epoch': 0.23}



                                                                       
 23%|██▎       | 471/2041 [40:46<2:14:57,  5.16s/it]

{'eps': 0, 'objective/kl': 34.74732971191406, 'objective/entropy': 5.078463077545166, 'objective/non_score_reward': -1.7373664379119873, 'objective/rlhf_reward': -0.21403956413269043, 'objective/scores': 1.5233268737792969, 'policy/approxkl_avg': 0.006749691441655159, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.0037900907918810844, 'loss/value_avg': 0.010326125659048557, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.10187461227178574, 'val/ratio': 0.9978722333908081, 'val/ratio_var': 2.0535114799713483e-06, 'val/num_eos_tokens': 0, 'lr': 3.8486036256736895e-05, 'episode': 1884, 'epoch': 0.23}



                                                                       
 23%|██▎       | 472/2041 [40:51<2:15:13,  5.17s/it]

{'eps': 0, 'objective/kl': 33.54639434814453, 'objective/entropy': 6.616533279418945, 'objective/non_score_reward': -1.6773196458816528, 'objective/rlhf_reward': -0.18740642070770264, 'objective/scores': 1.4899132251739502, 'policy/approxkl_avg': 0.004932714160531759, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.004190119914710522, 'loss/value_avg': 0.01530889980494976, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14464761316776276, 'val/ratio': 1.0105035305023193, 'val/ratio_var': 8.556068496545777e-05, 'val/num_eos_tokens': 0, 'lr': 3.846153846153846e-05, 'episode': 1888, 'epoch': 0.23}



                                                                       
 23%|██▎       | 473/2041 [40:56<2:15:34,  5.19s/it]

{'eps': 0, 'objective/kl': 38.800025939941406, 'objective/entropy': 7.4663872718811035, 'objective/non_score_reward': -1.9400014877319336, 'objective/rlhf_reward': -0.43751096725463867, 'objective/scores': 1.502490520477295, 'policy/approxkl_avg': 0.013245790265500546, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.002945313462987542, 'loss/value_avg': 0.012508325278759003, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.18889647722244263, 'val/ratio': 0.99836665391922, 'val/ratio_var': 2.460076984789339e-06, 'val/num_eos_tokens': 0, 'lr': 3.843704066634003e-05, 'episode': 1892, 'epoch': 0.23}



                                                                       
 23%|██▎       | 474/2041 [41:02<2:15:45,  5.20s/it]

{'eps': 0, 'objective/kl': 35.2213020324707, 'objective/entropy': 7.226543426513672, 'objective/non_score_reward': -1.7610652446746826, 'objective/rlhf_reward': -0.18810880184173584, 'objective/scores': 1.5729564428329468, 'policy/approxkl_avg': 0.0011936096707358956, 'policy/clipfrac_avg': 0.008254717104136944, 'loss/policy_avg': -0.0033923794981092215, 'loss/value_avg': 0.012462278828024864, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1525009721517563, 'val/ratio': 1.002636432647705, 'val/ratio_var': 7.082932370394701e-06, 'val/num_eos_tokens': 0, 'lr': 3.84125428711416e-05, 'episode': 1896, 'epoch': 0.23}



                                                                       
 23%|██▎       | 475/2041 [41:07<2:15:33,  5.19s/it]

{'eps': 0, 'objective/kl': 35.90515899658203, 'objective/entropy': 10.505683898925781, 'objective/non_score_reward': -1.7952580451965332, 'objective/rlhf_reward': -0.2345597743988037, 'objective/scores': 1.5606982707977295, 'policy/approxkl_avg': 0.0003866670886054635, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.0013359093572944403, 'loss/value_avg': 0.010412832722067833, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.18235337734222412, 'val/ratio': 0.9999998211860657, 'val/ratio_var': 2.098976921161011e-07, 'val/num_eos_tokens': 0, 'lr': 3.838804507594317e-05, 'episode': 1900, 'epoch': 0.23}



                                                                       
 23%|██▎       | 476/2041 [41:12<2:14:43,  5.17s/it]

{'eps': 0, 'objective/kl': 37.34115982055664, 'objective/entropy': 9.095707893371582, 'objective/non_score_reward': -1.8670580387115479, 'objective/rlhf_reward': -0.2299036979675293, 'objective/scores': 1.6371543407440186, 'policy/approxkl_avg': 0.0011906472500413656, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.0009887313935905695, 'loss/value_avg': 0.01114637777209282, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16074547171592712, 'val/ratio': 0.9997050762176514, 'val/ratio_var': 7.015379566155389e-08, 'val/num_eos_tokens': 0, 'lr': 3.8363547280744735e-05, 'episode': 1904, 'epoch': 0.23}



                                                                       
 23%|██▎       | 477/2041 [41:17<2:14:58,  5.18s/it]

{'eps': 0, 'objective/kl': 36.846675872802734, 'objective/entropy': 9.284320831298828, 'objective/non_score_reward': -1.8423337936401367, 'objective/rlhf_reward': -0.2980053424835205, 'objective/scores': 1.5443284511566162, 'policy/approxkl_avg': 0.011042204685509205, 'policy/clipfrac_avg': 0.028301887214183807, 'loss/policy_avg': -0.005239504389464855, 'loss/value_avg': 0.014430776238441467, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.17240849137306213, 'val/ratio': 0.9998045563697815, 'val/ratio_var': 2.1771820684080012e-06, 'val/num_eos_tokens': 0, 'lr': 3.83390494855463e-05, 'episode': 1908, 'epoch': 0.23}



                                                                       
 23%|██▎       | 478/2041 [41:22<2:14:55,  5.18s/it]

{'eps': 0, 'objective/kl': 37.61534881591797, 'objective/entropy': 10.292474746704102, 'objective/non_score_reward': -1.8807673454284668, 'objective/rlhf_reward': -0.3433070182800293, 'objective/scores': 1.5374603271484375, 'policy/approxkl_avg': 0.0029439725913107395, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.005006688646972179, 'loss/value_avg': 0.015501563437283039, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.175001323223114, 'val/ratio': 0.9992427825927734, 'val/ratio_var': 8.336494374816539e-07, 'val/num_eos_tokens': 0, 'lr': 3.831455169034787e-05, 'episode': 1912, 'epoch': 0.23}



                                                                       
 23%|██▎       | 479/2041 [41:27<2:14:21,  5.16s/it]

{'eps': 0, 'objective/kl': 36.3587646484375, 'objective/entropy': 8.054643630981445, 'objective/non_score_reward': -1.8179383277893066, 'objective/rlhf_reward': -0.19193744659423828, 'objective/scores': 1.6260008811950684, 'policy/approxkl_avg': 0.00018092455866280943, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.002222107257694006, 'loss/value_avg': 0.011662080883979797, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.18214038014411926, 'val/ratio': 0.9978842735290527, 'val/ratio_var': 3.7063407489767997e-06, 'val/num_eos_tokens': 0, 'lr': 3.829005389514944e-05, 'episode': 1916, 'epoch': 0.23}



                                                                       
 24%|██▎       | 480/2041 [41:32<2:13:49,  5.14s/it]

{'eps': 0, 'objective/kl': 35.31098175048828, 'objective/entropy': 8.42389965057373, 'objective/non_score_reward': -1.7655491828918457, 'objective/rlhf_reward': -0.23868238925933838, 'objective/scores': 1.5268667936325073, 'policy/approxkl_avg': 0.00110913859680295, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.0052537210285663605, 'loss/value_avg': 0.012926981784403324, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.22433000802993774, 'val/ratio': 0.9930940866470337, 'val/ratio_var': 4.7471094148932025e-05, 'val/num_eos_tokens': 0, 'lr': 3.826555609995101e-05, 'episode': 1920, 'epoch': 0.24}



                                                                       
 24%|██▎       | 481/2041 [41:38<2:14:26,  5.17s/it]

{'eps': 0, 'objective/kl': 37.18132400512695, 'objective/entropy': 14.106300354003906, 'objective/non_score_reward': -1.8590660095214844, 'objective/rlhf_reward': -0.5647134780883789, 'objective/scores': 1.2943525314331055, 'policy/approxkl_avg': 0.001243955921381712, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.003103150986135006, 'loss/value_avg': 0.029036827385425568, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2604437470436096, 'val/ratio': 0.9983715415000916, 'val/ratio_var': 1.2680956160693313e-06, 'val/num_eos_tokens': 0, 'lr': 3.8241058304752575e-05, 'episode': 1924, 'epoch': 0.24}



                                                                       
 24%|██▎       | 482/2041 [41:43<2:14:41,  5.18s/it]

{'eps': 0, 'objective/kl': 37.5682373046875, 'objective/entropy': 12.928592681884766, 'objective/non_score_reward': -1.8784115314483643, 'objective/rlhf_reward': -0.7789117097854614, 'objective/scores': 1.0994998216629028, 'policy/approxkl_avg': 0.021914664655923843, 'policy/clipfrac_avg': 0.0837264135479927, 'loss/policy_avg': -0.010045016184449196, 'loss/value_avg': 0.039337486028671265, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.17513403296470642, 'val/ratio': 0.9998980760574341, 'val/ratio_var': 7.972608727868646e-06, 'val/num_eos_tokens': 0, 'lr': 3.821656050955414e-05, 'episode': 1928, 'epoch': 0.24}



                                                                       
 24%|██▎       | 483/2041 [41:48<2:14:18,  5.17s/it]

{'eps': 0, 'objective/kl': 33.935813903808594, 'objective/entropy': 8.889392852783203, 'objective/non_score_reward': -1.6967906951904297, 'objective/rlhf_reward': -0.23413026332855225, 'objective/scores': 1.4626604318618774, 'policy/approxkl_avg': 0.0004989333683624864, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.003650473663583398, 'loss/value_avg': 0.011451772414147854, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12967979907989502, 'val/ratio': 0.9991132616996765, 'val/ratio_var': 4.392790060592233e-07, 'val/num_eos_tokens': 0, 'lr': 3.819206271435571e-05, 'episode': 1932, 'epoch': 0.24}



                                                                       
 24%|██▎       | 484/2041 [41:53<2:14:14,  5.17s/it]

{'eps': 0, 'objective/kl': 33.951568603515625, 'objective/entropy': 2.7677435874938965, 'objective/non_score_reward': -1.6975784301757812, 'objective/rlhf_reward': -0.2944601774215698, 'objective/scores': 1.4031182527542114, 'policy/approxkl_avg': 4.886944589088671e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 0.00021694290626328439, 'loss/value_avg': 0.0077477386221289635, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09264528751373291, 'val/ratio': 0.9988446831703186, 'val/ratio_var': 1.0707998399084317e-06, 'val/num_eos_tokens': 0, 'lr': 3.816756491915728e-05, 'episode': 1936, 'epoch': 0.24}



                                                                       
 24%|██▍       | 485/2041 [41:58<2:14:19,  5.18s/it]

{'eps': 0, 'objective/kl': 35.10762023925781, 'objective/entropy': 11.15107536315918, 'objective/non_score_reward': -1.7553812265396118, 'objective/rlhf_reward': -0.34796226024627686, 'objective/scores': 1.407418966293335, 'policy/approxkl_avg': 0.0008492899360135198, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.001159020117484033, 'loss/value_avg': 0.012620237655937672, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.19626036286354065, 'val/ratio': 0.9982324838638306, 'val/ratio_var': 2.5008296233863803e-06, 'val/num_eos_tokens': 0, 'lr': 3.814306712395885e-05, 'episode': 1940, 'epoch': 0.24}



                                                                       
 24%|██▍       | 486/2041 [42:04<2:14:38,  5.20s/it]

{'eps': 0, 'objective/kl': 33.98851013183594, 'objective/entropy': 5.825958728790283, 'objective/non_score_reward': -1.6994256973266602, 'objective/rlhf_reward': -0.5763925313949585, 'objective/scores': 1.1230331659317017, 'policy/approxkl_avg': 0.007900513708591461, 'policy/clipfrac_avg': 0.03773584961891174, 'loss/policy_avg': -0.008923061192035675, 'loss/value_avg': 0.01770617440342903, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1769031435251236, 'val/ratio': 0.9828298687934875, 'val/ratio_var': 0.00027288493583910167, 'val/num_eos_tokens': 0, 'lr': 3.811856932876041e-05, 'episode': 1944, 'epoch': 0.24}



                                                                       
 24%|██▍       | 487/2041 [42:09<2:13:50,  5.17s/it]

{'eps': 0, 'objective/kl': 35.821922302246094, 'objective/entropy': 8.90787124633789, 'objective/non_score_reward': -1.7910960912704468, 'objective/rlhf_reward': -0.9532195329666138, 'objective/scores': 0.837876558303833, 'policy/approxkl_avg': 0.005763529799878597, 'policy/clipfrac_avg': 0.022405661642551422, 'loss/policy_avg': -0.007935581728816032, 'loss/value_avg': 0.06107691302895546, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13974624872207642, 'val/ratio': 0.997127890586853, 'val/ratio_var': 1.4282777556218207e-05, 'val/num_eos_tokens': 0, 'lr': 3.8094071533561984e-05, 'episode': 1948, 'epoch': 0.24}



                                                                       
 24%|██▍       | 488/2041 [42:14<2:13:23,  5.15s/it]

{'eps': 0, 'objective/kl': 32.69361114501953, 'objective/entropy': 3.9570531845092773, 'objective/non_score_reward': -1.6346807479858398, 'objective/rlhf_reward': -0.47547852993011475, 'objective/scores': 1.159202218055725, 'policy/approxkl_avg': 0.005856855772435665, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.000684362486936152, 'loss/value_avg': 0.01181127317249775, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.053818944841623306, 'val/ratio': 1.003390908241272, 'val/ratio_var': 1.0740015568444505e-05, 'val/num_eos_tokens': 0, 'lr': 3.806957373836355e-05, 'episode': 1952, 'epoch': 0.24}



                                                                       
 24%|██▍       | 489/2041 [42:19<2:13:03,  5.14s/it]

{'eps': 0, 'objective/kl': 33.55562210083008, 'objective/entropy': 0.15806913375854492, 'objective/non_score_reward': -1.677781105041504, 'objective/rlhf_reward': -0.3871431350708008, 'objective/scores': 1.2906379699707031, 'policy/approxkl_avg': 4.859813316215877e-07, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 0.00013543468958232552, 'loss/value_avg': 0.0052971672266721725, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.016921978443861008, 'val/ratio': 1.0006797313690186, 'val/ratio_var': 2.844944617663714e-07, 'val/num_eos_tokens': 0, 'lr': 3.804507594316511e-05, 'episode': 1956, 'epoch': 0.24}



                                                                       
 24%|██▍       | 490/2041 [42:24<2:12:47,  5.14s/it]

{'eps': 0, 'objective/kl': 34.176483154296875, 'objective/entropy': 3.635068416595459, 'objective/non_score_reward': -1.7088240385055542, 'objective/rlhf_reward': -0.43060290813446045, 'objective/scores': 1.2782211303710938, 'policy/approxkl_avg': 0.0004192670457996428, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.0009054463007487357, 'loss/value_avg': 0.006945054978132248, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.027192121371626854, 'val/ratio': 0.9986369609832764, 'val/ratio_var': 1.1963685437876848e-06, 'val/num_eos_tokens': 0, 'lr': 3.802057814796669e-05, 'episode': 1960, 'epoch': 0.24}



                                                                       
 24%|██▍       | 491/2041 [42:29<2:12:02,  5.11s/it]

{'eps': 0, 'objective/kl': 33.95652389526367, 'objective/entropy': 0.062369346618652344, 'objective/non_score_reward': -1.6978261470794678, 'objective/rlhf_reward': -0.48845458030700684, 'objective/scores': 1.209371566772461, 'policy/approxkl_avg': 1.1384321263463448e-09, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 3.766619784073555e-06, 'loss/value_avg': 0.006038909777998924, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.009789192117750645, 'val/ratio': 1.000023365020752, 'val/ratio_var': 3.552879379586926e-10, 'val/num_eos_tokens': 0, 'lr': 3.7996080352768256e-05, 'episode': 1964, 'epoch': 0.24}



                                                                       
 24%|██▍       | 492/2041 [42:34<2:11:29,  5.09s/it]

{'eps': 0, 'objective/kl': 33.86235427856445, 'objective/entropy': 1.88380765914917, 'objective/non_score_reward': -1.6931177377700806, 'objective/rlhf_reward': -0.4862252473831177, 'objective/scores': 1.206892490386963, 'policy/approxkl_avg': 0.00014231829845812172, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.001345714321359992, 'loss/value_avg': 0.006759488023817539, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.017861222848296165, 'val/ratio': 1.000679850578308, 'val/ratio_var': 4.829454951504886e-07, 'val/num_eos_tokens': 0, 'lr': 3.797158255756982e-05, 'episode': 1968, 'epoch': 0.24}



                                                                       
 24%|██▍       | 493/2041 [42:39<2:11:36,  5.10s/it]

{'eps': 0, 'objective/kl': 34.80107116699219, 'objective/entropy': 0.09950923919677734, 'objective/non_score_reward': -1.740053415298462, 'objective/rlhf_reward': -0.5733764171600342, 'objective/scores': 1.1666769981384277, 'policy/approxkl_avg': 1.2078801603365719e-07, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -4.222258212394081e-05, 'loss/value_avg': 0.004947171080857515, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.01692814938724041, 'val/ratio': 0.9996519088745117, 'val/ratio_var': 9.036813253260334e-08, 'val/num_eos_tokens': 0, 'lr': 3.7947084762371385e-05, 'episode': 1972, 'epoch': 0.24}



                                                                       
 24%|██▍       | 494/2041 [42:44<2:11:22,  5.10s/it]

{'eps': 0, 'objective/kl': 33.62146759033203, 'objective/entropy': 0.13691139221191406, 'objective/non_score_reward': -1.6810733079910278, 'objective/rlhf_reward': -0.601253867149353, 'objective/scores': 1.0798194408416748, 'policy/approxkl_avg': 1.0443944375992942e-07, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -3.790743357967585e-05, 'loss/value_avg': 0.005336096044629812, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.021183103322982788, 'val/ratio': 0.9996773600578308, 'val/ratio_var': 7.677770241798498e-08, 'val/num_eos_tokens': 0, 'lr': 3.792258696717296e-05, 'episode': 1976, 'epoch': 0.24}



                                                                       
 24%|██▍       | 495/2041 [42:50<2:11:36,  5.11s/it]

{'eps': 0, 'objective/kl': 33.711280822753906, 'objective/entropy': 6.296942710876465, 'objective/non_score_reward': -1.6855640411376953, 'objective/rlhf_reward': -0.5102847814559937, 'objective/scores': 1.1752792596817017, 'policy/approxkl_avg': 0.00010758089774753898, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.0014824141981080174, 'loss/value_avg': 0.008270742371678352, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.06168456748127937, 'val/ratio': 1.0002553462982178, 'val/ratio_var': 3.023350458875029e-08, 'val/num_eos_tokens': 0, 'lr': 3.789808917197453e-05, 'episode': 1980, 'epoch': 0.24}



                                                                       
 24%|██▍       | 496/2041 [42:55<2:12:33,  5.15s/it]

{'eps': 0, 'objective/kl': 33.41108703613281, 'objective/entropy': 3.1146249771118164, 'objective/non_score_reward': -1.6705543994903564, 'objective/rlhf_reward': -0.4565589427947998, 'objective/scores': 1.2139954566955566, 'policy/approxkl_avg': 0.002789183985441923, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.002486063865944743, 'loss/value_avg': 0.007026854436844587, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.06523176282644272, 'val/ratio': 1.0075727701187134, 'val/ratio_var': 3.745414505829103e-05, 'val/num_eos_tokens': 0, 'lr': 3.787359137677609e-05, 'episode': 1984, 'epoch': 0.24}



                                                                       
 24%|██▍       | 497/2041 [43:00<2:12:39,  5.16s/it]

{'eps': 0, 'objective/kl': 35.809593200683594, 'objective/entropy': 4.265377521514893, 'objective/non_score_reward': -1.7904796600341797, 'objective/rlhf_reward': -0.6979466676712036, 'objective/scores': 1.092532992362976, 'policy/approxkl_avg': 0.00035168539034202695, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.0026122129056602716, 'loss/value_avg': 0.012411252595484257, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.07492780685424805, 'val/ratio': 0.9992619752883911, 'val/ratio_var': 5.731178021051164e-07, 'val/num_eos_tokens': 0, 'lr': 3.784909358157766e-05, 'episode': 1988, 'epoch': 0.24}



                                                                       
 24%|██▍       | 498/2041 [43:05<2:11:56,  5.13s/it]

{'eps': 0, 'objective/kl': 34.697608947753906, 'objective/entropy': 5.715638160705566, 'objective/non_score_reward': -1.7348804473876953, 'objective/rlhf_reward': -0.6692366600036621, 'objective/scores': 1.0656437873840332, 'policy/approxkl_avg': 0.00024813972413539886, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.0006790577317588031, 'loss/value_avg': 0.010897887870669365, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.07361875474452972, 'val/ratio': 1.0010185241699219, 'val/ratio_var': 8.904819424060406e-07, 'val/num_eos_tokens': 0, 'lr': 3.782459578637923e-05, 'episode': 1992, 'epoch': 0.24}



                                                                       
 24%|██▍       | 499/2041 [43:10<2:12:05,  5.14s/it]

{'eps': 0, 'objective/kl': 34.087440490722656, 'objective/entropy': 8.554666519165039, 'objective/non_score_reward': -1.7043719291687012, 'objective/rlhf_reward': -0.7091171741485596, 'objective/scores': 0.9952547550201416, 'policy/approxkl_avg': 0.0019592817407101393, 'policy/clipfrac_avg': 0.022405661642551422, 'loss/policy_avg': -0.004198436625301838, 'loss/value_avg': 0.015478613786399364, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1367301344871521, 'val/ratio': 0.9966399073600769, 'val/ratio_var': 1.023183995130239e-05, 'val/num_eos_tokens': 0, 'lr': 3.780009799118079e-05, 'episode': 1996, 'epoch': 0.24}



                                                                       
 24%|██▍       | 500/2041 [43:15<2:12:20,  5.15s/it]

{'eps': 0, 'objective/kl': 34.172664642333984, 'objective/entropy': 5.9922194480896, 'objective/non_score_reward': -1.708633303642273, 'objective/rlhf_reward': -0.5880531072616577, 'objective/scores': 1.1205801963806152, 'policy/approxkl_avg': 0.0012325624702498317, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.0018987171351909637, 'loss/value_avg': 0.006450776942074299, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09729006141424179, 'val/ratio': 1.001766324043274, 'val/ratio_var': 2.8957922495465027e-06, 'val/num_eos_tokens': 0, 'lr': 3.777560019598236e-05, 'episode': 2000, 'epoch': 0.25}



                                                                       
 25%|██▍       | 501/2041 [43:39<4:32:10, 10.60s/it]

{'eps': 0, 'objective/kl': 33.311119079589844, 'objective/entropy': 2.35359525680542, 'objective/non_score_reward': -1.6655558347702026, 'objective/rlhf_reward': -0.5361729860305786, 'objective/scores': 1.129382848739624, 'policy/approxkl_avg': 3.529958848957904e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 0.00030661921482533216, 'loss/value_avg': 0.005955876782536507, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.051455773413181305, 'val/ratio': 0.9998012185096741, 'val/ratio_var': 1.875529243022811e-08, 'val/num_eos_tokens': 0, 'lr': 3.7751102400783936e-05, 'episode': 2004, 'epoch': 0.25}



                                                                       
 25%|██▍       | 502/2041 [43:44<3:50:52,  9.00s/it]

{'eps': 0, 'objective/kl': 34.20262145996094, 'objective/entropy': 0.40630435943603516, 'objective/non_score_reward': -1.7101311683654785, 'objective/rlhf_reward': -0.7563470005989075, 'objective/scores': 0.953784167766571, 'policy/approxkl_avg': 4.830273496736481e-07, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 8.497260569129139e-05, 'loss/value_avg': 0.010175010189414024, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.051950909197330475, 'val/ratio': 0.9992705583572388, 'val/ratio_var': 3.6051062579645077e-07, 'val/num_eos_tokens': 0, 'lr': 3.77266046055855e-05, 'episode': 2008, 'epoch': 0.25}



                                                                       
 25%|██▍       | 503/2041 [43:49<3:21:04,  7.84s/it]

{'eps': 0, 'objective/kl': 34.95567321777344, 'objective/entropy': 7.108794212341309, 'objective/non_score_reward': -1.7477836608886719, 'objective/rlhf_reward': -0.8627489805221558, 'objective/scores': 0.8850346803665161, 'policy/approxkl_avg': 8.277423330582678e-05, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.002036730060353875, 'loss/value_avg': 0.02174309268593788, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1309507191181183, 'val/ratio': 0.9999706745147705, 'val/ratio_var': 3.913834589752696e-08, 'val/num_eos_tokens': 0, 'lr': 3.7702106810387065e-05, 'episode': 2012, 'epoch': 0.25}



                                                                       
 25%|██▍       | 504/2041 [43:54<2:59:44,  7.02s/it]

{'eps': 0, 'objective/kl': 36.97814178466797, 'objective/entropy': 6.567513465881348, 'objective/non_score_reward': -1.8489071130752563, 'objective/rlhf_reward': -0.8748787641525269, 'objective/scores': 0.9740283489227295, 'policy/approxkl_avg': 0.0033501232974231243, 'policy/clipfrac_avg': 0.028301887214183807, 'loss/policy_avg': -0.00426753144711256, 'loss/value_avg': 0.010313292033970356, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11175453662872314, 'val/ratio': 0.9999020099639893, 'val/ratio_var': 3.155937022825128e-08, 'val/num_eos_tokens': 0, 'lr': 3.7677609015188633e-05, 'episode': 2016, 'epoch': 0.25}



                                                                       
 25%|██▍       | 505/2041 [43:59<2:44:58,  6.44s/it]

{'eps': 0, 'objective/kl': 32.30818557739258, 'objective/entropy': 0.39131736755371094, 'objective/non_score_reward': -1.6154093742370605, 'objective/rlhf_reward': -0.5656269788742065, 'objective/scores': 1.049782395362854, 'policy/approxkl_avg': 1.4775011436540808e-07, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 8.263115205409122e-07, 'loss/value_avg': 0.004947013221681118, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.044452209025621414, 'val/ratio': 1.0004117488861084, 'val/ratio_var': 1.139291043728008e-07, 'val/num_eos_tokens': 0, 'lr': 3.76531112199902e-05, 'episode': 2020, 'epoch': 0.25}



                                                                       
 25%|██▍       | 506/2041 [44:04<2:34:53,  6.05s/it]

{'eps': 0, 'objective/kl': 33.05605697631836, 'objective/entropy': 0.3224172592163086, 'objective/non_score_reward': -1.6528029441833496, 'objective/rlhf_reward': -0.7838473916053772, 'objective/scores': 0.8689555525779724, 'policy/approxkl_avg': 4.700837408222469e-08, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.1002399332937784e-05, 'loss/value_avg': 0.007696786895394325, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.039302077144384384, 'val/ratio': 1.0002293586730957, 'val/ratio_var': 3.628639078101514e-08, 'val/num_eos_tokens': 0, 'lr': 3.762861342479177e-05, 'episode': 2024, 'epoch': 0.25}



                                                                       
 25%|██▍       | 507/2041 [44:10<2:27:18,  5.76s/it]

{'eps': 0, 'objective/kl': 34.012123107910156, 'objective/entropy': 4.045043468475342, 'objective/non_score_reward': -1.700606346130371, 'objective/rlhf_reward': -0.7130789160728455, 'objective/scores': 0.9875274300575256, 'policy/approxkl_avg': 0.00023836478067096323, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.0011018323712050915, 'loss/value_avg': 0.006295060273259878, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.040900733321905136, 'val/ratio': 1.0015257596969604, 'val/ratio_var': 2.9627424282807624e-06, 'val/num_eos_tokens': 0, 'lr': 3.760411562959334e-05, 'episode': 2028, 'epoch': 0.25}



                                                                       
 25%|██▍       | 508/2041 [44:15<2:22:36,  5.58s/it]

{'eps': 0, 'objective/kl': 33.79267120361328, 'objective/entropy': 2.9558424949645996, 'objective/non_score_reward': -1.6896336078643799, 'objective/rlhf_reward': -0.693996250629425, 'objective/scores': 0.9956373572349548, 'policy/approxkl_avg': 0.0005629945662803948, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.0014401578810065985, 'loss/value_avg': 0.006255016196519136, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.08104054629802704, 'val/ratio': 0.9993507862091064, 'val/ratio_var': 3.6724409824273607e-07, 'val/num_eos_tokens': 0, 'lr': 3.7579617834394906e-05, 'episode': 2032, 'epoch': 0.25}



                                                                       
 25%|██▍       | 509/2041 [44:20<2:19:12,  5.45s/it]

{'eps': 0, 'objective/kl': 34.92564392089844, 'objective/entropy': 8.038182258605957, 'objective/non_score_reward': -1.7462822198867798, 'objective/rlhf_reward': -0.86381995677948, 'objective/scores': 0.8824622631072998, 'policy/approxkl_avg': 0.0036176659632474184, 'policy/clipfrac_avg': 0.025943394750356674, 'loss/policy_avg': -0.00859208032488823, 'loss/value_avg': 0.016159117221832275, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.10929115861654282, 'val/ratio': 0.9931185245513916, 'val/ratio_var': 3.934773485525511e-05, 'val/num_eos_tokens': 0, 'lr': 3.7555120039196474e-05, 'episode': 2036, 'epoch': 0.25}



                                                                       
 25%|██▍       | 510/2041 [44:25<2:16:26,  5.35s/it]

{'eps': 0, 'objective/kl': 34.45201873779297, 'objective/entropy': 3.529684543609619, 'objective/non_score_reward': -1.7226009368896484, 'objective/rlhf_reward': -0.679013729095459, 'objective/scores': 1.0435872077941895, 'policy/approxkl_avg': 8.78683349583298e-05, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.000556592654902488, 'loss/value_avg': 0.0064484435133636, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.07560533285140991, 'val/ratio': 1.0006248950958252, 'val/ratio_var': 3.6399919167706685e-07, 'val/num_eos_tokens': 0, 'lr': 3.753062224399804e-05, 'episode': 2040, 'epoch': 0.25}



                                                                       
 25%|██▌       | 511/2041 [44:30<2:15:28,  5.31s/it]

{'eps': 0, 'objective/kl': 33.55192565917969, 'objective/entropy': 4.608309268951416, 'objective/non_score_reward': -1.6775963306427002, 'objective/rlhf_reward': -0.666359543800354, 'objective/scores': 1.0112367868423462, 'policy/approxkl_avg': 0.0014130001654848456, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.0030831401236355305, 'loss/value_avg': 0.008576808497309685, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.10264123231172562, 'val/ratio': 1.0017335414886475, 'val/ratio_var': 3.09262009068334e-06, 'val/num_eos_tokens': 0, 'lr': 3.750612444879961e-05, 'episode': 2044, 'epoch': 0.25}



                                                                       
 25%|██▌       | 512/2041 [44:35<2:14:33,  5.28s/it]

{'eps': 0, 'objective/kl': 34.858306884765625, 'objective/entropy': 6.33842658996582, 'objective/non_score_reward': -1.742915153503418, 'objective/rlhf_reward': -0.8445947766304016, 'objective/scores': 0.8983203768730164, 'policy/approxkl_avg': 0.00023052124015521258, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.0014092179480940104, 'loss/value_avg': 0.011238578706979752, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16002708673477173, 'val/ratio': 0.9983945488929749, 'val/ratio_var': 2.0709039745270275e-06, 'val/num_eos_tokens': 0, 'lr': 3.748162665360118e-05, 'episode': 2048, 'epoch': 0.25}



                                                                       
 25%|██▌       | 513/2041 [44:41<2:13:41,  5.25s/it]

{'eps': 0, 'objective/kl': 35.763671875, 'objective/entropy': 4.352887153625488, 'objective/non_score_reward': -1.7881836891174316, 'objective/rlhf_reward': -0.8772225379943848, 'objective/scores': 0.9109611511230469, 'policy/approxkl_avg': 3.435271355556324e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 0.00016168646106962115, 'loss/value_avg': 0.007854439318180084, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.129034161567688, 'val/ratio': 1.001219630241394, 'val/ratio_var': 8.891517495612788e-07, 'val/num_eos_tokens': 0, 'lr': 3.7457128858402746e-05, 'episode': 2052, 'epoch': 0.25}



                                                                       
 25%|██▌       | 514/2041 [44:46<2:12:39,  5.21s/it]

{'eps': 0, 'objective/kl': 32.83125305175781, 'objective/entropy': 11.227624893188477, 'objective/non_score_reward': -1.641562819480896, 'objective/rlhf_reward': -0.6987387537956238, 'objective/scores': 0.9428240656852722, 'policy/approxkl_avg': 0.010087191127240658, 'policy/clipfrac_avg': 0.025943396613001823, 'loss/policy_avg': -0.004167851060628891, 'loss/value_avg': 0.00823288969695568, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1200675219297409, 'val/ratio': 1.0148128271102905, 'val/ratio_var': 0.00023326494556386024, 'val/num_eos_tokens': 0, 'lr': 3.7432631063204314e-05, 'episode': 2056, 'epoch': 0.25}



                                                                       
 25%|██▌       | 515/2041 [44:51<2:12:11,  5.20s/it]

{'eps': 0, 'objective/kl': 33.85635757446289, 'objective/entropy': 4.263495445251465, 'objective/non_score_reward': -1.6928176879882812, 'objective/rlhf_reward': -0.8254477977752686, 'objective/scores': 0.8673698902130127, 'policy/approxkl_avg': 0.00044535225606523454, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.0011794153833761811, 'loss/value_avg': 0.0047387657687067986, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11736178398132324, 'val/ratio': 1.0006678104400635, 'val/ratio_var': 5.241194003247074e-07, 'val/num_eos_tokens': 0, 'lr': 3.740813326800588e-05, 'episode': 2060, 'epoch': 0.25}



                                                                       
 25%|██▌       | 516/2041 [44:56<2:11:43,  5.18s/it]

{'eps': 0, 'objective/kl': 36.04156494140625, 'objective/entropy': 5.909337520599365, 'objective/non_score_reward': -1.8020782470703125, 'objective/rlhf_reward': -0.7222352027893066, 'objective/scores': 1.0798430442810059, 'policy/approxkl_avg': 0.00039667898090556264, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': 0.00019292166689410806, 'loss/value_avg': 0.004738113842904568, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11176168918609619, 'val/ratio': 1.0015695095062256, 'val/ratio_var': 1.7380974668412819e-06, 'val/num_eos_tokens': 0, 'lr': 3.738363547280745e-05, 'episode': 2064, 'epoch': 0.25}



                                                                       
 25%|██▌       | 517/2041 [45:01<2:11:11,  5.17s/it]

{'eps': 0, 'objective/kl': 34.72949981689453, 'objective/entropy': 7.753891468048096, 'objective/non_score_reward': -1.7364749908447266, 'objective/rlhf_reward': -0.7501236200332642, 'objective/scores': 0.9863513708114624, 'policy/approxkl_avg': 0.0003913608379662037, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.001725939568132162, 'loss/value_avg': 0.007226389832794666, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13828641176223755, 'val/ratio': 0.9986073970794678, 'val/ratio_var': 1.5945201994327363e-06, 'val/num_eos_tokens': 0, 'lr': 3.735913767760902e-05, 'episode': 2068, 'epoch': 0.25}



                                                                       
 25%|██▌       | 518/2041 [45:06<2:10:52,  5.16s/it]

{'eps': 0, 'objective/kl': 33.98527526855469, 'objective/entropy': 3.868485927581787, 'objective/non_score_reward': -1.6992639303207397, 'objective/rlhf_reward': -0.7990580201148987, 'objective/scores': 0.9002059102058411, 'policy/approxkl_avg': 4.494985660130624e-06, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 7.203952554846182e-05, 'loss/value_avg': 0.007644373457878828, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12711390852928162, 'val/ratio': 1.000145673751831, 'val/ratio_var': 1.1598689120262407e-08, 'val/num_eos_tokens': 0, 'lr': 3.733463988241058e-05, 'episode': 2072, 'epoch': 0.25}



                                                                       
 25%|██▌       | 519/2041 [45:11<2:10:56,  5.16s/it]

{'eps': 0, 'objective/kl': 32.9500846862793, 'objective/entropy': 3.1800966262817383, 'objective/non_score_reward': -1.647504210472107, 'objective/rlhf_reward': -0.6830956339836121, 'objective/scores': 0.9644085764884949, 'policy/approxkl_avg': 8.97750724107027e-05, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.00047790538519620895, 'loss/value_avg': 0.006078999489545822, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15250568091869354, 'val/ratio': 0.9980239868164062, 'val/ratio_var': 4.365849235909991e-06, 'val/num_eos_tokens': 0, 'lr': 3.7310142087212154e-05, 'episode': 2076, 'epoch': 0.25}



                                                                       
 25%|██▌       | 520/2041 [45:17<2:10:39,  5.15s/it]

{'eps': 0, 'objective/kl': 35.106788635253906, 'objective/entropy': 4.6729207038879395, 'objective/non_score_reward': -1.7553393840789795, 'objective/rlhf_reward': -0.7655701637268066, 'objective/scores': 0.9897692203521729, 'policy/approxkl_avg': 0.0002952585928142071, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.0004433730791788548, 'loss/value_avg': 0.006239063572138548, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14052969217300415, 'val/ratio': 0.9999668002128601, 'val/ratio_var': 9.271564072044214e-10, 'val/num_eos_tokens': 0, 'lr': 3.728564429201372e-05, 'episode': 2080, 'epoch': 0.25}



                                                                       
 26%|██▌       | 521/2041 [45:22<2:10:02,  5.13s/it]

{'eps': 0, 'objective/kl': 36.5909423828125, 'objective/entropy': 8.017693519592285, 'objective/non_score_reward': -1.8295470476150513, 'objective/rlhf_reward': -0.9713495373725891, 'objective/scores': 0.8581975102424622, 'policy/approxkl_avg': 0.001486301189288497, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.002698854310438037, 'loss/value_avg': 0.022601913660764694, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15242666006088257, 'val/ratio': 0.9992386698722839, 'val/ratio_var': 2.727227013110678e-07, 'val/num_eos_tokens': 0, 'lr': 3.7261146496815283e-05, 'episode': 2084, 'epoch': 0.26}



                                                                       
 26%|██▌       | 522/2041 [45:27<2:09:59,  5.13s/it]

{'eps': 0, 'objective/kl': 36.30812072753906, 'objective/entropy': 7.7528581619262695, 'objective/non_score_reward': -1.815406322479248, 'objective/rlhf_reward': -0.8014307022094727, 'objective/scores': 1.0139756202697754, 'policy/approxkl_avg': 0.0019601071253418922, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.0017686996143311262, 'loss/value_avg': 0.009247029200196266, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11401636898517609, 'val/ratio': 0.9986706972122192, 'val/ratio_var': 9.89229192782659e-07, 'val/num_eos_tokens': 0, 'lr': 3.723664870161686e-05, 'episode': 2088, 'epoch': 0.26}



                                                                       
 26%|██▌       | 523/2041 [45:32<2:09:33,  5.12s/it]

{'eps': 0, 'objective/kl': 38.73583221435547, 'objective/entropy': 10.862632751464844, 'objective/non_score_reward': -1.9367916584014893, 'objective/rlhf_reward': -0.9912553429603577, 'objective/scores': 0.9455363154411316, 'policy/approxkl_avg': 0.001681523397564888, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.0026624114252626896, 'loss/value_avg': 0.04524235427379608, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09540761262178421, 'val/ratio': 0.9967167377471924, 'val/ratio_var': 6.115877113188617e-06, 'val/num_eos_tokens': 0, 'lr': 3.7212150906418426e-05, 'episode': 2092, 'epoch': 0.26}



                                                                       
 26%|██▌       | 524/2041 [45:37<2:10:01,  5.14s/it]

{'eps': 0, 'objective/kl': 35.38988494873047, 'objective/entropy': 8.363426208496094, 'objective/non_score_reward': -1.7694942951202393, 'objective/rlhf_reward': -0.7804676294326782, 'objective/scores': 0.989026665687561, 'policy/approxkl_avg': 0.0014448277652263641, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.0014573729131370783, 'loss/value_avg': 0.00792001187801361, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11291473358869553, 'val/ratio': 1.0049526691436768, 'val/ratio_var': 2.007144757953938e-05, 'val/num_eos_tokens': 0, 'lr': 3.7187653111219994e-05, 'episode': 2096, 'epoch': 0.26}



                                                                       
 26%|██▌       | 525/2041 [45:42<2:10:00,  5.15s/it]

{'eps': 0, 'objective/kl': 35.68926239013672, 'objective/entropy': 7.6093926429748535, 'objective/non_score_reward': -1.7844632863998413, 'objective/rlhf_reward': -0.8515317440032959, 'objective/scores': 0.9329315423965454, 'policy/approxkl_avg': 0.0009100286988541484, 'policy/clipfrac_avg': 0.008254717104136944, 'loss/policy_avg': -0.003632486565038562, 'loss/value_avg': 0.009265217930078506, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13311080634593964, 'val/ratio': 1.0037643909454346, 'val/ratio_var': 8.37973493617028e-06, 'val/num_eos_tokens': 0, 'lr': 3.7163155316021556e-05, 'episode': 2100, 'epoch': 0.26}



                                                                       
 26%|██▌       | 526/2041 [45:47<2:09:50,  5.14s/it]

{'eps': 0, 'objective/kl': 35.59619903564453, 'objective/entropy': 9.705345153808594, 'objective/non_score_reward': -1.7798100709915161, 'objective/rlhf_reward': -0.9501217603683472, 'objective/scores': 0.829688310623169, 'policy/approxkl_avg': 0.001538570737466216, 'policy/clipfrac_avg': 0.016509434208273888, 'loss/policy_avg': -0.0022979644127190113, 'loss/value_avg': 0.00619310699403286, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11240081489086151, 'val/ratio': 0.9969972372055054, 'val/ratio_var': 7.703690243943129e-06, 'val/num_eos_tokens': 0, 'lr': 3.713865752082313e-05, 'episode': 2104, 'epoch': 0.26}



                                                                       
 26%|██▌       | 527/2041 [45:53<2:09:54,  5.15s/it]

{'eps': 0, 'objective/kl': 34.85700607299805, 'objective/entropy': 4.743180274963379, 'objective/non_score_reward': -1.7428503036499023, 'objective/rlhf_reward': -0.864345908164978, 'objective/scores': 0.8785043954849243, 'policy/approxkl_avg': 0.0004207585006952286, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.0013299365527927876, 'loss/value_avg': 0.0075370436534285545, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1141049712896347, 'val/ratio': 1.0000122785568237, 'val/ratio_var': 1.202320021320702e-09, 'val/num_eos_tokens': 0, 'lr': 3.71141597256247e-05, 'episode': 2108, 'epoch': 0.26}



                                                                       
 26%|██▌       | 528/2041 [45:58<2:10:01,  5.16s/it]

{'eps': 0, 'objective/kl': 35.488067626953125, 'objective/entropy': 3.959683895111084, 'objective/non_score_reward': -1.7744032144546509, 'objective/rlhf_reward': -0.8561984300613403, 'objective/scores': 0.9182047843933105, 'policy/approxkl_avg': 0.00037743078428320587, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': 0.000969981774687767, 'loss/value_avg': 0.007828325033187866, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14386408030986786, 'val/ratio': 0.9973509311676025, 'val/ratio_var': 3.871300123137189e-06, 'val/num_eos_tokens': 0, 'lr': 3.708966193042626e-05, 'episode': 2112, 'epoch': 0.26}



                                                                       
 26%|██▌       | 529/2041 [46:03<2:09:13,  5.13s/it]

{'eps': 0, 'objective/kl': 36.583038330078125, 'objective/entropy': 8.225571632385254, 'objective/non_score_reward': -1.8291518688201904, 'objective/rlhf_reward': -0.978263258934021, 'objective/scores': 0.8508886098861694, 'policy/approxkl_avg': 0.0007430922705680132, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.0014312739949673414, 'loss/value_avg': 0.009686995297670364, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12626373767852783, 'val/ratio': 0.9993679523468018, 'val/ratio_var': 2.940631986803055e-07, 'val/num_eos_tokens': 0, 'lr': 3.706516413522783e-05, 'episode': 2116, 'epoch': 0.26}



                                                                       
 26%|██▌       | 530/2041 [46:08<2:08:39,  5.11s/it]

{'eps': 0, 'objective/kl': 33.2718505859375, 'objective/entropy': 8.088338851928711, 'objective/non_score_reward': -1.6635924577713013, 'objective/rlhf_reward': -0.7161334156990051, 'objective/scores': 0.9474590420722961, 'policy/approxkl_avg': 0.024119602516293526, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.0025222604162991047, 'loss/value_avg': 0.008650826290249825, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1113612949848175, 'val/ratio': 1.1737688779830933, 'val/ratio_var': 0.048442211002111435, 'val/num_eos_tokens': 0, 'lr': 3.70406663400294e-05, 'episode': 2120, 'epoch': 0.26}



                                                                       
 26%|██▌       | 531/2041 [46:13<2:08:28,  5.10s/it]

{'eps': 0, 'objective/kl': 32.744232177734375, 'objective/entropy': 5.803069114685059, 'objective/non_score_reward': -1.637211561203003, 'objective/rlhf_reward': -0.8253799080848694, 'objective/scores': 0.8118316531181335, 'policy/approxkl_avg': 0.0001692067162366584, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': 0.000411958055337891, 'loss/value_avg': 0.004655783995985985, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1116500049829483, 'val/ratio': 0.9986926317214966, 'val/ratio_var': 1.2862728908658028e-06, 'val/num_eos_tokens': 0, 'lr': 3.7016168544830964e-05, 'episode': 2124, 'epoch': 0.26}



                                                                       
 26%|██▌       | 532/2041 [46:18<2:08:45,  5.12s/it]

{'eps': 0, 'objective/kl': 32.364173889160156, 'objective/entropy': 7.12234354019165, 'objective/non_score_reward': -1.6182085275650024, 'objective/rlhf_reward': -0.8245967626571655, 'objective/scores': 0.7936117649078369, 'policy/approxkl_avg': 0.0007180797401815653, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.00047595659270882607, 'loss/value_avg': 0.006875552237033844, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16600802540779114, 'val/ratio': 0.9981613755226135, 'val/ratio_var': 2.281511797264102e-06, 'val/num_eos_tokens': 0, 'lr': 3.699167074963253e-05, 'episode': 2128, 'epoch': 0.26}



                                                                       
 26%|██▌       | 533/2041 [46:23<2:08:38,  5.12s/it]

{'eps': 0, 'objective/kl': 35.29505920410156, 'objective/entropy': 8.103580474853516, 'objective/non_score_reward': -1.7647531032562256, 'objective/rlhf_reward': -0.8739452362060547, 'objective/scores': 0.8908078670501709, 'policy/approxkl_avg': 0.0015603184001520276, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.002288069576025009, 'loss/value_avg': 0.0067543331533670425, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11970233917236328, 'val/ratio': 1.0046720504760742, 'val/ratio_var': 1.643371069803834e-05, 'val/num_eos_tokens': 0, 'lr': 3.696717295443411e-05, 'episode': 2132, 'epoch': 0.26}



                                                                       
 26%|██▌       | 534/2041 [46:28<2:08:11,  5.10s/it]

{'eps': 0, 'objective/kl': 34.866607666015625, 'objective/entropy': 7.9316086769104, 'objective/non_score_reward': -1.743330478668213, 'objective/rlhf_reward': -0.8005743622779846, 'objective/scores': 0.9427561163902283, 'policy/approxkl_avg': 0.007850000634789467, 'policy/clipfrac_avg': 0.022405661642551422, 'loss/policy_avg': -0.004375853575766087, 'loss/value_avg': 0.010636923834681511, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1322382539510727, 'val/ratio': 1.0125243663787842, 'val/ratio_var': 0.00010370885138399899, 'val/num_eos_tokens': 0, 'lr': 3.694267515923567e-05, 'episode': 2136, 'epoch': 0.26}



                                                                       
 26%|██▌       | 535/2041 [46:33<2:08:08,  5.10s/it]

{'eps': 0, 'objective/kl': 33.712745666503906, 'objective/entropy': 7.61580228805542, 'objective/non_score_reward': -1.685637354850769, 'objective/rlhf_reward': -0.9292248487472534, 'objective/scores': 0.7564125061035156, 'policy/approxkl_avg': 0.002340820850804448, 'policy/clipfrac_avg': 0.028301887214183807, 'loss/policy_avg': -0.006466583348810673, 'loss/value_avg': 0.010324960574507713, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13614393770694733, 'val/ratio': 1.0030193328857422, 'val/ratio_var': 6.65726292936597e-06, 'val/num_eos_tokens': 0, 'lr': 3.6918177364037236e-05, 'episode': 2140, 'epoch': 0.26}



                                                                       
 26%|██▋       | 536/2041 [46:38<2:08:20,  5.12s/it]

{'eps': 0, 'objective/kl': 33.26990509033203, 'objective/entropy': 6.885662078857422, 'objective/non_score_reward': -1.6634953022003174, 'objective/rlhf_reward': -0.9176743626594543, 'objective/scores': 0.745820939540863, 'policy/approxkl_avg': 0.0008703651838004589, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.0006036668200977147, 'loss/value_avg': 0.0055021923035383224, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15412157773971558, 'val/ratio': 0.9991214275360107, 'val/ratio_var': 1.0130404461961007e-06, 'val/num_eos_tokens': 0, 'lr': 3.6893679568838804e-05, 'episode': 2144, 'epoch': 0.26}



                                                                       
 26%|██▋       | 537/2041 [46:44<2:08:59,  5.15s/it]

{'eps': 0, 'objective/kl': 34.468711853027344, 'objective/entropy': 4.359152317047119, 'objective/non_score_reward': -1.7234355211257935, 'objective/rlhf_reward': -0.9324319362640381, 'objective/scores': 0.7910035848617554, 'policy/approxkl_avg': 2.881055661418941e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0006160148186609149, 'loss/value_avg': 0.004046447109431028, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16415464878082275, 'val/ratio': 1.000472068786621, 'val/ratio_var': 1.9610182278029242e-07, 'val/num_eos_tokens': 0, 'lr': 3.686918177364038e-05, 'episode': 2148, 'epoch': 0.26}



                                                                       
 26%|██▋       | 538/2041 [46:49<2:08:45,  5.14s/it]

{'eps': 0, 'objective/kl': 34.782833099365234, 'objective/entropy': 9.827909469604492, 'objective/non_score_reward': -1.7391417026519775, 'objective/rlhf_reward': -0.8068993091583252, 'objective/scores': 0.9322423934936523, 'policy/approxkl_avg': 0.006366096902638674, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.006867710966616869, 'loss/value_avg': 0.01230029296129942, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13401499390602112, 'val/ratio': 1.0003117322921753, 'val/ratio_var': 1.661157966736937e-06, 'val/num_eos_tokens': 0, 'lr': 3.684468397844194e-05, 'episode': 2152, 'epoch': 0.26}



                                                                       
 26%|██▋       | 539/2041 [46:54<2:09:01,  5.15s/it]

{'eps': 0, 'objective/kl': 35.06435775756836, 'objective/entropy': 6.799308776855469, 'objective/non_score_reward': -1.7532179355621338, 'objective/rlhf_reward': -0.8760123252868652, 'objective/scores': 0.8772056102752686, 'policy/approxkl_avg': 0.002439033007249236, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.002281286520883441, 'loss/value_avg': 0.004712609574198723, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1828572154045105, 'val/ratio': 0.9942703247070312, 'val/ratio_var': 2.9893932151026092e-05, 'val/num_eos_tokens': 0, 'lr': 3.682018618324351e-05, 'episode': 2156, 'epoch': 0.26}



                                                                       
 26%|██▋       | 540/2041 [46:59<2:08:42,  5.14s/it]

{'eps': 0, 'objective/kl': 37.96814727783203, 'objective/entropy': 9.452127456665039, 'objective/non_score_reward': -1.8984076976776123, 'objective/rlhf_reward': -1.2067360877990723, 'objective/scores': 0.69167160987854, 'policy/approxkl_avg': 0.004395592492073774, 'policy/clipfrac_avg': 0.029481131583452225, 'loss/policy_avg': -0.00535338930785656, 'loss/value_avg': 0.014257694594562054, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.18413640558719635, 'val/ratio': 1.0028941631317139, 'val/ratio_var': 4.927374902763404e-06, 'val/num_eos_tokens': 0, 'lr': 3.679568838804508e-05, 'episode': 2160, 'epoch': 0.26}



                                                                       
 27%|██▋       | 541/2041 [47:04<2:08:30,  5.14s/it]

{'eps': 0, 'objective/kl': 36.25645446777344, 'objective/entropy': 10.36621379852295, 'objective/non_score_reward': -1.812822699546814, 'objective/rlhf_reward': -0.9008036255836487, 'objective/scores': 0.9120190739631653, 'policy/approxkl_avg': 0.0005420996458269656, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': 0.0006453392561525106, 'loss/value_avg': 0.006661955267190933, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.20686478912830353, 'val/ratio': 1.0044898986816406, 'val/ratio_var': 1.3094261703372467e-05, 'val/num_eos_tokens': 0, 'lr': 3.6771190592846644e-05, 'episode': 2164, 'epoch': 0.27}



                                                                       
 27%|██▋       | 542/2041 [47:09<2:08:13,  5.13s/it]

{'eps': 0, 'objective/kl': 37.32088088989258, 'objective/entropy': 10.609855651855469, 'objective/non_score_reward': -1.866044044494629, 'objective/rlhf_reward': -1.1413825750350952, 'objective/scores': 0.7246614694595337, 'policy/approxkl_avg': 0.00464628916233778, 'policy/clipfrac_avg': 0.022405661642551422, 'loss/policy_avg': -0.005629265680909157, 'loss/value_avg': 0.01272624358534813, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.17481040954589844, 'val/ratio': 1.0014369487762451, 'val/ratio_var': 2.769086449916358e-06, 'val/num_eos_tokens': 0, 'lr': 3.674669279764821e-05, 'episode': 2168, 'epoch': 0.27}



                                                                       
 27%|██▋       | 543/2041 [47:14<2:07:53,  5.12s/it]

{'eps': 0, 'objective/kl': 35.44440460205078, 'objective/entropy': 7.643943786621094, 'objective/non_score_reward': -1.7722201347351074, 'objective/rlhf_reward': -1.046918272972107, 'objective/scores': 0.7253018617630005, 'policy/approxkl_avg': 0.0011419998481869698, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.0031057060696184635, 'loss/value_avg': 0.00808110460639, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15819397568702698, 'val/ratio': 0.9998478889465332, 'val/ratio_var': 1.3506446805422456e-07, 'val/num_eos_tokens': 0, 'lr': 3.672219500244978e-05, 'episode': 2172, 'epoch': 0.27}



                                                                       
 27%|██▋       | 544/2041 [47:20<2:07:55,  5.13s/it]

{'eps': 0, 'objective/kl': 32.94355392456055, 'objective/entropy': 5.553138732910156, 'objective/non_score_reward': -1.647177815437317, 'objective/rlhf_reward': -0.8696314096450806, 'objective/scores': 0.7775464057922363, 'policy/approxkl_avg': 2.5189863663399592e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0005047366721555591, 'loss/value_avg': 0.006415906362235546, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16592100262641907, 'val/ratio': 0.9993407726287842, 'val/ratio_var': 3.2391145055044035e-07, 'val/num_eos_tokens': 0, 'lr': 3.669769720725135e-05, 'episode': 2176, 'epoch': 0.27}



                                                                       
 27%|██▋       | 545/2041 [47:25<2:07:50,  5.13s/it]

{'eps': 0, 'objective/kl': 34.23670196533203, 'objective/entropy': 7.580172061920166, 'objective/non_score_reward': -1.7118351459503174, 'objective/rlhf_reward': -0.8671568632125854, 'objective/scores': 0.8446782827377319, 'policy/approxkl_avg': 0.005385585594922304, 'policy/clipfrac_avg': 0.041273586452007294, 'loss/policy_avg': -0.008952035568654537, 'loss/value_avg': 0.0070878430269658566, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14334209263324738, 'val/ratio': 1.0033025741577148, 'val/ratio_var': 6.080224466131767e-06, 'val/num_eos_tokens': 0, 'lr': 3.6673199412052916e-05, 'episode': 2180, 'epoch': 0.27}



                                                                       
 27%|██▋       | 546/2041 [47:30<2:07:58,  5.14s/it]

{'eps': 0, 'objective/kl': 33.53857421875, 'objective/entropy': 12.743366241455078, 'objective/non_score_reward': -1.6769286394119263, 'objective/rlhf_reward': -1.0454378128051758, 'objective/scores': 0.6314908266067505, 'policy/approxkl_avg': 0.0045136259868741035, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.002579722786322236, 'loss/value_avg': 0.011672493070363998, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.08847559988498688, 'val/ratio': 1.0003416538238525, 'val/ratio_var': 1.069210429704981e-05, 'val/num_eos_tokens': 0, 'lr': 3.6648701616854485e-05, 'episode': 2184, 'epoch': 0.27}



                                                                       
 27%|██▋       | 547/2041 [47:35<2:07:31,  5.12s/it]

{'eps': 0, 'objective/kl': 34.200618743896484, 'objective/entropy': 4.188986301422119, 'objective/non_score_reward': -1.7100309133529663, 'objective/rlhf_reward': -0.9970391988754272, 'objective/scores': 0.7129917144775391, 'policy/approxkl_avg': 0.0001473604643251747, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.001019004383124411, 'loss/value_avg': 0.005720182787626982, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14298155903816223, 'val/ratio': 0.9990564584732056, 'val/ratio_var': 5.721813067793846e-07, 'val/num_eos_tokens': 0, 'lr': 3.662420382165605e-05, 'episode': 2188, 'epoch': 0.27}



                                                                       
 27%|██▋       | 548/2041 [47:40<2:08:21,  5.16s/it]

{'eps': 0, 'objective/kl': 36.98150634765625, 'objective/entropy': 6.023271560668945, 'objective/non_score_reward': -1.849075436592102, 'objective/rlhf_reward': -1.1360188722610474, 'objective/scores': 0.7130565643310547, 'policy/approxkl_avg': 0.002315173391252756, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.003805099055171013, 'loss/value_avg': 0.008132042363286018, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1259876787662506, 'val/ratio': 0.9987345337867737, 'val/ratio_var': 2.3529573809355497e-06, 'val/num_eos_tokens': 0, 'lr': 3.659970602645762e-05, 'episode': 2192, 'epoch': 0.27}
 27%|██▋       | 549/2041 [47:45<2:08:26,  5.17s/it]

{'eps': 0, 'objective/kl': 35.89141082763672, 'objective/entropy': 7.154609680175781, 'objective/non_score_reward': -1.7945704460144043, 'objective/rlhf_reward': -1.1179473400115967, 'objective/scores': 0.6766231656074524, 'policy/approxkl_avg': 0.00019381938909646124, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 0.0004388746456243098, 'loss/value_avg': 0.011256201192736626, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11447325348854065, 'val/ratio': 1.0003386735916138, 'val/ratio_var': 8.466248146987709e-08, 'val/num_eos_tokens': 0, 'lr': 3.657520823125919e-05, 'episode': 2196, 'epoch': 0.27}
 27%|██▋       | 550/2041 [47:51<2:08:29,  5.17s/it]

{'eps': 0, 'objective/kl': 36.27245330810547, 'objective/entropy': 7.051903247833252, 'objective/non_score_reward': -1.8136227130889893, 'objective/rlhf_reward': -1.1991405487060547, 'objective/scores': 0.6144822239875793, 'policy/approxkl_avg': 0.00034752130159176886, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.004063993692398071, 'loss/value_avg': 0.010828240774571896, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12143805623054504, 'val/ratio': 1.0019973516464233, 'val/ratio_var': 3.279865040894947e-06, 'val/num_eos_tokens': 0, 'lr': 3.655071043606076e-05, 'episode': 2200, 'epoch': 0.27}
 27%|██▋       | 551/2041 [47:56<2:08:07,  5.16s/it]

{'eps': 0, 'objective/kl': 38.431060791015625, 'objective/entropy': 7.511693954467773, 'objective/non_score_reward': -1.921553134918213, 'objective/rlhf_reward': -1.3127217292785645, 'objective/scores': 0.6088314056396484, 'policy/approxkl_avg': 0.0018825646257027984, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.0013357122661545873, 'loss/value_avg': 0.012918103486299515, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13865824043750763, 'val/ratio': 1.00238835811615, 'val/ratio_var': 5.689187673851848e-06, 'val/num_eos_tokens': 0, 'lr': 3.6526212640862325e-05, 'episode': 2204, 'epoch': 0.27}
 27%|██▋       | 552/2041 [48:01<2:08:23,  5.17s/it]

{'eps': 0, 'objective/kl': 37.53057861328125, 'objective/entropy': 7.2021870613098145, 'objective/non_score_reward': -1.8765289783477783, 'objective/rlhf_reward': -1.2686247825622559, 'objective/scores': 0.6079042553901672, 'policy/approxkl_avg': 0.0002727713726926595, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.0014721222687512636, 'loss/value_avg': 0.011559872888028622, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12469612807035446, 'val/ratio': 0.9984673261642456, 'val/ratio_var': 2.33104515245941e-06, 'val/num_eos_tokens': 0, 'lr': 3.650171484566389e-05, 'episode': 2208, 'epoch': 0.27}
 27%|██▋       | 553/2041 [48:06<2:07:49,  5.15s/it]

{'eps': 0, 'objective/kl': 36.63025665283203, 'objective/entropy': 6.123834133148193, 'objective/non_score_reward': -1.8315129280090332, 'objective/rlhf_reward': -1.179164171218872, 'objective/scores': 0.6523488163948059, 'policy/approxkl_avg': 0.002010632073506713, 'policy/clipfrac_avg': 0.015330188907682896, 'loss/policy_avg': -0.005083743948489428, 'loss/value_avg': 0.008444198407232761, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13048510253429413, 'val/ratio': 1.0005452632904053, 'val/ratio_var': 3.187693380368728e-07, 'val/num_eos_tokens': 0, 'lr': 3.647721705046546e-05, 'episode': 2212, 'epoch': 0.27}
 27%|██▋       | 554/2041 [48:11<2:08:01,  5.17s/it]

{'eps': 0, 'objective/kl': 36.19977569580078, 'objective/entropy': 4.774266719818115, 'objective/non_score_reward': -1.8099888563156128, 'objective/rlhf_reward': -1.1231274604797363, 'objective/scores': 0.6868613362312317, 'policy/approxkl_avg': 0.003218076191842556, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.0041459957137703896, 'loss/value_avg': 0.011540467850863934, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09937116503715515, 'val/ratio': 0.9987625479698181, 'val/ratio_var': 1.4448610272665974e-06, 'val/num_eos_tokens': 0, 'lr': 3.645271925526703e-05, 'episode': 2216, 'epoch': 0.27}
 27%|██▋       | 555/2041 [48:16<2:07:50,  5.16s/it]

{'eps': 0, 'objective/kl': 34.25886535644531, 'objective/entropy': 3.230884552001953, 'objective/non_score_reward': -1.712943196296692, 'objective/rlhf_reward': -1.043441653251648, 'objective/scores': 0.669501543045044, 'policy/approxkl_avg': 9.843124280450866e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 0.0012487113708630204, 'loss/value_avg': 0.005311818327754736, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0714481994509697, 'val/ratio': 1.0007681846618652, 'val/ratio_var': 3.146436995393742e-07, 'val/num_eos_tokens': 0, 'lr': 3.64282214600686e-05, 'episode': 2220, 'epoch': 0.27}
 27%|██▋       | 556/2041 [48:22<2:07:22,  5.15s/it]

{'eps': 0, 'objective/kl': 33.854774475097656, 'objective/entropy': 2.407744884490967, 'objective/non_score_reward': -1.6927387714385986, 'objective/rlhf_reward': -1.1439588069915771, 'objective/scores': 0.5487800240516663, 'policy/approxkl_avg': 0.00012008812336716801, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.0018091420643031597, 'loss/value_avg': 0.004802272189408541, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0737166702747345, 'val/ratio': 0.9990859627723694, 'val/ratio_var': 1.0111447181770927e-06, 'val/num_eos_tokens': 0, 'lr': 3.6403723664870165e-05, 'episode': 2224, 'epoch': 0.27}
 27%|██▋       | 557/2041 [48:27<2:07:47,  5.17s/it]

{'eps': 0, 'objective/kl': 35.56142807006836, 'objective/entropy': 4.520127296447754, 'objective/non_score_reward': -1.778071403503418, 'objective/rlhf_reward': -1.1458710432052612, 'objective/scores': 0.6322003602981567, 'policy/approxkl_avg': 0.0035424327943474054, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.008585946634411812, 'loss/value_avg': 0.006232342682778835, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0983988344669342, 'val/ratio': 0.9969865083694458, 'val/ratio_var': 1.0210274922428653e-05, 'val/num_eos_tokens': 0, 'lr': 3.6379225869671726e-05, 'episode': 2228, 'epoch': 0.27}
 27%|██▋       | 558/2041 [48:32<2:08:25,  5.20s/it]

{'eps': 0, 'objective/kl': 34.52500534057617, 'objective/entropy': 6.708757400512695, 'objective/non_score_reward': -1.7262502908706665, 'objective/rlhf_reward': -1.2768685817718506, 'objective/scores': 0.4493817687034607, 'policy/approxkl_avg': 0.0030818891245871782, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.00047538414946757257, 'loss/value_avg': 0.010267957113683224, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12989990413188934, 'val/ratio': 0.9982535243034363, 'val/ratio_var': 1.7692804021862685e-06, 'val/num_eos_tokens': 0, 'lr': 3.63547280744733e-05, 'episode': 2232, 'epoch': 0.27}
 27%|██▋       | 559/2041 [48:37<2:08:03,  5.18s/it]

{'eps': 0, 'objective/kl': 38.95132064819336, 'objective/entropy': 7.898791790008545, 'objective/non_score_reward': -1.947566032409668, 'objective/rlhf_reward': -1.4650360345840454, 'objective/scores': 0.48252999782562256, 'policy/approxkl_avg': 0.0001177969024865888, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 0.0002838409855030477, 'loss/value_avg': 0.01760992780327797, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09975578635931015, 'val/ratio': 1.0004956722259521, 'val/ratio_var': 6.12171504599246e-07, 'val/num_eos_tokens': 0, 'lr': 3.633023027927487e-05, 'episode': 2236, 'epoch': 0.27}
 27%|██▋       | 560/2041 [48:42<2:08:15,  5.20s/it]

{'eps': 0, 'objective/kl': 33.713478088378906, 'objective/entropy': 5.384017467498779, 'objective/non_score_reward': -1.6856739521026611, 'objective/rlhf_reward': -1.0747336149215698, 'objective/scores': 0.6109403371810913, 'policy/approxkl_avg': 0.0008585397154092789, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.0016025924123823643, 'loss/value_avg': 0.008692887611687183, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12214185297489166, 'val/ratio': 1.0015840530395508, 'val/ratio_var': 2.0558807136694668e-06, 'val/num_eos_tokens': 0, 'lr': 3.630573248407643e-05, 'episode': 2240, 'epoch': 0.27}
 27%|██▋       | 561/2041 [48:48<2:07:47,  5.18s/it]

{'eps': 0, 'objective/kl': 38.87805938720703, 'objective/entropy': 6.318464756011963, 'objective/non_score_reward': -1.9439030885696411, 'objective/rlhf_reward': -1.374669075012207, 'objective/scores': 0.5692340731620789, 'policy/approxkl_avg': 8.22900328785181e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 9.88492465694435e-05, 'loss/value_avg': 0.023052645847201347, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.08998569846153259, 'val/ratio': 0.9995718002319336, 'val/ratio_var': 1.1019741918971704e-07, 'val/num_eos_tokens': 0, 'lr': 3.6281234688878005e-05, 'episode': 2244, 'epoch': 0.27}
 28%|██▊       | 562/2041 [48:53<2:06:47,  5.14s/it]

{'eps': 0, 'objective/kl': 34.79542541503906, 'objective/entropy': 4.512214660644531, 'objective/non_score_reward': -1.7397713661193848, 'objective/rlhf_reward': -1.2495924234390259, 'objective/scores': 0.4901789724826813, 'policy/approxkl_avg': 0.0005080067203380167, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.0003721567918546498, 'loss/value_avg': 0.005002401303499937, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14274962246418, 'val/ratio': 0.9966676831245422, 'val/ratio_var': 7.736088264209684e-06, 'val/num_eos_tokens': 0, 'lr': 3.625673689367957e-05, 'episode': 2248, 'epoch': 0.28}
 28%|██▊       | 563/2041 [48:58<2:06:42,  5.14s/it]

{'eps': 0, 'objective/kl': 33.829139709472656, 'objective/entropy': 9.118310928344727, 'objective/non_score_reward': -1.6914567947387695, 'objective/rlhf_reward': -1.1890342235565186, 'objective/scores': 0.502422571182251, 'policy/approxkl_avg': 0.00455184280872345, 'policy/clipfrac_avg': 0.024764152243733406, 'loss/policy_avg': -0.006781339645385742, 'loss/value_avg': 0.008927861228585243, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13565868139266968, 'val/ratio': 1.0020928382873535, 'val/ratio_var': 3.209544274795917e-06, 'val/num_eos_tokens': 0, 'lr': 3.623223909848114e-05, 'episode': 2252, 'epoch': 0.28}
 28%|██▊       | 564/2041 [49:03<2:07:05,  5.16s/it]

{'eps': 0, 'objective/kl': 36.27275466918945, 'objective/entropy': 8.744909286499023, 'objective/non_score_reward': -1.8136377334594727, 'objective/rlhf_reward': -1.29502272605896, 'objective/scores': 0.5186149477958679, 'policy/approxkl_avg': 0.0008240515599027276, 'policy/clipfrac_avg': 0.008254717104136944, 'loss/policy_avg': -0.0026641073636710644, 'loss/value_avg': 0.00753400381654501, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11546332389116287, 'val/ratio': 1.0045726299285889, 'val/ratio_var': 1.0200924407399725e-05, 'val/num_eos_tokens': 0, 'lr': 3.62077413032827e-05, 'episode': 2256, 'epoch': 0.28}
 28%|██▊       | 565/2041 [49:08<2:06:14,  5.13s/it]

{'eps': 0, 'objective/kl': 37.59251403808594, 'objective/entropy': 7.5085248947143555, 'objective/non_score_reward': -1.8796257972717285, 'objective/rlhf_reward': -1.475054383277893, 'objective/scores': 0.40457141399383545, 'policy/approxkl_avg': 0.00034832966048270464, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.0027059977874159813, 'loss/value_avg': 0.0151748638600111, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12252632528543472, 'val/ratio': 1.0000044107437134, 'val/ratio_var': 1.070508119482838e-06, 'val/num_eos_tokens': 0, 'lr': 3.618324350808428e-05, 'episode': 2260, 'epoch': 0.28}
 28%|██▊       | 566/2041 [49:13<2:05:52,  5.12s/it]

{'eps': 0, 'objective/kl': 37.78544616699219, 'objective/entropy': 6.17500114440918, 'objective/non_score_reward': -1.8892723321914673, 'objective/rlhf_reward': -1.3445934057235718, 'objective/scores': 0.5446789264678955, 'policy/approxkl_avg': 0.0035615162923932076, 'policy/clipfrac_avg': 0.02004716917872429, 'loss/policy_avg': -0.006577144376933575, 'loss/value_avg': 0.0073661282658576965, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12086430937051773, 'val/ratio': 0.9988993406295776, 'val/ratio_var': 8.377201083931141e-07, 'val/num_eos_tokens': 0, 'lr': 3.6158745712885845e-05, 'episode': 2264, 'epoch': 0.28}
 28%|██▊       | 567/2041 [49:18<2:05:42,  5.12s/it]

{'eps': 0, 'objective/kl': 35.58684539794922, 'objective/entropy': 6.580982685089111, 'objective/non_score_reward': -1.7793422937393188, 'objective/rlhf_reward': -1.3022844791412354, 'objective/scores': 0.4770577549934387, 'policy/approxkl_avg': 0.0021143751218914986, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.002311260672286153, 'loss/value_avg': 0.0068149324506521225, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09994588792324066, 'val/ratio': 0.9965490698814392, 'val/ratio_var': 7.650403858860955e-06, 'val/num_eos_tokens': 0, 'lr': 3.613424791768741e-05, 'episode': 2268, 'epoch': 0.28}
 28%|██▊       | 568/2041 [49:23<2:05:50,  5.13s/it]

{'eps': 0, 'objective/kl': 35.20037078857422, 'objective/entropy': 6.308432102203369, 'objective/non_score_reward': -1.7600184679031372, 'objective/rlhf_reward': -1.2614200115203857, 'objective/scores': 0.4985983967781067, 'policy/approxkl_avg': 0.0023021837696433067, 'policy/clipfrac_avg': 0.015330188907682896, 'loss/policy_avg': -0.004557294771075249, 'loss/value_avg': 0.004024986177682877, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.08732527494430542, 'val/ratio': 0.9994101524353027, 'val/ratio_var': 2.366529230357628e-07, 'val/num_eos_tokens': 0, 'lr': 3.6109750122488975e-05, 'episode': 2272, 'epoch': 0.28}
 28%|██▊       | 569/2041 [49:28<2:05:53,  5.13s/it]

{'eps': 0, 'objective/kl': 37.78435134887695, 'objective/entropy': 7.996156692504883, 'objective/non_score_reward': -1.8892176151275635, 'objective/rlhf_reward': -1.2330703735351562, 'objective/scores': 0.6561472415924072, 'policy/approxkl_avg': 0.0013190651079639792, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.0007477683830074966, 'loss/value_avg': 0.007455105893313885, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11742280423641205, 'val/ratio': 1.0048928260803223, 'val/ratio_var': 1.9294673620606773e-05, 'val/num_eos_tokens': 0, 'lr': 3.608525232729055e-05, 'episode': 2276, 'epoch': 0.28}
 28%|██▊       | 570/2041 [49:34<2:05:54,  5.14s/it]

{'eps': 0, 'objective/kl': 34.05508804321289, 'objective/entropy': 8.365249633789062, 'objective/non_score_reward': -1.7027543783187866, 'objective/rlhf_reward': -1.1480932235717773, 'objective/scores': 0.554661214351654, 'policy/approxkl_avg': 0.0033795323688536882, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.0014609255595132709, 'loss/value_avg': 0.008181191980838776, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14859484136104584, 'val/ratio': 1.0067099332809448, 'val/ratio_var': 3.016271693923045e-05, 'val/num_eos_tokens': 0, 'lr': 3.606075453209211e-05, 'episode': 2280, 'epoch': 0.28}
 28%|██▊       | 571/2041 [49:39<2:05:14,  5.11s/it]

{'eps': 0, 'objective/kl': 39.596717834472656, 'objective/entropy': 12.705809593200684, 'objective/non_score_reward': -1.979835867881775, 'objective/rlhf_reward': -1.3799364566802979, 'objective/scores': 0.599899411201477, 'policy/approxkl_avg': 0.013085849583148956, 'policy/clipfrac_avg': 0.02004716917872429, 'loss/policy_avg': -0.014243822544813156, 'loss/value_avg': 0.01933128759264946, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12494605034589767, 'val/ratio': 1.0368108749389648, 'val/ratio_var': 0.0009931664681062102, 'val/num_eos_tokens': 0, 'lr': 3.603625673689368e-05, 'episode': 2284, 'epoch': 0.28}
 28%|██▊       | 572/2041 [49:44<2:04:59,  5.11s/it]

{'eps': 0, 'objective/kl': 33.54591369628906, 'objective/entropy': 6.482179164886475, 'objective/non_score_reward': -1.6772956848144531, 'objective/rlhf_reward': -1.2031968832015991, 'objective/scores': 0.474098801612854, 'policy/approxkl_avg': 0.00016968537238426507, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.0018159999744966626, 'loss/value_avg': 0.005550587549805641, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12561574578285217, 'val/ratio': 0.9994546175003052, 'val/ratio_var': 1.9114652616281091e-07, 'val/num_eos_tokens': 0, 'lr': 3.6011758941695254e-05, 'episode': 2288, 'epoch': 0.28}
 28%|██▊       | 573/2041 [49:49<2:05:54,  5.15s/it]

{'eps': 0, 'objective/kl': 35.870941162109375, 'objective/entropy': 9.880550384521484, 'objective/non_score_reward': -1.7935471534729004, 'objective/rlhf_reward': -1.2963342666625977, 'objective/scores': 0.49721282720565796, 'policy/approxkl_avg': 0.001388377626426518, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.0021104952320456505, 'loss/value_avg': 0.008244065567851067, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12875720858573914, 'val/ratio': 0.9978615641593933, 'val/ratio_var': 3.561274979801965e-06, 'val/num_eos_tokens': 0, 'lr': 3.5987261146496815e-05, 'episode': 2292, 'epoch': 0.28}
 28%|██▊       | 574/2041 [49:54<2:06:03,  5.16s/it]

{'eps': 0, 'objective/kl': 36.572105407714844, 'objective/entropy': 4.772487640380859, 'objective/non_score_reward': -1.8286051750183105, 'objective/rlhf_reward': -1.3632179498672485, 'objective/scores': 0.465387225151062, 'policy/approxkl_avg': 2.454994864820037e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0002080170961562544, 'loss/value_avg': 0.0055563971400260925, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12929530441761017, 'val/ratio': 1.000551462173462, 'val/ratio_var': 1.5147519150104927e-07, 'val/num_eos_tokens': 0, 'lr': 3.596276335129838e-05, 'episode': 2296, 'epoch': 0.28}
 28%|██▊       | 575/2041 [49:59<2:05:35,  5.14s/it]

{'eps': 0, 'objective/kl': 37.210533142089844, 'objective/entropy': 6.962835788726807, 'objective/non_score_reward': -1.8605269193649292, 'objective/rlhf_reward': -1.3797228336334229, 'objective/scores': 0.4808041453361511, 'policy/approxkl_avg': 0.001061856048181653, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.000417136907344684, 'loss/value_avg': 0.011706321500241756, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1397038698196411, 'val/ratio': 0.9999351501464844, 'val/ratio_var': 3.706805173919747e-08, 'val/num_eos_tokens': 0, 'lr': 3.593826555609995e-05, 'episode': 2300, 'epoch': 0.28}
 28%|██▊       | 576/2041 [50:04<2:05:18,  5.13s/it]

{'eps': 0, 'objective/kl': 36.943382263183594, 'objective/entropy': 8.653327941894531, 'objective/non_score_reward': -1.847169280052185, 'objective/rlhf_reward': -1.4535484313964844, 'objective/scores': 0.39362090826034546, 'policy/approxkl_avg': 3.7117413739906624e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0017025269335135818, 'loss/value_avg': 0.011020198464393616, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15986275672912598, 'val/ratio': 1.0013368129730225, 'val/ratio_var': 2.2310875920084072e-06, 'val/num_eos_tokens': 0, 'lr': 3.5913767760901526e-05, 'episode': 2304, 'epoch': 0.28}
 28%|██▊       | 577/2041 [50:10<2:05:04,  5.13s/it]

{'eps': 0, 'objective/kl': 34.69197082519531, 'objective/entropy': 5.684271812438965, 'objective/non_score_reward': -1.7345986366271973, 'objective/rlhf_reward': -1.2234069108963013, 'objective/scores': 0.511191725730896, 'policy/approxkl_avg': 5.1452712796162814e-05, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.0002379349898546934, 'loss/value_avg': 0.005256387870758772, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16346055269241333, 'val/ratio': 0.9999972581863403, 'val/ratio_var': 3.752452926164551e-08, 'val/num_eos_tokens': 0, 'lr': 3.588926996570309e-05, 'episode': 2308, 'epoch': 0.28}
 28%|██▊       | 578/2041 [50:15<2:05:43,  5.16s/it]

{'eps': 0, 'objective/kl': 35.883094787597656, 'objective/entropy': 6.115251541137695, 'objective/non_score_reward': -1.7941548824310303, 'objective/rlhf_reward': -1.3155909776687622, 'objective/scores': 0.47856393456459045, 'policy/approxkl_avg': 0.000130157801322639, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0006314251804724336, 'loss/value_avg': 0.005518458783626556, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1430652141571045, 'val/ratio': 0.9989813566207886, 'val/ratio_var': 7.831824859749759e-07, 'val/num_eos_tokens': 0, 'lr': 3.5864772170504655e-05, 'episode': 2312, 'epoch': 0.28}
 28%|██▊       | 579/2041 [50:20<2:06:07,  5.18s/it]

{'eps': 0, 'objective/kl': 35.70934295654297, 'objective/entropy': 8.080217361450195, 'objective/non_score_reward': -1.7854671478271484, 'objective/rlhf_reward': -1.361510157585144, 'objective/scores': 0.4239570200443268, 'policy/approxkl_avg': 0.0018985304050147533, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.0012958745937794447, 'loss/value_avg': 0.008755344897508621, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12368756532669067, 'val/ratio': 0.9992011785507202, 'val/ratio_var': 4.787903549186012e-07, 'val/num_eos_tokens': 0, 'lr': 3.584027437530622e-05, 'episode': 2316, 'epoch': 0.28}
 28%|██▊       | 580/2041 [50:25<2:06:42,  5.20s/it]

{'eps': 0, 'objective/kl': 33.08704376220703, 'objective/entropy': 7.832952976226807, 'objective/non_score_reward': -1.6543521881103516, 'objective/rlhf_reward': -1.317787528038025, 'objective/scores': 0.33656466007232666, 'policy/approxkl_avg': 0.0017548358300700784, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.002326366025954485, 'loss/value_avg': 0.00814213789999485, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.08595887571573257, 'val/ratio': 0.9974275827407837, 'val/ratio_var': 5.10734025738202e-06, 'val/num_eos_tokens': 0, 'lr': 3.581577658010779e-05, 'episode': 2320, 'epoch': 0.28}
 28%|██▊       | 581/2041 [50:30<2:06:07,  5.18s/it]

{'eps': 0, 'objective/kl': 32.87027359008789, 'objective/entropy': 5.733901023864746, 'objective/non_score_reward': -1.643513560295105, 'objective/rlhf_reward': -1.2826741933822632, 'objective/scores': 0.3608393669128418, 'policy/approxkl_avg': 0.0004834240535274148, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.0013097018236294389, 'loss/value_avg': 0.006370461545884609, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.07049702852964401, 'val/ratio': 0.9986510276794434, 'val/ratio_var': 1.208616595249623e-06, 'val/num_eos_tokens': 0, 'lr': 3.579127878490936e-05, 'episode': 2324, 'epoch': 0.28}
 29%|██▊       | 582/2041 [50:36<2:05:45,  5.17s/it]

{'eps': 0, 'objective/kl': 34.78569793701172, 'objective/entropy': 6.620853424072266, 'objective/non_score_reward': -1.7392849922180176, 'objective/rlhf_reward': -1.373859167098999, 'objective/scores': 0.3654257655143738, 'policy/approxkl_avg': 0.0027996746357530355, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.0019355770200490952, 'loss/value_avg': 0.007855433970689774, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.07686121761798859, 'val/ratio': 1.0093047618865967, 'val/ratio_var': 8.889586752047762e-05, 'val/num_eos_tokens': 0, 'lr': 3.576678098971093e-05, 'episode': 2328, 'epoch': 0.29}
 29%|██▊       | 583/2041 [50:41<2:05:15,  5.15s/it]

{'eps': 0, 'objective/kl': 35.69168472290039, 'objective/entropy': 5.924934387207031, 'objective/non_score_reward': -1.784584403038025, 'objective/rlhf_reward': -1.4007657766342163, 'objective/scores': 0.3838186264038086, 'policy/approxkl_avg': 0.001340421847999096, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.0032470361329615116, 'loss/value_avg': 0.006834708154201508, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.10127944499254227, 'val/ratio': 1.000435471534729, 'val/ratio_var': 1.7385873718467337e-07, 'val/num_eos_tokens': 0, 'lr': 3.5742283194512495e-05, 'episode': 2332, 'epoch': 0.29}
 29%|██▊       | 584/2041 [50:46<2:05:46,  5.18s/it]

{'eps': 0, 'objective/kl': 36.81494903564453, 'objective/entropy': 7.961222171783447, 'objective/non_score_reward': -1.840747356414795, 'objective/rlhf_reward': -1.4281787872314453, 'objective/scores': 0.4125686287879944, 'policy/approxkl_avg': 0.003122342051938176, 'policy/clipfrac_avg': 0.021226413547992706, 'loss/policy_avg': -0.00208110804669559, 'loss/value_avg': 0.006714732851833105, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1434684693813324, 'val/ratio': 1.004504919052124, 'val/ratio_var': 1.4373380508914124e-05, 'val/num_eos_tokens': 0, 'lr': 3.5717785399314063e-05, 'episode': 2336, 'epoch': 0.29}
 29%|██▊       | 585/2041 [50:51<2:05:14,  5.16s/it]

{'eps': 0, 'objective/kl': 35.27532958984375, 'objective/entropy': 6.3897504806518555, 'objective/non_score_reward': -1.7637664079666138, 'objective/rlhf_reward': -1.2820045948028564, 'objective/scores': 0.4817618131637573, 'policy/approxkl_avg': 0.005129434168338776, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.0021911519579589367, 'loss/value_avg': 0.00484660966321826, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1702652871608734, 'val/ratio': 0.9969195127487183, 'val/ratio_var': 4.8396964302810375e-06, 'val/num_eos_tokens': 0, 'lr': 3.569328760411563e-05, 'episode': 2340, 'epoch': 0.29}
 29%|██▊       | 586/2041 [50:56<2:05:10,  5.16s/it]

{'eps': 0, 'objective/kl': 35.11863708496094, 'objective/entropy': 7.828407287597656, 'objective/non_score_reward': -1.755932092666626, 'objective/rlhf_reward': -1.3271859884262085, 'objective/scores': 0.4287461042404175, 'policy/approxkl_avg': 0.003719378961250186, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.003008441533893347, 'loss/value_avg': 0.010075886733829975, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12984661757946014, 'val/ratio': 1.006023645401001, 'val/ratio_var': 5.669260281138122e-05, 'val/num_eos_tokens': 0, 'lr': 3.56687898089172e-05, 'episode': 2344, 'epoch': 0.29}
 29%|██▉       | 587/2041 [51:01<2:05:09,  5.16s/it]

{'eps': 0, 'objective/kl': 37.70487594604492, 'objective/entropy': 6.74237585067749, 'objective/non_score_reward': -1.8852438926696777, 'objective/rlhf_reward': -1.4425185918807983, 'objective/scores': 0.4427253007888794, 'policy/approxkl_avg': 0.0008163736783899367, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -4.653603537008166e-05, 'loss/value_avg': 0.006222552619874477, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12289741635322571, 'val/ratio': 0.9969031810760498, 'val/ratio_var': 7.98268592916429e-06, 'val/num_eos_tokens': 0, 'lr': 3.564429201371877e-05, 'episode': 2348, 'epoch': 0.29}



                                                                       
 29%|██▉       | 588/2041 [51:06<2:04:41,  5.15s/it]

{'eps': 0, 'objective/kl': 33.67327117919922, 'objective/entropy': 8.066969871520996, 'objective/non_score_reward': -1.6836636066436768, 'objective/rlhf_reward': -1.2282558679580688, 'objective/scores': 0.4554077684879303, 'policy/approxkl_avg': 0.0018747687572613358, 'policy/clipfrac_avg': 0.008254717104136944, 'loss/policy_avg': -0.003707385854795575, 'loss/value_avg': 0.005312252324074507, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16437849402427673, 'val/ratio': 1.0002094507217407, 'val/ratio_var': 2.546075883458343e-08, 'val/num_eos_tokens': 0, 'lr': 3.5619794218520336e-05, 'episode': 2352, 'epoch': 0.29}



                                                                       
 29%|██▉       | 589/2041 [51:12<2:04:07,  5.13s/it]

{'eps': 0, 'objective/kl': 34.4947509765625, 'objective/entropy': 6.994779109954834, 'objective/non_score_reward': -1.724737524986267, 'objective/rlhf_reward': -1.3119229078292847, 'objective/scores': 0.41281458735466003, 'policy/approxkl_avg': 0.0005246257060207427, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.0021066037006676197, 'loss/value_avg': 0.0062970612198114395, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1303982138633728, 'val/ratio': 1.0055351257324219, 'val/ratio_var': 2.535899511713069e-05, 'val/num_eos_tokens': 0, 'lr': 3.5595296423321904e-05, 'episode': 2356, 'epoch': 0.29}



                                                                       
 29%|██▉       | 590/2041 [51:17<2:05:07,  5.17s/it]

{'eps': 0, 'objective/kl': 35.6790657043457, 'objective/entropy': 4.465724945068359, 'objective/non_score_reward': -1.7839531898498535, 'objective/rlhf_reward': -1.4304536581039429, 'objective/scores': 0.35349950194358826, 'policy/approxkl_avg': 5.5328615417238325e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.00044195353984832764, 'loss/value_avg': 0.009546304121613503, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.10747308284044266, 'val/ratio': 1.0002779960632324, 'val/ratio_var': 3.68621861923657e-08, 'val/num_eos_tokens': 0, 'lr': 3.557079862812347e-05, 'episode': 2360, 'epoch': 0.29}



                                                                       
 29%|██▉       | 591/2041 [51:22<2:04:10,  5.14s/it]

{'eps': 0, 'objective/kl': 35.62696075439453, 'objective/entropy': 4.451780796051025, 'objective/non_score_reward': -1.7813478708267212, 'objective/rlhf_reward': -1.2923853397369385, 'objective/scores': 0.4889625906944275, 'policy/approxkl_avg': 0.0003993279824499041, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.0013399855233728886, 'loss/value_avg': 0.00462445430457592, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12210217118263245, 'val/ratio': 0.9996511936187744, 'val/ratio_var': 6.428492582699619e-08, 'val/num_eos_tokens': 0, 'lr': 3.554630083292504e-05, 'episode': 2364, 'epoch': 0.29}



                                                                       
 29%|██▉       | 592/2041 [51:27<2:04:22,  5.15s/it]

{'eps': 0, 'objective/kl': 34.309043884277344, 'objective/entropy': 8.786852836608887, 'objective/non_score_reward': -1.7154521942138672, 'objective/rlhf_reward': -1.334618330001831, 'objective/scores': 0.38083386421203613, 'policy/approxkl_avg': 0.0007859442266635597, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.002484971657395363, 'loss/value_avg': 0.006776539143174887, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16411641240119934, 'val/ratio': 1.0015112161636353, 'val/ratio_var': 2.6534864900895627e-06, 'val/num_eos_tokens': 0, 'lr': 3.552180303772661e-05, 'episode': 2368, 'epoch': 0.29}



                                                                       
 29%|██▉       | 593/2041 [51:32<2:03:37,  5.12s/it]

{'eps': 0, 'objective/kl': 35.587425231933594, 'objective/entropy': 8.429594039916992, 'objective/non_score_reward': -1.7793715000152588, 'objective/rlhf_reward': -1.295763611793518, 'objective/scores': 0.48360785841941833, 'policy/approxkl_avg': 0.0008389278664253652, 'policy/clipfrac_avg': 0.008254717104136944, 'loss/policy_avg': -0.0017409799620509148, 'loss/value_avg': 0.006238535512238741, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13173051178455353, 'val/ratio': 0.9998766183853149, 'val/ratio_var': 8.992491196124774e-09, 'val/num_eos_tokens': 0, 'lr': 3.5497305242528176e-05, 'episode': 2372, 'epoch': 0.29}



                                                                       
 29%|██▉       | 594/2041 [51:37<2:03:54,  5.14s/it]

{'eps': 0, 'objective/kl': 33.775882720947266, 'objective/entropy': 8.508060455322266, 'objective/non_score_reward': -1.6887943744659424, 'objective/rlhf_reward': -1.174636960029602, 'objective/scores': 0.5141574144363403, 'policy/approxkl_avg': 0.00042073975782841444, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.0016436531441286206, 'loss/value_avg': 0.008039923384785652, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1257428526878357, 'val/ratio': 1.0026825666427612, 'val/ratio_var': 1.2012232218694407e-05, 'val/num_eos_tokens': 0, 'lr': 3.5472807447329744e-05, 'episode': 2376, 'epoch': 0.29}



                                                                       
 29%|██▉       | 595/2041 [51:42<2:02:54,  5.10s/it]

{'eps': 0, 'objective/kl': 35.871673583984375, 'objective/entropy': 11.607561111450195, 'objective/non_score_reward': -1.7935839891433716, 'objective/rlhf_reward': -1.3984477519989014, 'objective/scores': 0.3951362371444702, 'policy/approxkl_avg': 0.001572233042679727, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.0015729565639048815, 'loss/value_avg': 0.007297843229025602, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14384236931800842, 'val/ratio': 1.0041581392288208, 'val/ratio_var': 1.5253302990458906e-05, 'val/num_eos_tokens': 0, 'lr': 3.544830965213131e-05, 'episode': 2380, 'epoch': 0.29}



                                                                       
 29%|██▉       | 596/2041 [51:47<2:03:16,  5.12s/it]

{'eps': 0, 'objective/kl': 36.75921630859375, 'objective/entropy': 8.00103759765625, 'objective/non_score_reward': -1.837960958480835, 'objective/rlhf_reward': -1.4408395290374756, 'objective/scores': 0.397121399641037, 'policy/approxkl_avg': 0.00043702544644474983, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.0016293295193463564, 'loss/value_avg': 0.008256515488028526, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14667865633964539, 'val/ratio': 0.9999775290489197, 'val/ratio_var': 5.050438289799786e-09, 'val/num_eos_tokens': 0, 'lr': 3.542381185693287e-05, 'episode': 2384, 'epoch': 0.29}



                                                                       
 29%|██▉       | 597/2041 [51:53<2:03:53,  5.15s/it]

{'eps': 0, 'objective/kl': 34.782928466796875, 'objective/entropy': 10.41867733001709, 'objective/non_score_reward': -1.73914635181427, 'objective/rlhf_reward': -1.3845025300979614, 'objective/scores': 0.3546438217163086, 'policy/approxkl_avg': 0.001087029930204153, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.0037863855250179768, 'loss/value_avg': 0.007847447879612446, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16355669498443604, 'val/ratio': 1.0016812086105347, 'val/ratio_var': 1.3964084928375087e-06, 'val/num_eos_tokens': 0, 'lr': 3.539931406173445e-05, 'episode': 2388, 'epoch': 0.29}



                                                                       
 29%|██▉       | 598/2041 [51:58<2:03:37,  5.14s/it]

{'eps': 0, 'objective/kl': 34.62118911743164, 'objective/entropy': 8.702863693237305, 'objective/non_score_reward': -1.7310595512390137, 'objective/rlhf_reward': -1.3318569660186768, 'objective/scores': 0.3992026448249817, 'policy/approxkl_avg': 0.00037947684177197516, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -1.0570826816547196e-05, 'loss/value_avg': 0.0086483433842659, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1676635891199112, 'val/ratio': 0.9985893964767456, 'val/ratio_var': 1.337335561402142e-06, 'val/num_eos_tokens': 0, 'lr': 3.5374816266536016e-05, 'episode': 2392, 'epoch': 0.29}



                                                                       
 29%|██▉       | 599/2041 [52:03<2:04:21,  5.17s/it]

{'eps': 0, 'objective/kl': 38.031829833984375, 'objective/entropy': 9.53721809387207, 'objective/non_score_reward': -1.901591420173645, 'objective/rlhf_reward': -1.4653077125549316, 'objective/scores': 0.43628376722335815, 'policy/approxkl_avg': 0.0001512835151515901, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.0008145615574903786, 'loss/value_avg': 0.007997903041541576, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16077405214309692, 'val/ratio': 1.0005676746368408, 'val/ratio_var': 1.9389976557704358e-07, 'val/num_eos_tokens': 0, 'lr': 3.535031847133758e-05, 'episode': 2396, 'epoch': 0.29}



                                                                       
 29%|██▉       | 600/2041 [52:08<2:03:46,  5.15s/it]

{'eps': 0, 'objective/kl': 36.291175842285156, 'objective/entropy': 11.222820281982422, 'objective/non_score_reward': -1.814558982849121, 'objective/rlhf_reward': -1.423417329788208, 'objective/scores': 0.39114171266555786, 'policy/approxkl_avg': 0.0005349396378733218, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.0027706134133040905, 'loss/value_avg': 0.00895705260336399, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16263405978679657, 'val/ratio': 1.0022330284118652, 'val/ratio_var': 3.9708188523945864e-06, 'val/num_eos_tokens': 0, 'lr': 3.5325820676139145e-05, 'episode': 2400, 'epoch': 0.29}



                                                                       
 29%|██▉       | 601/2041 [52:13<2:03:10,  5.13s/it]

{'eps': 0, 'objective/kl': 34.84925079345703, 'objective/entropy': 6.743953704833984, 'objective/non_score_reward': -1.7424626350402832, 'objective/rlhf_reward': -1.285165548324585, 'objective/scores': 0.457297146320343, 'policy/approxkl_avg': 0.0011922222329303622, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.002251691883429885, 'loss/value_avg': 0.006518246605992317, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15509016811847687, 'val/ratio': 1.0014030933380127, 'val/ratio_var': 1.8602689806357375e-06, 'val/num_eos_tokens': 0, 'lr': 3.530132288094072e-05, 'episode': 2404, 'epoch': 0.29}



                                                                       
 29%|██▉       | 602/2041 [52:18<2:03:01,  5.13s/it]

{'eps': 0, 'objective/kl': 37.90781784057617, 'objective/entropy': 5.737677097320557, 'objective/non_score_reward': -1.8953907489776611, 'objective/rlhf_reward': -1.4350864887237549, 'objective/scores': 0.4603042006492615, 'policy/approxkl_avg': 0.0028387282509356737, 'policy/clipfrac_avg': 0.016509434208273888, 'loss/policy_avg': -0.0015912782400846481, 'loss/value_avg': 0.01113487035036087, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11439275741577148, 'val/ratio': 0.999905526638031, 'val/ratio_var': 1.71671263871076e-08, 'val/num_eos_tokens': 0, 'lr': 3.527682508574228e-05, 'episode': 2408, 'epoch': 0.29}



                                                                       
 30%|██▉       | 603/2041 [52:24<2:03:19,  5.15s/it]

{'eps': 0, 'objective/kl': 33.47795104980469, 'objective/entropy': 6.350225448608398, 'objective/non_score_reward': -1.6738975048065186, 'objective/rlhf_reward': -1.3042895793914795, 'objective/scores': 0.3696078658103943, 'policy/approxkl_avg': 0.000659814802929759, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.002871346427127719, 'loss/value_avg': 0.006007364951074123, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11188221722841263, 'val/ratio': 1.0011143684387207, 'val/ratio_var': 1.753680408000946e-06, 'val/num_eos_tokens': 0, 'lr': 3.525232729054385e-05, 'episode': 2412, 'epoch': 0.3}



                                                                       
 30%|██▉       | 604/2041 [52:29<2:03:36,  5.16s/it]

{'eps': 0, 'objective/kl': 35.95314025878906, 'objective/entropy': 7.277606010437012, 'objective/non_score_reward': -1.7976570129394531, 'objective/rlhf_reward': -1.4836375713348389, 'objective/scores': 0.3140193819999695, 'policy/approxkl_avg': 0.0015057490672916174, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': 0.0004875654121860862, 'loss/value_avg': 0.006241299211978912, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11947622150182724, 'val/ratio': 1.004997730255127, 'val/ratio_var': 1.7573183868080378e-05, 'val/num_eos_tokens': 0, 'lr': 3.5227829495345424e-05, 'episode': 2416, 'epoch': 0.3}



                                                                       
 30%|██▉       | 605/2041 [52:34<2:03:04,  5.14s/it]

{'eps': 0, 'objective/kl': 37.14562225341797, 'objective/entropy': 8.406728744506836, 'objective/non_score_reward': -1.85728120803833, 'objective/rlhf_reward': -1.3830018043518066, 'objective/scores': 0.47427940368652344, 'policy/approxkl_avg': 0.0012113489210605621, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.0014771594433113933, 'loss/value_avg': 0.005500100087374449, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.10623131692409515, 'val/ratio': 0.9957289099693298, 'val/ratio_var': 1.7395343093085103e-05, 'val/num_eos_tokens': 0, 'lr': 3.520333170014699e-05, 'episode': 2420, 'epoch': 0.3}



                                                                       
 30%|██▉       | 606/2041 [52:39<2:03:36,  5.17s/it]

{'eps': 0, 'objective/kl': 35.383262634277344, 'objective/entropy': 4.164202690124512, 'objective/non_score_reward': -1.7691631317138672, 'objective/rlhf_reward': -1.3466938734054565, 'objective/scores': 0.42246925830841064, 'policy/approxkl_avg': 0.001152552547864616, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.0012023826129734516, 'loss/value_avg': 0.007698195520788431, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12254282087087631, 'val/ratio': 1.0046107769012451, 'val/ratio_var': 1.4488591659755912e-05, 'val/num_eos_tokens': 0, 'lr': 3.5178833904948554e-05, 'episode': 2424, 'epoch': 0.3}



                                                                       
 30%|██▉       | 607/2041 [52:44<2:03:24,  5.16s/it]

{'eps': 0, 'objective/kl': 35.24314880371094, 'objective/entropy': 4.070163249969482, 'objective/non_score_reward': -1.7621574401855469, 'objective/rlhf_reward': -1.4352819919586182, 'objective/scores': 0.3268754482269287, 'policy/approxkl_avg': 2.7576437787502073e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 0.0002986035542562604, 'loss/value_avg': 0.006159963086247444, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09889408946037292, 'val/ratio': 1.0000057220458984, 'val/ratio_var': 2.194947512634826e-08, 'val/num_eos_tokens': 0, 'lr': 3.515433610975012e-05, 'episode': 2428, 'epoch': 0.3}



                                                                       
 30%|██▉       | 608/2041 [52:49<2:02:56,  5.15s/it]

{'eps': 0, 'objective/kl': 38.22825622558594, 'objective/entropy': 6.526833534240723, 'objective/non_score_reward': -1.9114128351211548, 'objective/rlhf_reward': -1.6123955249786377, 'objective/scores': 0.2990172803401947, 'policy/approxkl_avg': 0.0010744099272415042, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.0005630906671285629, 'loss/value_avg': 0.009657555259764194, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12943531572818756, 'val/ratio': 0.9991135597229004, 'val/ratio_var': 5.832462761645729e-07, 'val/num_eos_tokens': 0, 'lr': 3.5129838314551696e-05, 'episode': 2432, 'epoch': 0.3}



                                                                       
 30%|██▉       | 609/2041 [52:54<2:02:45,  5.14s/it]

{'eps': 0, 'objective/kl': 34.843441009521484, 'objective/entropy': 6.96039342880249, 'objective/non_score_reward': -1.7421720027923584, 'objective/rlhf_reward': -1.2974767684936523, 'objective/scores': 0.4446951746940613, 'policy/approxkl_avg': 0.0010763845639303327, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.00043549208203330636, 'loss/value_avg': 0.0042793042957782745, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1076006293296814, 'val/ratio': 1.0048604011535645, 'val/ratio_var': 1.7320795450359583e-05, 'val/num_eos_tokens': 0, 'lr': 3.510534051935326e-05, 'episode': 2436, 'epoch': 0.3}



                                                                       
 30%|██▉       | 610/2041 [53:00<2:02:39,  5.14s/it]

{'eps': 0, 'objective/kl': 37.76626968383789, 'objective/entropy': 8.88351058959961, 'objective/non_score_reward': -1.8883135318756104, 'objective/rlhf_reward': -1.5541062355041504, 'objective/scores': 0.33420735597610474, 'policy/approxkl_avg': 0.0012012685183435678, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.0018844627775251865, 'loss/value_avg': 0.007493600714951754, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1490752249956131, 'val/ratio': 1.0022063255310059, 'val/ratio_var': 4.485040335566737e-06, 'val/num_eos_tokens': 0, 'lr': 3.5080842724154826e-05, 'episode': 2440, 'epoch': 0.3}



                                                                       
 30%|██▉       | 611/2041 [53:05<2:02:10,  5.13s/it]

{'eps': 0, 'objective/kl': 36.46986389160156, 'objective/entropy': 9.269763946533203, 'objective/non_score_reward': -1.8234930038452148, 'objective/rlhf_reward': -1.4548438787460327, 'objective/scores': 0.36864912509918213, 'policy/approxkl_avg': 0.006568428594619036, 'policy/clipfrac_avg': 0.03537736088037491, 'loss/policy_avg': -0.005959684029221535, 'loss/value_avg': 0.008432751521468163, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.18520092964172363, 'val/ratio': 1.0140349864959717, 'val/ratio_var': 0.00013252650387585163, 'val/num_eos_tokens': 0, 'lr': 3.50563449289564e-05, 'episode': 2444, 'epoch': 0.3}



                                                                       
 30%|██▉       | 612/2041 [53:10<2:02:31,  5.14s/it]

{'eps': 0, 'objective/kl': 37.29252243041992, 'objective/entropy': 9.439699172973633, 'objective/non_score_reward': -1.864626169204712, 'objective/rlhf_reward': -1.526652216911316, 'objective/scores': 0.337973952293396, 'policy/approxkl_avg': 6.58851204207167e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.00017026010027620941, 'loss/value_avg': 0.005543916020542383, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.20553182065486908, 'val/ratio': 0.9990925788879395, 'val/ratio_var': 5.22849063600006e-07, 'val/num_eos_tokens': 0, 'lr': 3.503184713375796e-05, 'episode': 2448, 'epoch': 0.3}



                                                                       
 30%|███       | 613/2041 [53:15<2:02:36,  5.15s/it]

{'eps': 0, 'objective/kl': 36.58066177368164, 'objective/entropy': 11.231143951416016, 'objective/non_score_reward': -1.8290332555770874, 'objective/rlhf_reward': -1.516863465309143, 'objective/scores': 0.31216979026794434, 'policy/approxkl_avg': 0.006502147763967514, 'policy/clipfrac_avg': 0.025943394750356674, 'loss/policy_avg': -0.00472914008423686, 'loss/value_avg': 0.01236663106828928, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.17263644933700562, 'val/ratio': 0.9975472688674927, 'val/ratio_var': 4.244040155754192e-06, 'val/num_eos_tokens': 0, 'lr': 3.500734933855953e-05, 'episode': 2452, 'epoch': 0.3}



                                                                       
 30%|███       | 614/2041 [53:23<2:24:57,  6.10s/it]

{'eps': 0, 'objective/kl': 39.352542877197266, 'objective/entropy': 9.704559326171875, 'objective/non_score_reward': -1.9676270484924316, 'objective/rlhf_reward': -1.5036900043487549, 'objective/scores': 0.46393710374832153, 'policy/approxkl_avg': 0.010000431910157204, 'policy/clipfrac_avg': 0.029481131583452225, 'loss/policy_avg': -0.007328630890697241, 'loss/value_avg': 0.019405748695135117, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1262221336364746, 'val/ratio': 1.0061309337615967, 'val/ratio_var': 7.392296538455412e-05, 'val/num_eos_tokens': 0, 'lr': 3.49828515433611e-05, 'episode': 2456, 'epoch': 0.3}



                                                                       
 30%|███       | 615/2041 [53:29<2:18:31,  5.83s/it]

{'eps': 0, 'objective/kl': 48.55975341796875, 'objective/entropy': 13.188604354858398, 'objective/non_score_reward': -2.427987575531006, 'objective/rlhf_reward': -1.916093349456787, 'objective/scores': 0.511894166469574, 'policy/approxkl_avg': 0.02879447117447853, 'policy/clipfrac_avg': 0.05424527823925018, 'loss/policy_avg': -0.009530263021588326, 'loss/value_avg': 0.03956964239478111, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.19570639729499817, 'val/ratio': 0.9816874265670776, 'val/ratio_var': 0.00017632827803026885, 'val/num_eos_tokens': 0, 'lr': 3.4958353748162666e-05, 'episode': 2460, 'epoch': 0.3}



                                                                       
 30%|███       | 616/2041 [53:34<2:12:51,  5.59s/it]

{'eps': 0, 'objective/kl': 46.74437713623047, 'objective/entropy': 0.07773876190185547, 'objective/non_score_reward': -2.337218761444092, 'objective/rlhf_reward': -1.0971180200576782, 'objective/scores': 1.2401007413864136, 'policy/approxkl_avg': 2.5566014301148243e-06, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 1.752039315761067e-05, 'loss/value_avg': 0.029618607833981514, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.006690869107842445, 'val/ratio': 1.0002435445785522, 'val/ratio_var': 3.566375639252328e-08, 'val/num_eos_tokens': 0, 'lr': 3.4933855952964234e-05, 'episode': 2464, 'epoch': 0.3}



                                                                       
 30%|███       | 617/2041 [53:39<2:09:19,  5.45s/it]

{'eps': 0, 'objective/kl': 53.33753204345703, 'objective/entropy': 6.078817367553711, 'objective/non_score_reward': -2.666876792907715, 'objective/rlhf_reward': -1.7292211055755615, 'objective/scores': 0.9376557469367981, 'policy/approxkl_avg': 0.0598481185734272, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.004796992987394333, 'loss/value_avg': 0.12827683985233307, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.006444195285439491, 'val/ratio': 0.9907488822937012, 'val/ratio_var': 3.962754999520257e-05, 'val/num_eos_tokens': 0, 'lr': 3.49093581577658e-05, 'episode': 2468, 'epoch': 0.3}



                                                                       
 30%|███       | 618/2041 [53:44<2:07:32,  5.38s/it]

{'eps': 0, 'objective/kl': 46.331787109375, 'objective/entropy': 0.013216972351074219, 'objective/non_score_reward': -2.31658935546875, 'objective/rlhf_reward': -1.0959093570709229, 'objective/scores': 1.2206799983978271, 'policy/approxkl_avg': 1.0289585361533682e-07, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -3.496262070257217e-05, 'loss/value_avg': 0.029372800141572952, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0011698210146278143, 'val/ratio': 1.0000576972961426, 'val/ratio_var': 2.080852823382884e-09, 'val/num_eos_tokens': 0, 'lr': 3.488486036256737e-05, 'episode': 2472, 'epoch': 0.3}



                                                                       
 30%|███       | 619/2041 [53:49<2:06:05,  5.32s/it]

{'eps': 0, 'objective/kl': 46.74181365966797, 'objective/entropy': 0.006846904754638672, 'objective/non_score_reward': -2.337090492248535, 'objective/rlhf_reward': -1.1142888069152832, 'objective/scores': 1.222801685333252, 'policy/approxkl_avg': 1.24831105452472e-08, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 1.4535661648551468e-05, 'loss/value_avg': 0.02244536206126213, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0007277142722159624, 'val/ratio': 1.0000193119049072, 'val/ratio_var': 2.377191776758991e-10, 'val/num_eos_tokens': 0, 'lr': 3.486036256736894e-05, 'episode': 2476, 'epoch': 0.3}



                                                                       
 30%|███       | 620/2041 [53:54<2:04:18,  5.25s/it]

{'eps': 0, 'objective/kl': 47.68804931640625, 'objective/entropy': 0.004361629486083984, 'objective/non_score_reward': -2.384402275085449, 'objective/rlhf_reward': -1.2195839881896973, 'objective/scores': 1.164818286895752, 'policy/approxkl_avg': 2.0114914178748222e-09, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 5.36385550731211e-06, 'loss/value_avg': 0.027206147089600563, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0005245500942692161, 'val/ratio': 1.0000077486038208, 'val/ratio_var': 4.0117242861015256e-11, 'val/num_eos_tokens': 0, 'lr': 3.4835864772170506e-05, 'episode': 2480, 'epoch': 0.3}



                                                                       
 30%|███       | 621/2041 [53:59<2:03:58,  5.24s/it]

{'eps': 0, 'objective/kl': 47.761375427246094, 'objective/entropy': 0.0033364295959472656, 'objective/non_score_reward': -2.3880691528320312, 'objective/rlhf_reward': -1.2218449115753174, 'objective/scores': 1.1662242412567139, 'policy/approxkl_avg': 4.3214729017471143e-10, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 2.225614935014164e-06, 'loss/value_avg': 0.024655412882566452, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0004287755873519927, 'val/ratio': 1.0000033378601074, 'val/ratio_var': 7.647808601685124e-12, 'val/num_eos_tokens': 0, 'lr': 3.4811366976972074e-05, 'episode': 2484, 'epoch': 0.3}



                                                                       
 30%|███       | 622/2041 [54:04<2:02:59,  5.20s/it]

{'eps': 0, 'objective/kl': 46.31095886230469, 'objective/entropy': 0.0030965805053710938, 'objective/non_score_reward': -2.3155479431152344, 'objective/rlhf_reward': -1.137162208557129, 'objective/scores': 1.1783857345581055, 'policy/approxkl_avg': 1.2934772697370533e-10, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -2.1367704050589964e-08, 'loss/value_avg': 0.02324136346578598, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0004140700912103057, 'val/ratio': 1.0000019073486328, 'val/ratio_var': 2.6669038071663875e-12, 'val/num_eos_tokens': 0, 'lr': 3.478686918177364e-05, 'episode': 2488, 'epoch': 0.3}



                                                                       
 31%|███       | 623/2041 [54:10<2:02:58,  5.20s/it]

{'eps': 0, 'objective/kl': 48.84864807128906, 'objective/entropy': 0.002834320068359375, 'objective/non_score_reward': -2.442432403564453, 'objective/rlhf_reward': -1.2435846328735352, 'objective/scores': 1.198847770690918, 'policy/approxkl_avg': 2.7439283484254062e-11, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 1.2567583098643809e-06, 'loss/value_avg': 0.022853486239910126, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0003878193092532456, 'val/ratio': 1.0000009536743164, 'val/ratio_var': 4.902744876744691e-13, 'val/num_eos_tokens': 0, 'lr': 3.476237138657521e-05, 'episode': 2492, 'epoch': 0.31}



                                                                       
 31%|███       | 624/2041 [54:15<2:02:25,  5.18s/it]

{'eps': 0, 'objective/kl': 46.40135192871094, 'objective/entropy': 0.0028123855590820312, 'objective/non_score_reward': -2.3200671672821045, 'objective/rlhf_reward': -1.0519391298294067, 'objective/scores': 1.2681280374526978, 'policy/approxkl_avg': 5.6779073630275345e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -2.7946705927206494e-07, 'loss/value_avg': 0.02565453201532364, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00038631458301097155, 'val/ratio': 1.0000004768371582, 'val/ratio_var': 1.373716000977951e-13, 'val/num_eos_tokens': 0, 'lr': 3.473787359137678e-05, 'episode': 2496, 'epoch': 0.31}



                                                                       
 31%|███       | 625/2041 [54:20<2:02:23,  5.19s/it]

{'eps': 0, 'objective/kl': 48.21367645263672, 'objective/entropy': 0.0025968551635742188, 'objective/non_score_reward': -2.410684108734131, 'objective/rlhf_reward': -1.1474981307983398, 'objective/scores': 1.263185977935791, 'policy/approxkl_avg': 4.95503030967237e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -5.6230796019463014e-08, 'loss/value_avg': 0.025303274393081665, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0003582329081837088, 'val/ratio': 1.0000004768371582, 'val/ratio_var': 7.342274710312249e-14, 'val/num_eos_tokens': 0, 'lr': 3.4713375796178346e-05, 'episode': 2500, 'epoch': 0.31}



                                                                       
 31%|███       | 626/2041 [54:25<2:02:21,  5.19s/it]

{'eps': 0, 'objective/kl': 43.02256774902344, 'objective/entropy': 0.0023398399353027344, 'objective/non_score_reward': -2.1511282920837402, 'objective/rlhf_reward': -0.845130443572998, 'objective/scores': 1.3059978485107422, 'policy/approxkl_avg': 6.254921060266927e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -6.50028027848748e-07, 'loss/value_avg': 0.02834356389939785, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0003313185879960656, 'val/ratio': 1.0000004768371582, 'val/ratio_var': 1.373716000977951e-13, 'val/num_eos_tokens': 0, 'lr': 3.4688878000979914e-05, 'episode': 2504, 'epoch': 0.31}



                                                                       
 31%|███       | 627/2041 [54:30<2:02:00,  5.18s/it]

{'eps': 0, 'objective/kl': 46.266845703125, 'objective/entropy': 0.0026311874389648438, 'objective/non_score_reward': -2.3133420944213867, 'objective/rlhf_reward': -0.984145998954773, 'objective/scores': 1.3291960954666138, 'policy/approxkl_avg': 1.698867435617757e-11, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -5.701802479052276e-07, 'loss/value_avg': 0.02690809592604637, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00036409441963769495, 'val/ratio': 1.0000005960464478, 'val/ratio_var': 2.1316282072803006e-13, 'val/num_eos_tokens': 0, 'lr': 3.466438020578148e-05, 'episode': 2508, 'epoch': 0.31}



                                                                       
 31%|███       | 628/2041 [54:36<2:01:39,  5.17s/it]

{'eps': 0, 'objective/kl': 44.514286041259766, 'objective/entropy': 0.002559185028076172, 'objective/non_score_reward': -2.2257142066955566, 'objective/rlhf_reward': -0.9010369777679443, 'objective/scores': 1.3246772289276123, 'policy/approxkl_avg': 2.081756295124748e-11, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -8.400880915360176e-07, 'loss/value_avg': 0.030464408919215202, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.000357393961166963, 'val/ratio': 1.0000007152557373, 'val/ratio_var': 3.78956116703702e-13, 'val/num_eos_tokens': 0, 'lr': 3.4639882410583044e-05, 'episode': 2512, 'epoch': 0.31}



                                                                       
 31%|███       | 629/2041 [54:41<2:01:31,  5.16s/it]

{'eps': 0, 'objective/kl': 47.46659469604492, 'objective/entropy': 0.0030770301818847656, 'objective/non_score_reward': -2.3733296394348145, 'objective/rlhf_reward': -1.0486681461334229, 'objective/scores': 1.3246614933013916, 'policy/approxkl_avg': 2.11371718739084e-11, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 8.209696034100489e-08, 'loss/value_avg': 0.02137206867337227, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00041969094309024513, 'val/ratio': 1.0000007152557373, 'val/ratio_var': 3.836930773104541e-13, 'val/num_eos_tokens': 0, 'lr': 3.461538461538462e-05, 'episode': 2516, 'epoch': 0.31}



                                                                       
 31%|███       | 630/2041 [54:46<2:01:32,  5.17s/it]

{'eps': 0, 'objective/kl': 48.53202819824219, 'objective/entropy': 0.0021371841430664062, 'objective/non_score_reward': -2.4266011714935303, 'objective/rlhf_reward': -1.0757274627685547, 'objective/scores': 1.3508737087249756, 'policy/approxkl_avg': 2.99017841737248e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 6.74769609076975e-08, 'loss/value_avg': 0.02245454676449299, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0003074340056627989, 'val/ratio': 1.000000238418579, 'val/ratio_var': 8.05281744607235e-14, 'val/num_eos_tokens': 0, 'lr': 3.459088682018619e-05, 'episode': 2520, 'epoch': 0.31}



                                                                       
 31%|███       | 631/2041 [54:51<2:00:48,  5.14s/it]

{'eps': 0, 'objective/kl': 46.525604248046875, 'objective/entropy': 0.0021381378173828125, 'objective/non_score_reward': -2.326280117034912, 'objective/rlhf_reward': -0.9474050998687744, 'objective/scores': 1.3788750171661377, 'policy/approxkl_avg': 1.9133709373841956e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.5238545358897682e-07, 'loss/value_avg': 0.020303379744291306, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00031026580836623907, 'val/ratio': 1.000000238418579, 'val/ratio_var': 4.5001041060850275e-14, 'val/num_eos_tokens': 0, 'lr': 3.4566389024987755e-05, 'episode': 2524, 'epoch': 0.31}



                                                                       
 31%|███       | 632/2041 [54:56<2:00:30,  5.13s/it]

{'eps': 0, 'objective/kl': 47.954254150390625, 'objective/entropy': 0.0026235580444335938, 'objective/non_score_reward': -2.3977127075195312, 'objective/rlhf_reward': -0.9968026876449585, 'objective/scores': 1.4009100198745728, 'policy/approxkl_avg': 5.293945837259173e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.9793240824128588e-07, 'loss/value_avg': 0.021651051938533783, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00036581282620318234, 'val/ratio': 1.0000003576278687, 'val/ratio_var': 9.47390291759255e-14, 'val/num_eos_tokens': 0, 'lr': 3.454189122978932e-05, 'episode': 2528, 'epoch': 0.31}



                                                                       
 31%|███       | 633/2041 [55:01<2:00:06,  5.12s/it]

{'eps': 0, 'objective/kl': 43.83612823486328, 'objective/entropy': 0.0022482872009277344, 'objective/non_score_reward': -2.1918063163757324, 'objective/rlhf_reward': -0.7686246633529663, 'objective/scores': 1.4231816530227661, 'policy/approxkl_avg': 5.961051530861683e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -3.497555667308916e-07, 'loss/value_avg': 0.03352672606706619, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00032180885318666697, 'val/ratio': 1.000000238418579, 'val/ratio_var': 1.3026617274019409e-13, 'val/num_eos_tokens': 0, 'lr': 3.451739343459089e-05, 'episode': 2532, 'epoch': 0.31}



                                                                       
 31%|███       | 634/2041 [55:06<2:00:23,  5.13s/it]

{'eps': 0, 'objective/kl': 45.94123077392578, 'objective/entropy': 0.00174713134765625, 'objective/non_score_reward': -2.2970616817474365, 'objective/rlhf_reward': -0.82016921043396, 'objective/scores': 1.4768924713134766, 'policy/approxkl_avg': 5.88168532975053e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -4.307279084514448e-07, 'loss/value_avg': 0.022524747997522354, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00025803412427194417, 'val/ratio': 1.0000004768371582, 'val/ratio_var': 7.342274710312249e-14, 'val/num_eos_tokens': 0, 'lr': 3.449289563939246e-05, 'episode': 2536, 'epoch': 0.31}



                                                                       
 31%|███       | 635/2041 [55:11<2:00:30,  5.14s/it]

{'eps': 0, 'objective/kl': 47.262176513671875, 'objective/entropy': 0.0016665458679199219, 'objective/non_score_reward': -2.3631091117858887, 'objective/rlhf_reward': -0.8155543804168701, 'objective/scores': 1.5475547313690186, 'policy/approxkl_avg': 3.6680094725460854e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 4.4984633795763784e-09, 'loss/value_avg': 0.02863067016005516, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00024855363881215453, 'val/ratio': 1.000000238418579, 'val/ratio_var': 5.447494194556375e-14, 'val/num_eos_tokens': 0, 'lr': 3.446839784419402e-05, 'episode': 2540, 'epoch': 0.31}



                                                                       
 31%|███       | 636/2041 [55:17<2:00:10,  5.13s/it]

{'eps': 0, 'objective/kl': 47.6909065246582, 'objective/entropy': 0.0021529197692871094, 'objective/non_score_reward': -2.38454532623291, 'objective/rlhf_reward': -0.9279054403305054, 'objective/scores': 1.4566398859024048, 'policy/approxkl_avg': 2.2479962769744732e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 7.253773048887524e-08, 'loss/value_avg': 0.02053467556834221, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0003085271455347538, 'val/ratio': 1.000000238418579, 'val/ratio_var': 8.05281744607235e-14, 'val/num_eos_tokens': 0, 'lr': 3.4443900048995595e-05, 'episode': 2544, 'epoch': 0.31}



                                                                       
 31%|███       | 637/2041 [55:22<1:59:33,  5.11s/it]

{'eps': 0, 'objective/kl': 44.49676513671875, 'objective/entropy': 0.0018148422241210938, 'objective/non_score_reward': -2.2248382568359375, 'objective/rlhf_reward': -0.7010003328323364, 'objective/scores': 1.523837924003601, 'policy/approxkl_avg': 2.2136758572044446e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -3.632509333328926e-07, 'loss/value_avg': 0.02708755061030388, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0002666531945578754, 'val/ratio': 1.000000238418579, 'val/ratio_var': 8.05281744607235e-14, 'val/num_eos_tokens': 0, 'lr': 3.441940225379716e-05, 'episode': 2548, 'epoch': 0.31}



                                                                       
 31%|███▏      | 638/2041 [55:27<2:00:22,  5.15s/it]

{'eps': 0, 'objective/kl': 48.379798889160156, 'objective/entropy': 0.0017666816711425781, 'objective/non_score_reward': -2.418989896774292, 'objective/rlhf_reward': -0.8819131851196289, 'objective/scores': 1.537076711654663, 'policy/approxkl_avg': 2.5354310664860158e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -2.5866167874255552e-08, 'loss/value_avg': 0.019475821405649185, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00026170711498707533, 'val/ratio': 1.000000238418579, 'val/ratio_var': 8.05281744607235e-14, 'val/num_eos_tokens': 0, 'lr': 3.4394904458598724e-05, 'episode': 2552, 'epoch': 0.31}



                                                                       
 31%|███▏      | 639/2041 [55:32<2:00:49,  5.17s/it]

{'eps': 0, 'objective/kl': 47.86296081542969, 'objective/entropy': 0.002106189727783203, 'objective/non_score_reward': -2.3931479454040527, 'objective/rlhf_reward': -0.8507547378540039, 'objective/scores': 1.5423932075500488, 'policy/approxkl_avg': 3.2518725125857406e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.8443701321757544e-07, 'loss/value_avg': 0.021412499248981476, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00030283431988209486, 'val/ratio': 1.000000238418579, 'val/ratio_var': 5.447494194556375e-14, 'val/num_eos_tokens': 0, 'lr': 3.437040666340029e-05, 'episode': 2556, 'epoch': 0.31}



                                                                       
 31%|███▏      | 640/2041 [55:37<2:00:11,  5.15s/it]

{'eps': 0, 'objective/kl': 45.02750778198242, 'objective/entropy': 0.0019321441650390625, 'objective/non_score_reward': -2.251375198364258, 'objective/rlhf_reward': -0.6783332824707031, 'objective/scores': 1.5730419158935547, 'policy/approxkl_avg': 4.519588262669183e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -4.6559100042031787e-07, 'loss/value_avg': 0.024854080751538277, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0002837563515640795, 'val/ratio': 1.000000238418579, 'val/ratio_var': 1.3026617274019409e-13, 'val/num_eos_tokens': 0, 'lr': 3.434590886820187e-05, 'episode': 2560, 'epoch': 0.31}



                                                                       
 31%|███▏      | 641/2041 [55:42<1:59:57,  5.14s/it]

{'eps': 0, 'objective/kl': 47.20292282104492, 'objective/entropy': 0.0017781257629394531, 'objective/non_score_reward': -2.3601460456848145, 'objective/rlhf_reward': -0.688884973526001, 'objective/scores': 1.6712610721588135, 'policy/approxkl_avg': 4.521733248247228e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.0149658891123181e-07, 'loss/value_avg': 0.02803967520594597, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00026161264395341277, 'val/ratio': 1.0000003576278687, 'val/ratio_var': 9.47390291759255e-14, 'val/num_eos_tokens': 0, 'lr': 3.432141107300343e-05, 'episode': 2564, 'epoch': 0.31}



                                                                       
 31%|███▏      | 642/2041 [55:47<2:00:02,  5.15s/it]

{'eps': 0, 'objective/kl': 46.20166778564453, 'objective/entropy': 0.0011758804321289062, 'objective/non_score_reward': -2.3100833892822266, 'objective/rlhf_reward': -0.6899211406707764, 'objective/scores': 1.6201622486114502, 'policy/approxkl_avg': 1.756783412319718e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -3.4188323638773e-07, 'loss/value_avg': 0.024327518418431282, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0001863061188487336, 'val/ratio': 1.000000238418579, 'val/ratio_var': 5.447494194556375e-14, 'val/num_eos_tokens': 0, 'lr': 3.4296913277804996e-05, 'episode': 2568, 'epoch': 0.31}



                                                                       
 32%|███▏      | 643/2041 [55:53<2:00:13,  5.16s/it]

{'eps': 0, 'objective/kl': 47.123504638671875, 'objective/entropy': 0.0016226768493652344, 'objective/non_score_reward': -2.356175422668457, 'objective/rlhf_reward': -0.7222881317138672, 'objective/scores': 1.6338872909545898, 'policy/approxkl_avg': 3.1274605301334635e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.602577697212837e-07, 'loss/value_avg': 0.024395572021603584, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00024389545433223248, 'val/ratio': 1.000000238418579, 'val/ratio_var': 6.158037269129654e-14, 'val/num_eos_tokens': 0, 'lr': 3.427241548260657e-05, 'episode': 2572, 'epoch': 0.32}



                                                                       
 32%|███▏      | 644/2041 [55:58<2:00:29,  5.18s/it]

{'eps': 0, 'objective/kl': 47.20843505859375, 'objective/entropy': 0.0014476776123046875, 'objective/non_score_reward': -2.360421657562256, 'objective/rlhf_reward': -0.6934524774551392, 'objective/scores': 1.6669691801071167, 'policy/approxkl_avg': 2.4367593446511515e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -2.502270604054502e-07, 'loss/value_avg': 0.020926453173160553, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00022107025142759085, 'val/ratio': 1.000000238418579, 'val/ratio_var': 8.05281744607235e-14, 'val/num_eos_tokens': 0, 'lr': 3.424791768740814e-05, 'episode': 2576, 'epoch': 0.32}



                                                                       
 32%|███▏      | 645/2041 [56:03<2:00:10,  5.17s/it]

{'eps': 0, 'objective/kl': 48.35934829711914, 'objective/entropy': 0.0016226768493652344, 'objective/non_score_reward': -2.4179673194885254, 'objective/rlhf_reward': -0.7344264984130859, 'objective/scores': 1.6835408210754395, 'policy/approxkl_avg': 1.707447659822503e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 9.896620412064294e-08, 'loss/value_avg': 0.023516161367297173, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00024402592680417, 'val/ratio': 1.000000238418579, 'val/ratio_var': 4.5001041060850275e-14, 'val/num_eos_tokens': 0, 'lr': 3.42234198922097e-05, 'episode': 2580, 'epoch': 0.32}



                                                                       
 32%|███▏      | 646/2041 [56:08<1:59:45,  5.15s/it]

{'eps': 0, 'objective/kl': 47.170982360839844, 'objective/entropy': 0.0016374588012695312, 'objective/non_score_reward': -2.3585493564605713, 'objective/rlhf_reward': -0.6482880115509033, 'objective/scores': 1.710261344909668, 'policy/approxkl_avg': 4.976480382293258e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -3.767463496728851e-08, 'loss/value_avg': 0.030737828463315964, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00024716134066693485, 'val/ratio': 1.0000001192092896, 'val/ratio_var': 2.1316282072803006e-14, 'val/num_eos_tokens': 0, 'lr': 3.419892209701127e-05, 'episode': 2584, 'epoch': 0.32}



                                                                       
 32%|███▏      | 647/2041 [56:13<1:59:24,  5.14s/it]

{'eps': 0, 'objective/kl': 47.840274810791016, 'objective/entropy': 0.0016946792602539062, 'objective/non_score_reward': -2.3920137882232666, 'objective/rlhf_reward': -0.6771137714385986, 'objective/scores': 1.714900016784668, 'policy/approxkl_avg': 5.148083565345574e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -7.366234200389954e-08, 'loss/value_avg': 0.023536890745162964, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00025534856831654906, 'val/ratio': 1.0000001192092896, 'val/ratio_var': 2.1316282072803006e-14, 'val/num_eos_tokens': 0, 'lr': 3.4174424301812843e-05, 'episode': 2588, 'epoch': 0.32}



                                                                       
 32%|███▏      | 648/2041 [56:18<1:59:43,  5.16s/it]

{'eps': 0, 'objective/kl': 45.22117233276367, 'objective/entropy': 0.0016431808471679688, 'objective/non_score_reward': -2.261058807373047, 'objective/rlhf_reward': -0.48044705390930176, 'objective/scores': 1.7806117534637451, 'policy/approxkl_avg': 2.2415613202403373e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -4.7346330234177003e-07, 'loss/value_avg': 0.0254152063280344, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0002469881437718868, 'val/ratio': 1.000000238418579, 'val/ratio_var': 8.05281744607235e-14, 'val/num_eos_tokens': 0, 'lr': 3.4149926506614405e-05, 'episode': 2592, 'epoch': 0.32}



                                                                       
 32%|███▏      | 649/2041 [56:24<1:59:47,  5.16s/it]

{'eps': 0, 'objective/kl': 46.04169464111328, 'objective/entropy': 0.0011348724365234375, 'objective/non_score_reward': -2.3020846843719482, 'objective/rlhf_reward': -0.5195118188858032, 'objective/scores': 1.782572865486145, 'policy/approxkl_avg': 2.2737367544323206e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -3.767463567783125e-07, 'loss/value_avg': 0.026122961193323135, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00018063581956084818, 'val/ratio': 1.000000238418579, 'val/ratio_var': 8.05281744607235e-14, 'val/num_eos_tokens': 0, 'lr': 3.412542871141597e-05, 'episode': 2596, 'epoch': 0.32}



                                                                       
 32%|███▏      | 650/2041 [56:29<1:59:31,  5.16s/it]

{'eps': 0, 'objective/kl': 45.67240905761719, 'objective/entropy': 0.0016169548034667969, 'objective/non_score_reward': -2.283620834350586, 'objective/rlhf_reward': -0.4999321699142456, 'objective/scores': 1.7836886644363403, 'policy/approxkl_avg': 5.6371513360020664e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -4.959556463290937e-07, 'loss/value_avg': 0.020154278725385666, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.000243414135184139, 'val/ratio': 1.0000004768371582, 'val/ratio_var': 7.342274710312249e-14, 'val/num_eos_tokens': 0, 'lr': 3.410093091621754e-05, 'episode': 2600, 'epoch': 0.32}



                                                                       
 32%|███▏      | 651/2041 [56:34<1:59:13,  5.15s/it]

{'eps': 0, 'objective/kl': 43.81058883666992, 'objective/entropy': 0.001453399658203125, 'objective/non_score_reward': -2.1905293464660645, 'objective/rlhf_reward': -0.4069598913192749, 'objective/scores': 1.7835694551467896, 'policy/approxkl_avg': 7.893727338448286e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.305679120378045e-06, 'loss/value_avg': 0.02532079443335533, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00022202168474905193, 'val/ratio': 1.0000004768371582, 'val/ratio_var': 1.1842378477584098e-13, 'val/num_eos_tokens': 0, 'lr': 3.407643312101911e-05, 'episode': 2604, 'epoch': 0.32}



                                                                       
 32%|███▏      | 652/2041 [56:39<1:58:38,  5.13s/it]

{'eps': 0, 'objective/kl': 46.7401008605957, 'objective/entropy': 0.001636505126953125, 'objective/non_score_reward': -2.337005138397217, 'objective/rlhf_reward': -0.5402040481567383, 'objective/scores': 1.7968010902404785, 'policy/approxkl_avg': 1.3129757697738498e-11, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -7.101949677235098e-07, 'loss/value_avg': 0.021752741187810898, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00024403042334597558, 'val/ratio': 1.0000005960464478, 'val/ratio_var': 2.2974214219408096e-13, 'val/num_eos_tokens': 0, 'lr': 3.405193532582068e-05, 'episode': 2608, 'epoch': 0.32}



                                                                       
 32%|███▏      | 653/2041 [56:44<1:58:53,  5.14s/it]

{'eps': 0, 'objective/kl': 44.50370788574219, 'objective/entropy': 0.0010538101196289062, 'objective/non_score_reward': -2.2251853942871094, 'objective/rlhf_reward': -0.3729667663574219, 'objective/scores': 1.8522186279296875, 'policy/approxkl_avg': 4.631130114812754e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -8.310911994158232e-07, 'loss/value_avg': 0.02815089374780655, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0001688543416094035, 'val/ratio': 1.0000003576278687, 'val/ratio_var': 9.47390291759255e-14, 'val/num_eos_tokens': 0, 'lr': 3.4027437530622245e-05, 'episode': 2612, 'epoch': 0.32}



                                                                         
 32%|███▏      | 654/2041 [56:49<1:59:07,  5.15s/it]

{'eps': 0, 'objective/kl': 46.40643310546875, 'objective/entropy': 0.0011754035949707031, 'objective/non_score_reward': -2.320321559906006, 'objective/rlhf_reward': -0.4646613597869873, 'objective/scores': 1.8556602001190186, 'policy/approxkl_avg': 4.753396894846551e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -3.053332306990342e-07, 'loss/value_avg': 0.01883387565612793, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00018392418860457838, 'val/ratio': 1.0000003576278687, 'val/ratio_var': 9.47390291759255e-14, 'val/num_eos_tokens': 0, 'lr': 3.400293973542381e-05, 'episode': 2616, 'epoch': 0.32}



                                                                         
 32%|███▏      | 655/2041 [56:55<1:59:27,  5.17s/it]

{'eps': 0, 'objective/kl': 46.072509765625, 'objective/entropy': 0.001079559326171875, 'objective/non_score_reward': -2.3036255836486816, 'objective/rlhf_reward': -0.3800837993621826, 'objective/scores': 1.923541784286499, 'policy/approxkl_avg': 3.333383699274939e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -5.623079459837754e-07, 'loss/value_avg': 0.028782349079847336, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00017207299242727458, 'val/ratio': 1.000000238418579, 'val/ratio_var': 8.76336018183245e-14, 'val/num_eos_tokens': 0, 'lr': 3.397844194022538e-05, 'episode': 2620, 'epoch': 0.32}



                                                                         
 32%|███▏      | 656/2041 [57:00<1:58:52,  5.15s/it]

{'eps': 0, 'objective/kl': 47.92730712890625, 'objective/entropy': 0.0008955001831054688, 'objective/non_score_reward': -2.396365165710449, 'objective/rlhf_reward': -0.5209019184112549, 'objective/scores': 1.8754632472991943, 'policy/approxkl_avg': 1.115418087754838e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 1.0796313176797412e-07, 'loss/value_avg': 0.01767321117222309, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00014708851813338697, 'val/ratio': 1.000000238418579, 'val/ratio_var': 8.05281744607235e-14, 'val/num_eos_tokens': 0, 'lr': 3.395394414502695e-05, 'episode': 2624, 'epoch': 0.32}



                                                                         
 32%|███▏      | 657/2041 [57:05<1:59:12,  5.17s/it]

{'eps': 0, 'objective/kl': 46.38312530517578, 'objective/entropy': 0.0013079643249511719, 'objective/non_score_reward': -2.3191561698913574, 'objective/rlhf_reward': -0.43794918060302734, 'objective/scores': 1.88120698928833, 'policy/approxkl_avg': 9.137847813492361e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -3.148924676565912e-08, 'loss/value_avg': 0.017636023461818695, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0002052536583505571, 'val/ratio': 1.0000001192092896, 'val/ratio_var': 2.3684757293981375e-14, 'val/num_eos_tokens': 0, 'lr': 3.392944634982852e-05, 'episode': 2628, 'epoch': 0.32}



                                                                         
 32%|███▏      | 658/2041 [57:10<1:59:01,  5.16s/it]

{'eps': 0, 'objective/kl': 47.924659729003906, 'objective/entropy': 0.0009589195251464844, 'objective/non_score_reward': -2.396233081817627, 'objective/rlhf_reward': -0.4623842239379883, 'objective/scores': 1.9338488578796387, 'policy/approxkl_avg': 4.054115712841949e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.0346466439159485e-07, 'loss/value_avg': 0.018613608554005623, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00015575930592603981, 'val/ratio': 1.0, 'val/ratio_var': 4.736951712906159e-15, 'val/num_eos_tokens': 0, 'lr': 3.3904948554630085e-05, 'episode': 2632, 'epoch': 0.32}



                                                                         
 32%|███▏      | 659/2041 [57:15<1:58:52,  5.16s/it]

{'eps': 0, 'objective/kl': 49.60145950317383, 'objective/entropy': 0.0011315345764160156, 'objective/non_score_reward': -2.4800734519958496, 'objective/rlhf_reward': -0.5119237899780273, 'objective/scores': 1.9681496620178223, 'policy/approxkl_avg': 6.628157766565279e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.012154342561189e-07, 'loss/value_avg': 0.018975645303726196, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0001785192871466279, 'val/ratio': 1.0000001192092896, 'val/ratio_var': 2.1316282072803006e-14, 'val/num_eos_tokens': 0, 'lr': 3.388045075943165e-05, 'episode': 2636, 'epoch': 0.32}



                                                                         
 32%|███▏      | 660/2041 [57:20<1:59:10,  5.18s/it]

{'eps': 0, 'objective/kl': 43.71019744873047, 'objective/entropy': 0.0010251998901367188, 'objective/non_score_reward': -2.18550968170166, 'objective/rlhf_reward': -0.21921372413635254, 'objective/scores': 1.9662959575653076, 'policy/approxkl_avg': 9.609755482684057e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -4.63341791601124e-07, 'loss/value_avg': 0.025083012878894806, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00016533653251826763, 'val/ratio': 1.0000001192092896, 'val/ratio_var': 2.842170943040401e-14, 'val/num_eos_tokens': 0, 'lr': 3.385595296423322e-05, 'episode': 2640, 'epoch': 0.32}



                                                                         
 32%|███▏      | 661/2041 [57:26<1:59:08,  5.18s/it]

{'eps': 0, 'objective/kl': 43.59752655029297, 'objective/entropy': 0.0011038780212402344, 'objective/non_score_reward': -2.1798768043518066, 'objective/rlhf_reward': -0.26782500743865967, 'objective/scores': 1.912051796913147, 'policy/approxkl_avg': 3.2518725125857406e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -8.333404366567265e-07, 'loss/value_avg': 0.02959149330854416, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00017709328676573932, 'val/ratio': 1.000000238418579, 'val/ratio_var': 8.05281744607235e-14, 'val/num_eos_tokens': 0, 'lr': 3.383145516903479e-05, 'episode': 2644, 'epoch': 0.32}



                                                                         
 32%|███▏      | 662/2041 [57:31<1:59:27,  5.20s/it]

{'eps': 0, 'objective/kl': 46.22308349609375, 'objective/entropy': 0.0009036064147949219, 'objective/non_score_reward': -2.3111538887023926, 'objective/rlhf_reward': -0.32916581630706787, 'objective/scores': 1.9819880723953247, 'policy/approxkl_avg': 1.982012210605122e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.7206623681431665e-07, 'loss/value_avg': 0.025544216856360435, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0001481029175920412, 'val/ratio': 1.000000238418579, 'val/ratio_var': 8.05281744607235e-14, 'val/num_eos_tokens': 0, 'lr': 3.380695737383636e-05, 'episode': 2648, 'epoch': 0.32}



                                                                         
 32%|███▏      | 663/2041 [57:36<1:59:21,  5.20s/it]

{'eps': 0, 'objective/kl': 46.73719787597656, 'objective/entropy': 0.0007071495056152344, 'objective/non_score_reward': -2.336860179901123, 'objective/rlhf_reward': -0.3267197608947754, 'objective/scores': 2.0101404190063477, 'policy/approxkl_avg': 7.293117932488657e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.2904968116345117e-07, 'loss/value_avg': 0.0202416330575943, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00012099068408133462, 'val/ratio': 1.0000001192092896, 'val/ratio_var': 2.1316282072803006e-14, 'val/num_eos_tokens': 0, 'lr': 3.3782459578637925e-05, 'episode': 2652, 'epoch': 0.32}



                                                                         
 33%|███▎      | 664/2041 [57:41<1:59:33,  5.21s/it]

{'eps': 0, 'objective/kl': 47.07728958129883, 'objective/entropy': 0.0009679794311523438, 'objective/non_score_reward': -2.3538646697998047, 'objective/rlhf_reward': -0.38492822647094727, 'objective/scores': 1.9689364433288574, 'policy/approxkl_avg': 1.2634254536667e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.748777833654458e-07, 'loss/value_avg': 0.01977883279323578, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0001567219733260572, 'val/ratio': 1.000000238418579, 'val/ratio_var': 4.5001041060850275e-14, 'val/num_eos_tokens': 0, 'lr': 3.375796178343949e-05, 'episode': 2656, 'epoch': 0.33}



                                                                         
 33%|███▎      | 665/2041 [57:46<1:58:52,  5.18s/it]

{'eps': 0, 'objective/kl': 47.93316650390625, 'objective/entropy': 0.0008068084716796875, 'objective/non_score_reward': -2.396658420562744, 'objective/rlhf_reward': -0.40564215183258057, 'objective/scores': 1.9910162687301636, 'policy/approxkl_avg': 7.636323756492203e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -8.828234854263428e-08, 'loss/value_avg': 0.016797438263893127, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0001346772478427738, 'val/ratio': 1.0000001192092896, 'val/ratio_var': 2.3684757293981375e-14, 'val/num_eos_tokens': 0, 'lr': 3.373346398824106e-05, 'episode': 2660, 'epoch': 0.33}



                                                                         
 33%|███▎      | 666/2041 [57:51<1:58:11,  5.16s/it]

{'eps': 0, 'objective/kl': 47.78308868408203, 'objective/entropy': 0.0009012222290039062, 'objective/non_score_reward': -2.3891544342041016, 'objective/rlhf_reward': -0.3982534408569336, 'objective/scores': 1.990900993347168, 'policy/approxkl_avg': 4.633275100390799e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 8.772004633783581e-08, 'loss/value_avg': 0.024581678211688995, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0001488159323344007, 'val/ratio': 1.0000001192092896, 'val/ratio_var': 2.1316282072803006e-14, 'val/num_eos_tokens': 0, 'lr': 3.370896619304263e-05, 'episode': 2664, 'epoch': 0.33}



                                                                         
 33%|███▎      | 667/2041 [57:57<1:57:43,  5.14s/it]

{'eps': 0, 'objective/kl': 45.33847427368164, 'objective/entropy': 0.0009331703186035156, 'objective/non_score_reward': -2.2669239044189453, 'objective/rlhf_reward': -0.25586891174316406, 'objective/scores': 2.0110549926757812, 'policy/approxkl_avg': 3.30335368434187e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.7881393432617188e-07, 'loss/value_avg': 0.02174999564886093, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00015297025674954057, 'val/ratio': 1.0, 'val/ratio_var': 4.736951712906159e-15, 'val/num_eos_tokens': 0, 'lr': 3.368446839784419e-05, 'episode': 2668, 'epoch': 0.33}



                                                                         
 33%|███▎      | 668/2041 [58:02<1:57:13,  5.12s/it]

{'eps': 0, 'objective/kl': 48.29110336303711, 'objective/entropy': 0.0009303092956542969, 'objective/non_score_reward': -2.414555311203003, 'objective/rlhf_reward': -0.3795609474182129, 'objective/scores': 2.03499436378479, 'policy/approxkl_avg': 5.662891488199262e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.2258314541213622e-07, 'loss/value_avg': 0.01787429489195347, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00015166345110628754, 'val/ratio': 1.0000001192092896, 'val/ratio_var': 2.3684757293981375e-14, 'val/num_eos_tokens': 0, 'lr': 3.3659970602645766e-05, 'episode': 2672, 'epoch': 0.33}



                                                                         
 33%|███▎      | 669/2041 [58:07<1:57:16,  5.13s/it]

{'eps': 0, 'objective/kl': 46.05606460571289, 'objective/entropy': 0.0009756088256835938, 'objective/non_score_reward': -2.3028032779693604, 'objective/rlhf_reward': -0.2819526195526123, 'objective/scores': 2.020850658416748, 'policy/approxkl_avg': 9.738457869973294e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -3.10956295379583e-07, 'loss/value_avg': 0.020307481288909912, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00015806927694939077, 'val/ratio': 1.0000001192092896, 'val/ratio_var': 2.3684757293981375e-14, 'val/num_eos_tokens': 0, 'lr': 3.3635472807447334e-05, 'episode': 2676, 'epoch': 0.33}



                                                                         
 33%|███▎      | 670/2041 [58:12<1:57:41,  5.15s/it]

{'eps': 0, 'objective/kl': 45.434593200683594, 'objective/entropy': 0.0010938644409179688, 'objective/non_score_reward': -2.2717297077178955, 'objective/rlhf_reward': -0.21575045585632324, 'objective/scores': 2.0559792518615723, 'policy/approxkl_avg': 2.0763935276030265e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -4.374755917524453e-07, 'loss/value_avg': 0.023437052965164185, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00017383413796778768, 'val/ratio': 1.000000238418579, 'val/ratio_var': 8.05281744607235e-14, 'val/num_eos_tokens': 0, 'lr': 3.3610975012248895e-05, 'episode': 2680, 'epoch': 0.33}



                                                                         
 33%|███▎      | 671/2041 [58:17<1:57:21,  5.14s/it]

{'eps': 0, 'objective/kl': 46.80183792114258, 'objective/entropy': 0.0007290840148925781, 'objective/non_score_reward': -2.340092182159424, 'objective/rlhf_reward': -0.30574727058410645, 'objective/scores': 2.0343449115753174, 'policy/approxkl_avg': 9.695557074210215e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -3.6381325685397314e-07, 'loss/value_avg': 0.021712739020586014, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00012363577843643725, 'val/ratio': 1.0000001192092896, 'val/ratio_var': 2.3684757293981375e-14, 'val/num_eos_tokens': 0, 'lr': 3.358647721705046e-05, 'episode': 2684, 'epoch': 0.33}



                                                                         
 33%|███▎      | 672/2041 [58:22<1:57:17,  5.14s/it]

{'eps': 0, 'objective/kl': 47.02067565917969, 'objective/entropy': 0.0009608268737792969, 'objective/non_score_reward': -2.3510336875915527, 'objective/rlhf_reward': -0.29975271224975586, 'objective/scores': 2.051280975341797, 'policy/approxkl_avg': 2.5139807770246936e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -3.9586478806086234e-07, 'loss/value_avg': 0.02173905074596405, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00015414886001963168, 'val/ratio': 1.000000238418579, 'val/ratio_var': 8.05281744607235e-14, 'val/num_eos_tokens': 0, 'lr': 3.356197942185204e-05, 'episode': 2688, 'epoch': 0.33}



                                                                         
 33%|███▎      | 673/2041 [58:27<1:57:52,  5.17s/it]

{'eps': 0, 'objective/kl': 45.05221939086914, 'objective/entropy': 0.0006399154663085938, 'objective/non_score_reward': -2.252610683441162, 'objective/rlhf_reward': -0.1310431957244873, 'objective/scores': 2.121567487716675, 'policy/approxkl_avg': 9.545404831140525e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -4.897702297057549e-07, 'loss/value_avg': 0.021684961393475533, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0001097962522180751, 'val/ratio': 1.0000001192092896, 'val/ratio_var': 2.3684757293981375e-14, 'val/num_eos_tokens': 0, 'lr': 3.3537481626653606e-05, 'episode': 2692, 'epoch': 0.33}



                                                                         
 33%|███▎      | 674/2041 [58:33<1:57:12,  5.14s/it]

{'eps': 0, 'objective/kl': 43.947425842285156, 'objective/entropy': 0.0008115768432617188, 'objective/non_score_reward': -2.197371482849121, 'objective/rlhf_reward': -0.1239924430847168, 'objective/scores': 2.0733790397644043, 'policy/approxkl_avg': 1.3277765388911011e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -2.389808742009336e-07, 'loss/value_avg': 0.02078344114124775, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00013572091120295227, 'val/ratio': 1.0000001192092896, 'val/ratio_var': 2.3684757293981375e-14, 'val/num_eos_tokens': 0, 'lr': 3.351298383145517e-05, 'episode': 2696, 'epoch': 0.33}



                                                                         
 33%|███▎      | 675/2041 [58:38<1:57:20,  5.15s/it]

{'eps': 0, 'objective/kl': 46.76704406738281, 'objective/entropy': 0.0008368492126464844, 'objective/non_score_reward': -2.3383524417877197, 'objective/rlhf_reward': -0.21802234649658203, 'objective/scores': 2.1203300952911377, 'policy/approxkl_avg': 1.2655705476649626e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -3.497555667308916e-07, 'loss/value_avg': 0.018587950617074966, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00013913074508309364, 'val/ratio': 1.000000238418579, 'val/ratio_var': 4.5001041060850275e-14, 'val/num_eos_tokens': 0, 'lr': 3.348848603625674e-05, 'episode': 2700, 'epoch': 0.33}



                                                                         
 33%|███▎      | 676/2041 [58:43<1:57:15,  5.15s/it]

{'eps': 0, 'objective/kl': 47.7474365234375, 'objective/entropy': 0.0006880760192871094, 'objective/non_score_reward': -2.387371778488159, 'objective/rlhf_reward': -0.23242831230163574, 'objective/scores': 2.1549434661865234, 'policy/approxkl_avg': 9.438153383833914e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -2.2042472380690015e-07, 'loss/value_avg': 0.01815800368785858, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.000118111667688936, 'val/ratio': 1.0000001192092896, 'val/ratio_var': 2.3684757293981375e-14, 'val/num_eos_tokens': 0, 'lr': 3.346398824105831e-05, 'episode': 2704, 'epoch': 0.33}



                                                                         
 33%|███▎      | 677/2041 [58:48<1:57:41,  5.18s/it]

{'eps': 0, 'objective/kl': 46.1036376953125, 'objective/entropy': 0.0005512237548828125, 'objective/non_score_reward': -2.3051819801330566, 'objective/rlhf_reward': -0.1257929801940918, 'objective/scores': 2.179388999938965, 'policy/approxkl_avg': 4.0755661107234886e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.5744623738100927e-07, 'loss/value_avg': 0.018406735733151436, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 9.772912744665518e-05, 'val/ratio': 1.0000001192092896, 'val/ratio_var': 2.1316282072803006e-14, 'val/num_eos_tokens': 0, 'lr': 3.343949044585987e-05, 'episode': 2708, 'epoch': 0.33}



                                                                         
 33%|███▎      | 678/2041 [58:53<1:57:34,  5.18s/it]

{'eps': 0, 'objective/kl': 47.691246032714844, 'objective/entropy': 0.0006456375122070312, 'objective/non_score_reward': -2.3845620155334473, 'objective/rlhf_reward': -0.21067285537719727, 'objective/scores': 2.17388916015625, 'policy/approxkl_avg': 3.6894597620074077e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.4451315166752465e-07, 'loss/value_avg': 0.022823844105005264, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00011197350977454334, 'val/ratio': 1.0, 'val/ratio_var': 4.736951712906159e-15, 'val/num_eos_tokens': 0, 'lr': 3.341499265066144e-05, 'episode': 2712, 'epoch': 0.33}



                                                                         
 33%|███▎      | 679/2041 [58:58<1:56:49,  5.15s/it]

{'eps': 0, 'objective/kl': 46.92448425292969, 'objective/entropy': 0.0006666183471679688, 'objective/non_score_reward': -2.346224308013916, 'objective/rlhf_reward': -0.18201494216918945, 'objective/scores': 2.1642093658447266, 'policy/approxkl_avg': 3.796711480364562e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -6.86015724227218e-08, 'loss/value_avg': 0.017545107752084732, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00011568699846975505, 'val/ratio': 1.0000001192092896, 'val/ratio_var': 2.1316282072803006e-14, 'val/num_eos_tokens': 0, 'lr': 3.3390494855463014e-05, 'episode': 2716, 'epoch': 0.33}



                                                                         
 33%|███▎      | 680/2041 [59:04<1:56:49,  5.15s/it]

{'eps': 0, 'objective/kl': 46.85594940185547, 'objective/entropy': 0.0007519721984863281, 'objective/non_score_reward': -2.34279727935791, 'objective/rlhf_reward': -0.17600536346435547, 'objective/scores': 2.1667919158935547, 'policy/approxkl_avg': 3.603658441531793e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -5.28569472635354e-08, 'loss/value_avg': 0.01861572265625, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00012839515693485737, 'val/ratio': 1.0000001192092896, 'val/ratio_var': 2.1316282072803006e-14, 'val/num_eos_tokens': 0, 'lr': 3.3365997060264575e-05, 'episode': 2720, 'epoch': 0.33}



                                                                         
 33%|███▎      | 681/2041 [59:09<1:56:52,  5.16s/it]

{'eps': 0, 'objective/kl': 45.59482955932617, 'objective/entropy': 0.0006074905395507812, 'objective/non_score_reward': -2.2797412872314453, 'objective/rlhf_reward': -0.1364743709564209, 'objective/scores': 2.1432669162750244, 'policy/approxkl_avg': 1.630226308764124e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -8.828235564806164e-08, 'loss/value_avg': 0.016856780275702477, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0001061120128724724, 'val/ratio': 1.0, 'val/ratio_var': 4.736951712906159e-15, 'val/num_eos_tokens': 0, 'lr': 3.334149926506614e-05, 'episode': 2724, 'epoch': 0.33}



                                                                         
 33%|███▎      | 682/2041 [59:14<1:56:12,  5.13s/it]

{'eps': 0, 'objective/kl': 45.035892486572266, 'objective/entropy': 0.0007638931274414062, 'objective/non_score_reward': -2.2517948150634766, 'objective/rlhf_reward': -0.12450456619262695, 'objective/scores': 2.1272902488708496, 'policy/approxkl_avg': 2.6598431031484016e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.0177774356634472e-07, 'loss/value_avg': 0.01769557036459446, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00012969970703125, 'val/ratio': 1.0000001192092896, 'val/ratio_var': 2.1316282072803006e-14, 'val/num_eos_tokens': 0, 'lr': 3.331700146986771e-05, 'episode': 2728, 'epoch': 0.33}



                                                                         
 33%|███▎      | 683/2041 [59:19<1:56:10,  5.13s/it]

{'eps': 0, 'objective/kl': 44.943878173828125, 'objective/entropy': 0.0005660057067871094, 'objective/non_score_reward': -2.2471938133239746, 'objective/rlhf_reward': -0.07651376724243164, 'objective/scores': 2.170680046081543, 'policy/approxkl_avg': 2.0806837155995517e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -9.221849950336036e-08, 'loss/value_avg': 0.016043147072196007, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 9.94722795439884e-05, 'val/ratio': 1.0, 'val/ratio_var': 2.3684758564530796e-15, 'val/num_eos_tokens': 0, 'lr': 3.329250367466928e-05, 'episode': 2732, 'epoch': 0.33}



                                                                         
 34%|███▎      | 684/2041 [59:24<1:56:15,  5.14s/it]

{'eps': 0, 'objective/kl': 44.515830993652344, 'objective/entropy': 0.0006170272827148438, 'objective/non_score_reward': -2.2257914543151855, 'objective/rlhf_reward': -0.050394296646118164, 'objective/scores': 2.1753971576690674, 'policy/approxkl_avg': 1.8876305412415112e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.2595698706263647e-07, 'loss/value_avg': 0.01722445897758007, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00010823978664120659, 'val/ratio': 1.0, 'val/ratio_var': 2.3684758564530796e-15, 'val/num_eos_tokens': 0, 'lr': 3.326800587947085e-05, 'episode': 2736, 'epoch': 0.34}



                                                                         
 34%|███▎      | 685/2041 [59:29<1:56:51,  5.17s/it]

{'eps': 0, 'objective/kl': 47.31685256958008, 'objective/entropy': 0.0004649162292480469, 'objective/non_score_reward': -2.365842580795288, 'objective/rlhf_reward': -0.11477136611938477, 'objective/scores': 2.2510712146759033, 'policy/approxkl_avg': 1.1368683772161603e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -7.36623491093269e-08, 'loss/value_avg': 0.020223723724484444, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 8.554503438062966e-05, 'val/ratio': 1.0, 'val/ratio_var': 4.736951712906159e-15, 'val/num_eos_tokens': 0, 'lr': 3.3243508084272415e-05, 'episode': 2740, 'epoch': 0.34}



                                                                         
 34%|███▎      | 686/2041 [59:34<1:56:31,  5.16s/it]

{'eps': 0, 'objective/kl': 44.64741516113281, 'objective/entropy': 0.0006618499755859375, 'objective/non_score_reward': -2.2323708534240723, 'objective/rlhf_reward': -0.0993494987487793, 'objective/scores': 2.133021354675293, 'policy/approxkl_avg': 3.174651297052633e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.4676238890842797e-07, 'loss/value_avg': 0.015708215534687042, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00011491550685605034, 'val/ratio': 1.0, 'val/ratio_var': 4.736951712906159e-15, 'val/num_eos_tokens': 0, 'lr': 3.321901028907399e-05, 'episode': 2744, 'epoch': 0.34}



                                                                         
 34%|███▎      | 687/2041 [59:40<1:56:36,  5.17s/it]

{'eps': 0, 'objective/kl': 46.24909591674805, 'objective/entropy': 0.000606536865234375, 'objective/non_score_reward': -2.31245493888855, 'objective/rlhf_reward': -0.16812729835510254, 'objective/scores': 2.1443276405334473, 'policy/approxkl_avg': 1.8876305412415112e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.0740082245774829e-07, 'loss/value_avg': 0.0168271716684103, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0001057341432897374, 'val/ratio': 1.0, 'val/ratio_var': 2.3684758564530796e-15, 'val/num_eos_tokens': 0, 'lr': 3.319451249387555e-05, 'episode': 2748, 'epoch': 0.34}



                                                                         
 34%|███▎      | 688/2041 [59:45<1:56:50,  5.18s/it]

{'eps': 0, 'objective/kl': 49.284202575683594, 'objective/entropy': 0.0005269050598144531, 'objective/non_score_reward': -2.464210033416748, 'objective/rlhf_reward': -0.2913236618041992, 'objective/scores': 2.172886371612549, 'policy/approxkl_avg': 2.0592331821927407e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.4620007426913162e-08, 'loss/value_avg': 0.015917222946882248, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 9.422931907465681e-05, 'val/ratio': 1.0, 'val/ratio_var': 4.736951712906159e-15, 'val/num_eos_tokens': 0, 'lr': 3.317001469867712e-05, 'episode': 2752, 'epoch': 0.34}



                                                                         
 34%|███▍      | 689/2041 [59:50<1:56:20,  5.16s/it]

{'eps': 0, 'objective/kl': 46.27082061767578, 'objective/entropy': 0.0004887580871582031, 'objective/non_score_reward': -2.3135411739349365, 'objective/rlhf_reward': -0.08832955360412598, 'objective/scores': 2.2252116203308105, 'policy/approxkl_avg': 1.265570493454854e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.349539076045403e-07, 'loss/value_avg': 0.015194524079561234, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 8.783474913798273e-05, 'val/ratio': 1.0, 'val/ratio_var': 2.3684758564530796e-15, 'val/num_eos_tokens': 0, 'lr': 3.314551690347869e-05, 'episode': 2756, 'epoch': 0.34}



                                                                         
 34%|███▍      | 690/2041 [59:55<1:56:33,  5.18s/it]

{'eps': 0, 'objective/kl': 47.23100662231445, 'objective/entropy': 0.0005006790161132812, 'objective/non_score_reward': -2.3615503311157227, 'objective/rlhf_reward': -0.1825389862060547, 'objective/scores': 2.179011344909668, 'policy/approxkl_avg': 1.9734319972423975e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -7.647388855502868e-08, 'loss/value_avg': 0.014024948701262474, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 9.051583765540272e-05, 'val/ratio': 1.0, 'val/ratio_var': 4.736951712906159e-15, 'val/num_eos_tokens': 0, 'lr': 3.3121019108280256e-05, 'episode': 2760, 'epoch': 0.34}



                                                                         
 34%|███▍      | 691/2041 [1:00:00<1:56:03,  5.16s/it]

{'eps': 0, 'objective/kl': 46.05236053466797, 'objective/entropy': 0.0005893707275390625, 'objective/non_score_reward': -2.3026180267333984, 'objective/rlhf_reward': -0.1461186408996582, 'objective/scores': 2.1564993858337402, 'policy/approxkl_avg': 1.0296166588590061e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -5.623084220474084e-10, 'loss/value_avg': 0.01563004031777382, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00010386727808509022, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.3096521313081824e-05, 'episode': 2764, 'epoch': 0.34}



                                                                         
 34%|███▍      | 692/2041 [1:00:05<1:56:11,  5.17s/it]

{'eps': 0, 'objective/kl': 46.05044174194336, 'objective/entropy': 0.0005521774291992188, 'objective/non_score_reward': -2.3025221824645996, 'objective/rlhf_reward': -0.1347799301147461, 'objective/scores': 2.1677422523498535, 'policy/approxkl_avg': 2.0163325219549333e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.540723815196543e-07, 'loss/value_avg': 0.019571121782064438, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 9.669672726886347e-05, 'val/ratio': 1.0, 'val/ratio_var': 4.736951712906159e-15, 'val/num_eos_tokens': 0, 'lr': 3.307202351788339e-05, 'episode': 2768, 'epoch': 0.34}



                                                                         
 34%|███▍      | 693/2041 [1:00:11<1:55:31,  5.14s/it]

{'eps': 0, 'objective/kl': 44.270957946777344, 'objective/entropy': 0.0005655288696289062, 'objective/non_score_reward': -2.213547945022583, 'objective/rlhf_reward': -0.11214613914489746, 'objective/scores': 2.1014018058776855, 'policy/approxkl_avg': 2.2093856963129738e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -8.209696744643225e-08, 'loss/value_avg': 0.016355620697140694, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 9.963647607946768e-05, 'val/ratio': 1.0, 'val/ratio_var': 4.736951712906159e-15, 'val/num_eos_tokens': 0, 'lr': 3.304752572268496e-05, 'episode': 2772, 'epoch': 0.34}



                                                                         
 34%|███▍      | 694/2041 [1:00:16<1:55:55,  5.16s/it]

{'eps': 0, 'objective/kl': 44.8984260559082, 'objective/entropy': 0.0006113052368164062, 'objective/non_score_reward': -2.2449216842651367, 'objective/rlhf_reward': -0.1624007225036621, 'objective/scores': 2.0825209617614746, 'policy/approxkl_avg': 3.410605402699024e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -3.182663022016641e-07, 'loss/value_avg': 0.018355419859290123, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00010667432070476934, 'val/ratio': 1.0, 'val/ratio_var': 7.105427357601002e-15, 'val/num_eos_tokens': 0, 'lr': 3.302302792748653e-05, 'episode': 2776, 'epoch': 0.34}



                                                                         
 34%|███▍      | 695/2041 [1:00:21<1:55:07,  5.13s/it]

{'eps': 0, 'objective/kl': 44.567230224609375, 'objective/entropy': 0.0004763603210449219, 'objective/non_score_reward': -2.2283616065979004, 'objective/rlhf_reward': -0.1412057876586914, 'objective/scores': 2.087155818939209, 'policy/approxkl_avg': 2.380988743840018e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -2.069293287831897e-07, 'loss/value_avg': 0.014842122793197632, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 8.650995732750744e-05, 'val/ratio': 1.0, 'val/ratio_var': 4.736951712906159e-15, 'val/num_eos_tokens': 0, 'lr': 3.2998530132288096e-05, 'episode': 2780, 'epoch': 0.34}



                                                                         
 34%|███▍      | 696/2041 [1:00:26<1:55:17,  5.14s/it]

{'eps': 0, 'objective/kl': 46.58702850341797, 'objective/entropy': 0.0004124641418457031, 'objective/non_score_reward': -2.3293514251708984, 'objective/rlhf_reward': -0.1464090347290039, 'objective/scores': 2.1829423904418945, 'policy/approxkl_avg': 1.3299214160489292e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.0177774356634472e-07, 'loss/value_avg': 0.015784669667482376, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 7.612300396431237e-05, 'val/ratio': 1.0, 'val/ratio_var': 4.736951712906159e-15, 'val/num_eos_tokens': 0, 'lr': 3.2974032337089664e-05, 'episode': 2784, 'epoch': 0.34}



                                                                         
 34%|███▍      | 697/2041 [1:00:31<1:55:27,  5.15s/it]

{'eps': 0, 'objective/kl': 48.63944625854492, 'objective/entropy': 0.0004634857177734375, 'objective/non_score_reward': -2.4319725036621094, 'objective/rlhf_reward': -0.35016822814941406, 'objective/scores': 2.0818042755126953, 'policy/approxkl_avg': 1.201219435335507e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 2.0805394740364136e-08, 'loss/value_avg': 0.01281055063009262, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 8.418649667873979e-05, 'val/ratio': 1.0, 'val/ratio_var': 2.3684758564530796e-15, 'val/num_eos_tokens': 0, 'lr': 3.294953454189123e-05, 'episode': 2788, 'epoch': 0.34}



                                                                         
 34%|███▍      | 698/2041 [1:00:36<1:55:09,  5.14s/it]

{'eps': 0, 'objective/kl': 47.152061462402344, 'objective/entropy': 0.0007181167602539062, 'objective/non_score_reward': -2.357603073120117, 'objective/rlhf_reward': -0.32848691940307617, 'objective/scores': 2.029116153717041, 'policy/approxkl_avg': 3.4320555295300204e-14, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 1.911847036240033e-08, 'loss/value_avg': 0.013291358947753906, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00012265960685908794, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.29250367466928e-05, 'episode': 2792, 'epoch': 0.34}



                                                                         
 34%|███▍      | 699/2041 [1:00:42<1:55:34,  5.17s/it]

{'eps': 0, 'objective/kl': 47.90741729736328, 'objective/entropy': 0.0005407333374023438, 'objective/non_score_reward': -2.3953709602355957, 'objective/rlhf_reward': -0.35585737228393555, 'objective/scores': 2.03951358795166, 'policy/approxkl_avg': 0.0, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.4057699004865754e-08, 'loss/value_avg': 0.012892811559140682, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 9.678670176072046e-05, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.290053895149437e-05, 'episode': 2796, 'epoch': 0.34}



                                                                         
 34%|███▍      | 700/2041 [1:00:47<1:56:06,  5.20s/it]

{'eps': 0, 'objective/kl': 46.723331451416016, 'objective/entropy': 0.0003829002380371094, 'objective/non_score_reward': -2.3361666202545166, 'objective/rlhf_reward': -0.18287444114685059, 'objective/scores': 2.153292179107666, 'policy/approxkl_avg': 0.0, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 3.3738478677491912e-09, 'loss/value_avg': 0.012787261046469212, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 7.261869905050844e-05, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.2876041156295936e-05, 'episode': 2800, 'epoch': 0.34}



                                                                         
 34%|███▍      | 701/2041 [1:00:52<1:55:53,  5.19s/it]

{'eps': 0, 'objective/kl': 46.228004455566406, 'objective/entropy': 0.0005273818969726562, 'objective/non_score_reward': -2.3114001750946045, 'objective/rlhf_reward': -0.30440187454223633, 'objective/scores': 2.006998300552368, 'policy/approxkl_avg': 2.7885451177431415e-14, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -3.345732579873584e-08, 'loss/value_avg': 0.009842779487371445, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 9.350506297778338e-05, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.2851543361097504e-05, 'episode': 2804, 'epoch': 0.34}



                                                                         
 34%|███▍      | 702/2041 [1:00:57<1:56:09,  5.21s/it]

{'eps': 0, 'objective/kl': 45.72965621948242, 'objective/entropy': 0.00047779083251953125, 'objective/non_score_reward': -2.286482810974121, 'objective/rlhf_reward': -0.25091052055358887, 'objective/scores': 2.0355722904205322, 'policy/approxkl_avg': 0.0, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -8.434620113462188e-09, 'loss/value_avg': 0.011539235711097717, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 8.695080759935081e-05, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.282704556589907e-05, 'episode': 2808, 'epoch': 0.34}



                                                                         
 34%|███▍      | 703/2041 [1:01:02<1:55:43,  5.19s/it]

{'eps': 0, 'objective/kl': 45.56569290161133, 'objective/entropy': 0.0004558563232421875, 'objective/non_score_reward': -2.2782845497131348, 'objective/rlhf_reward': -0.2715773582458496, 'objective/scores': 2.006707191467285, 'policy/approxkl_avg': 2.5740416471475153e-14, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.8837315707287416e-08, 'loss/value_avg': 0.012092825025320053, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 8.281222108053043e-05, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.2802547770700634e-05, 'episode': 2812, 'epoch': 0.34}



                                                                         
 34%|███▍      | 704/2041 [1:01:07<1:55:08,  5.17s/it]

{'eps': 0, 'objective/kl': 45.46246337890625, 'objective/entropy': 0.0005054473876953125, 'objective/non_score_reward': -2.273123264312744, 'objective/rlhf_reward': -0.28884339332580566, 'objective/scores': 1.9842798709869385, 'policy/approxkl_avg': 6.006097176677536e-14, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 1.6869238450567536e-08, 'loss/value_avg': 0.014348848722875118, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 9.114337444771081e-05, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.277804997550221e-05, 'episode': 2816, 'epoch': 0.34}



                                                                         
 35%|███▍      | 705/2041 [1:01:13<1:54:34,  5.15s/it]

{'eps': 0, 'objective/kl': 48.05029296875, 'objective/entropy': 0.0005173683166503906, 'objective/non_score_reward': -2.402514934539795, 'objective/rlhf_reward': -0.40032005310058594, 'objective/scores': 2.002194881439209, 'policy/approxkl_avg': 2.5740416471475153e-14, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 3.373847690113507e-08, 'loss/value_avg': 0.012007411569356918, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 9.25716376514174e-05, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.2753552180303776e-05, 'episode': 2820, 'epoch': 0.35}



                                                                         
 35%|███▍      | 706/2041 [1:01:18<1:54:45,  5.16s/it]

{'eps': 0, 'objective/kl': 47.95117950439453, 'objective/entropy': 0.000453948974609375, 'objective/non_score_reward': -2.39755916595459, 'objective/rlhf_reward': -0.42237579822540283, 'objective/scores': 1.975183367729187, 'policy/approxkl_avg': 1.7160277647650102e-14, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -5.623079779581985e-10, 'loss/value_avg': 0.011661665514111519, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 8.241859904956073e-05, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.272905438510534e-05, 'episode': 2824, 'epoch': 0.35}



                                                                         
 35%|███▍      | 707/2041 [1:01:23<1:54:35,  5.15s/it]

{'eps': 0, 'objective/kl': 44.82207489013672, 'objective/entropy': 0.0005216598510742188, 'objective/non_score_reward': -2.2411036491394043, 'objective/rlhf_reward': -0.38183534145355225, 'objective/scores': 1.859268307685852, 'policy/approxkl_avg': 2.5740416471475153e-14, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -8.153465458349274e-09, 'loss/value_avg': 0.011928178369998932, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 9.407637116964906e-05, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.270455658990691e-05, 'episode': 2828, 'epoch': 0.35}



                                                                         
 35%|███▍      | 708/2041 [1:01:28<1:54:09,  5.14s/it]

{'eps': 0, 'objective/kl': 46.507080078125, 'objective/entropy': 0.0005631446838378906, 'objective/non_score_reward': -2.3253540992736816, 'objective/rlhf_reward': -0.49970459938049316, 'objective/scores': 1.8256494998931885, 'policy/approxkl_avg': 0.0, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 2.3335779530953005e-08, 'loss/value_avg': 0.012178287841379642, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00010020328045357019, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.268005879470848e-05, 'episode': 2832, 'epoch': 0.35}



                                                                         
 35%|███▍      | 709/2041 [1:01:33<1:53:33,  5.12s/it]

{'eps': 0, 'objective/kl': 45.856834411621094, 'objective/entropy': 0.00052642822265625, 'objective/non_score_reward': -2.292841911315918, 'objective/rlhf_reward': -0.4507991075515747, 'objective/scores': 1.8420428037643433, 'policy/approxkl_avg': 0.0, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -2.305462665219693e-08, 'loss/value_avg': 0.009973939508199692, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 9.468816278968006e-05, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.265556099951004e-05, 'episode': 2836, 'epoch': 0.35}



                                                                         
 35%|███▍      | 710/2041 [1:01:38<1:53:58,  5.14s/it]

{'eps': 0, 'objective/kl': 44.484161376953125, 'objective/entropy': 0.0004649162292480469, 'objective/non_score_reward': -2.224207878112793, 'objective/rlhf_reward': -0.37499427795410156, 'objective/scores': 1.8492136001586914, 'policy/approxkl_avg': 8.580138823825051e-15, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.0121543603247574e-08, 'loss/value_avg': 0.014617684297263622, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 8.47982955747284e-05, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.263106320431161e-05, 'episode': 2840, 'epoch': 0.35}



                                                                         
 35%|███▍      | 711/2041 [1:01:43<1:53:44,  5.13s/it]

{'eps': 0, 'objective/kl': 47.81172180175781, 'objective/entropy': 0.00043964385986328125, 'objective/non_score_reward': -2.3905863761901855, 'objective/rlhf_reward': -0.596233606338501, 'objective/scores': 1.7943527698516846, 'policy/approxkl_avg': 1.7160277647650102e-14, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 2.5725590546699095e-08, 'loss/value_avg': 0.014390402473509312, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 8.07789183454588e-05, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.2606565409113185e-05, 'episode': 2844, 'epoch': 0.35}



                                                                         
 35%|███▍      | 712/2041 [1:01:49<1:54:11,  5.16s/it]

{'eps': 0, 'objective/kl': 43.94580841064453, 'objective/entropy': 0.0005927085876464844, 'objective/non_score_reward': -2.1972904205322266, 'objective/rlhf_reward': -0.3988983631134033, 'objective/scores': 1.7983920574188232, 'policy/approxkl_avg': 2.7885451177431415e-14, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -2.1367704050589964e-08, 'loss/value_avg': 0.009112550877034664, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00010465450759511441, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.258206761391475e-05, 'episode': 2848, 'epoch': 0.35}



                                                                         
 35%|███▍      | 713/2041 [1:01:54<1:53:51,  5.14s/it]

{'eps': 0, 'objective/kl': 44.72616958618164, 'objective/entropy': 0.0003800392150878906, 'objective/non_score_reward': -2.2363083362579346, 'objective/rlhf_reward': -0.5110527276992798, 'objective/scores': 1.7252556085586548, 'policy/approxkl_avg': 0.0, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -2.9521169508939238e-08, 'loss/value_avg': 0.01155012845993042, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 7.194843055913225e-05, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.2557569818716314e-05, 'episode': 2852, 'epoch': 0.35}



                                                                         
 35%|███▍      | 714/2041 [1:01:59<1:54:02,  5.16s/it]

{'eps': 0, 'objective/kl': 43.759307861328125, 'objective/entropy': 0.0004119873046875, 'objective/non_score_reward': -2.187965154647827, 'objective/rlhf_reward': -0.46661829948425293, 'objective/scores': 1.7213468551635742, 'policy/approxkl_avg': 0.0, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 3.092693878770092e-09, 'loss/value_avg': 0.015154380351305008, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 7.661558629479259e-05, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.253307202351789e-05, 'episode': 2856, 'epoch': 0.35}



                                                                         
 35%|███▌      | 715/2041 [1:02:04<1:53:46,  5.15s/it]

{'eps': 0, 'objective/kl': 46.229759216308594, 'objective/entropy': 0.0005340576171875, 'objective/non_score_reward': -2.311488151550293, 'objective/rlhf_reward': -0.6682155132293701, 'objective/scores': 1.6432726383209229, 'policy/approxkl_avg': 4.2900694119125254e-14, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 1.8977893034843873e-08, 'loss/value_avg': 0.010835809633135796, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 9.50345493038185e-05, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.250857422831946e-05, 'episode': 2860, 'epoch': 0.35}



                                                                         
 35%|███▌      | 716/2041 [1:02:09<1:54:12,  5.17s/it]

{'eps': 0, 'objective/kl': 47.64131164550781, 'objective/entropy': 0.00047588348388671875, 'objective/non_score_reward': -2.382065534591675, 'objective/rlhf_reward': -0.6997814178466797, 'objective/scores': 1.6822841167449951, 'policy/approxkl_avg': 0.0, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 1.7993855294662353e-08, 'loss/value_avg': 0.016931667923927307, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 8.682259795023128e-05, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.248407643312102e-05, 'episode': 2864, 'epoch': 0.35}



                                                                         
 35%|███▌      | 717/2041 [1:02:14<1:54:02,  5.17s/it]

{'eps': 0, 'objective/kl': 48.18831253051758, 'objective/entropy': 0.00043392181396484375, 'objective/non_score_reward': -2.4094157218933105, 'objective/rlhf_reward': -0.8433666229248047, 'objective/scores': 1.5660490989685059, 'policy/approxkl_avg': 8.580138823825051e-15, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.1949044420589416e-08, 'loss/value_avg': 0.014539996162056923, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 7.995569467311725e-05, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.2459578637922586e-05, 'episode': 2868, 'epoch': 0.35}



                                                                         
 35%|███▌      | 718/2041 [1:02:20<1:53:52,  5.16s/it]

{'eps': 0, 'objective/kl': 46.001182556152344, 'objective/entropy': 0.0004715919494628906, 'objective/non_score_reward': -2.3000593185424805, 'objective/rlhf_reward': -0.7876172065734863, 'objective/scores': 1.5124421119689941, 'policy/approxkl_avg': 3.4320555295300204e-14, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -5.932349012027771e-08, 'loss/value_avg': 0.012283802032470703, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 8.632102981209755e-05, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.243508084272416e-05, 'episode': 2872, 'epoch': 0.35}



                                                                         
 35%|███▌      | 719/2041 [1:02:25<1:54:13,  5.18s/it]

{'eps': 0, 'objective/kl': 44.822322845458984, 'objective/entropy': 0.0004177093505859375, 'objective/non_score_reward': -2.2411160469055176, 'objective/rlhf_reward': -0.7776561975479126, 'objective/scores': 1.463459849357605, 'policy/approxkl_avg': 0.0, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 2.249231911832794e-09, 'loss/value_avg': 0.011933721601963043, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 7.72251223679632e-05, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.241058304752572e-05, 'episode': 2876, 'epoch': 0.35}



                                                                         
 35%|███▌      | 720/2041 [1:02:30<1:54:18,  5.19s/it]

{'eps': 0, 'objective/kl': 47.24435043334961, 'objective/entropy': 0.00040531158447265625, 'objective/non_score_reward': -2.362217426300049, 'objective/rlhf_reward': -0.8306653499603271, 'objective/scores': 1.5315520763397217, 'policy/approxkl_avg': 4.2900694119125254e-14, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -6.7476957354983824e-09, 'loss/value_avg': 0.014069751836359501, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 7.649188046343625e-05, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.238608525232729e-05, 'episode': 2880, 'epoch': 0.35}



                                                                         
 35%|███▌      | 721/2041 [1:02:35<1:53:17,  5.15s/it]

{'eps': 0, 'objective/kl': 46.87944793701172, 'objective/entropy': 0.0004420280456542969, 'objective/non_score_reward': -2.3439724445343018, 'objective/rlhf_reward': -0.9388148784637451, 'objective/scores': 1.4051575660705566, 'policy/approxkl_avg': 6.006097176677536e-14, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -8.856349609231984e-08, 'loss/value_avg': 0.014506994746625423, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 8.173259266186506e-05, 'val/ratio': 1.0, 'val/ratio_var': 0.0, 'val/num_eos_tokens': 0, 'lr': 3.236158745712886e-05, 'episode': 2884, 'epoch': 0.35}



                                                                         
 35%|███▌      | 722/2041 [1:02:40<1:53:11,  5.15s/it]

{'eps': 0, 'objective/kl': 45.591373443603516, 'objective/entropy': 0.000537872314453125, 'objective/non_score_reward': -2.279568672180176, 'objective/rlhf_reward': -0.9153326749801636, 'objective/scores': 1.3642359972000122, 'policy/approxkl_avg': 6.220600477866572e-14, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -5.2294641506023254e-08, 'loss/value_avg': 0.011678370647132397, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 9.678895003162324e-05, 'val/ratio': 1.0, 'val/ratio_var': 5.921189641132699e-16, 'val/num_eos_tokens': 0, 'lr': 3.2337089661930426e-05, 'episode': 2888, 'epoch': 0.35}



                                                                         
 35%|███▌      | 723/2041 [1:02:45<1:53:43,  5.18s/it]

{'eps': 0, 'objective/kl': 46.075714111328125, 'objective/entropy': 0.0004277229309082031, 'objective/non_score_reward': -2.303786039352417, 'objective/rlhf_reward': -0.9392310380935669, 'objective/scores': 1.36455500125885, 'policy/approxkl_avg': 6.006097176677536e-14, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -9.840390191584447e-08, 'loss/value_avg': 0.013658769428730011, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 7.967904093675315e-05, 'val/ratio': 1.0, 'val/ratio_var': 5.921189641132699e-16, 'val/num_eos_tokens': 0, 'lr': 3.2312591866731994e-05, 'episode': 2892, 'epoch': 0.35}



                                                                         
 35%|███▌      | 724/2041 [1:02:51<1:53:25,  5.17s/it]

{'eps': 0, 'objective/kl': 46.71073913574219, 'objective/entropy': 0.0006108283996582031, 'objective/non_score_reward': -2.3355369567871094, 'objective/rlhf_reward': -1.1131675243377686, 'objective/scores': 1.2223694324493408, 'policy/approxkl_avg': 9.223648727392161e-14, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 6.185388201629394e-08, 'loss/value_avg': 0.011877752840518951, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00010770221706479788, 'val/ratio': 0.9999999403953552, 'val/ratio_var': 5.329070518200751e-15, 'val/num_eos_tokens': 0, 'lr': 3.228809407153356e-05, 'episode': 2896, 'epoch': 0.35}



                                                                         
 36%|███▌      | 725/2041 [1:02:56<1:53:22,  5.17s/it]

{'eps': 0, 'objective/kl': 45.32443618774414, 'objective/entropy': 0.00043392181396484375, 'objective/non_score_reward': -2.2662220001220703, 'objective/rlhf_reward': -1.0372998714447021, 'objective/scores': 1.2289221286773682, 'policy/approxkl_avg': 6.864111059060041e-14, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.4057698649594386e-07, 'loss/value_avg': 0.01725814864039421, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 8.051350596360862e-05, 'val/ratio': 1.0, 'val/ratio_var': 5.921189641132699e-16, 'val/num_eos_tokens': 0, 'lr': 3.226359627633514e-05, 'episode': 2900, 'epoch': 0.36}



                                                                         
 36%|███▌      | 726/2041 [1:03:01<1:54:06,  5.21s/it]

{'eps': 0, 'objective/kl': 47.95314025878906, 'objective/entropy': 0.00045680999755859375, 'objective/non_score_reward': -2.3976571559906006, 'objective/rlhf_reward': -1.1322256326675415, 'objective/scores': 1.265431523323059, 'policy/approxkl_avg': 1.7160277647650102e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.2904968116345117e-07, 'loss/value_avg': 0.012959647923707962, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 8.317659376189113e-05, 'val/ratio': 1.0, 'val/ratio_var': 1.1842379282265398e-15, 'val/num_eos_tokens': 0, 'lr': 3.22390984811367e-05, 'episode': 2904, 'epoch': 0.36}



                                                                         
 36%|███▌      | 727/2041 [1:03:06<1:53:26,  5.18s/it]

{'eps': 0, 'objective/kl': 45.46637725830078, 'objective/entropy': 0.0004901885986328125, 'objective/non_score_reward': -2.2733187675476074, 'objective/rlhf_reward': -1.1499488353729248, 'objective/scores': 1.1233699321746826, 'policy/approxkl_avg': 1.2870208913363934e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.1583543368942628e-07, 'loss/value_avg': 0.014683877117931843, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 8.902235276764259e-05, 'val/ratio': 1.0, 'val/ratio_var': 1.1842379282265398e-15, 'val/num_eos_tokens': 0, 'lr': 3.2214600685938267e-05, 'episode': 2908, 'epoch': 0.36}



                                                                         
 36%|███▌      | 728/2041 [1:03:11<1:53:11,  5.17s/it]

{'eps': 0, 'objective/kl': 47.46559524536133, 'objective/entropy': 0.00048828125, 'objective/non_score_reward': -2.3732800483703613, 'objective/rlhf_reward': -1.2590575218200684, 'objective/scores': 1.114222526550293, 'policy/approxkl_avg': 2.9172470645752457e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -2.7271937597106444e-07, 'loss/value_avg': 0.018506117165088654, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 8.99535370990634e-05, 'val/ratio': 0.9999999403953552, 'val/ratio_var': 5.329070518200751e-15, 'val/num_eos_tokens': 0, 'lr': 3.2190102890739835e-05, 'episode': 2912, 'epoch': 0.36}



                                                                         
 36%|███▌      | 729/2041 [1:03:16<1:53:01,  5.17s/it]

{'eps': 0, 'objective/kl': 47.73208236694336, 'objective/entropy': 0.0005049705505371094, 'objective/non_score_reward': -2.386604070663452, 'objective/rlhf_reward': -1.4007749557495117, 'objective/scores': 0.9858291149139404, 'policy/approxkl_avg': 3.367704606935945e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -2.1592626353594824e-07, 'loss/value_avg': 0.016538694500923157, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 9.156847954727709e-05, 'val/ratio': 0.9999999403953552, 'val/ratio_var': 4.736951712906159e-15, 'val/num_eos_tokens': 0, 'lr': 3.21656050955414e-05, 'episode': 2916, 'epoch': 0.36}



                                                                         
 36%|███▌      | 730/2041 [1:03:22<1:52:35,  5.15s/it]

{'eps': 0, 'objective/kl': 47.61002731323242, 'objective/entropy': 0.0008296966552734375, 'objective/non_score_reward': -2.3805015087127686, 'objective/rlhf_reward': -1.4487173557281494, 'objective/scores': 0.9317841529846191, 'policy/approxkl_avg': 1.278440677973669e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -3.7351304627009085e-07, 'loss/value_avg': 0.014407368376851082, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00014011815073899925, 'val/ratio': 0.9999997615814209, 'val/ratio_var': 3.256654318504852e-14, 'val/num_eos_tokens': 0, 'lr': 3.214110730034297e-05, 'episode': 2920, 'epoch': 0.36}



                                                                         
 36%|███▌      | 731/2041 [1:03:27<1:52:36,  5.16s/it]

{'eps': 0, 'objective/kl': 45.38700866699219, 'objective/entropy': 0.00048828125, 'objective/non_score_reward': -2.269350528717041, 'objective/rlhf_reward': -1.3918042182922363, 'objective/scores': 0.8775463104248047, 'policy/approxkl_avg': 2.6598428320978584e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.8837316417830152e-07, 'loss/value_avg': 0.015533867292106152, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 8.890988829080015e-05, 'val/ratio': 0.9999999403953552, 'val/ratio_var': 4.736951712906159e-15, 'val/num_eos_tokens': 0, 'lr': 3.211660950514454e-05, 'episode': 2924, 'epoch': 0.36}



                                                                         
 36%|███▌      | 732/2041 [1:03:32<1:52:17,  5.15s/it]

{'eps': 0, 'objective/kl': 48.94976806640625, 'objective/entropy': 0.0006093978881835938, 'objective/non_score_reward': -2.447488307952881, 'objective/rlhf_reward': -1.6284565925598145, 'objective/scores': 0.8190317153930664, 'policy/approxkl_avg': 6.778309467533883e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -3.424455314871011e-07, 'loss/value_avg': 0.020506519824266434, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00010775170812848955, 'val/ratio': 0.9999998807907104, 'val/ratio_var': 4.144832484095093e-15, 'val/num_eos_tokens': 0, 'lr': 3.209211170994611e-05, 'episode': 2928, 'epoch': 0.36}



                                                                         
 36%|███▌      | 733/2041 [1:03:37<1:52:17,  5.15s/it]

{'eps': 0, 'objective/kl': 46.145301818847656, 'objective/entropy': 0.000553131103515625, 'objective/non_score_reward': -2.307265043258667, 'objective/rlhf_reward': -1.5755958557128906, 'objective/scores': 0.7316692471504211, 'policy/approxkl_avg': 5.190983819007566e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -3.52567099071166e-07, 'loss/value_avg': 0.020120222121477127, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 9.947903163265437e-05, 'val/ratio': 0.9999998807907104, 'val/ratio_var': 2.4868995751603507e-14, 'val/num_eos_tokens': 0, 'lr': 3.2067613914747675e-05, 'episode': 2932, 'epoch': 0.36}



                                                                         
 36%|███▌      | 734/2041 [1:03:42<1:52:40,  5.17s/it]

{'eps': 0, 'objective/kl': 47.83095169067383, 'objective/entropy': 0.0005774497985839844, 'objective/non_score_reward': -2.391547679901123, 'objective/rlhf_reward': -1.728285789489746, 'objective/scores': 0.663261890411377, 'policy/approxkl_avg': 7.979529038394662e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -4.24823667799501e-07, 'loss/value_avg': 0.01998702995479107, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00010391001706011593, 'val/ratio': 0.9999998807907104, 'val/ratio_var': 2.1908400454581124e-14, 'val/num_eos_tokens': 0, 'lr': 3.204311611954924e-05, 'episode': 2936, 'epoch': 0.36}



                                                                         
 36%|███▌      | 735/2041 [1:03:47<1:52:42,  5.18s/it]

{'eps': 0, 'objective/kl': 47.49528121948242, 'objective/entropy': 0.0006041526794433594, 'objective/non_score_reward': -2.3747644424438477, 'objective/rlhf_reward': -1.7810288667678833, 'objective/scores': 0.5937355756759644, 'policy/approxkl_avg': 7.979529038394662e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -3.168605644532363e-07, 'loss/value_avg': 0.02029215730726719, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00010700720304157585, 'val/ratio': 0.9999998807907104, 'val/ratio_var': 2.4868995751603507e-14, 'val/num_eos_tokens': 0, 'lr': 3.201861832435081e-05, 'episode': 2940, 'epoch': 0.36}



                                                                         
 36%|███▌      | 736/2041 [1:03:53<1:53:00,  5.20s/it]

{'eps': 0, 'objective/kl': 45.83866882324219, 'objective/entropy': 0.0005292892456054688, 'objective/non_score_reward': -2.291933298110962, 'objective/rlhf_reward': -1.7631394863128662, 'objective/scores': 0.5287937521934509, 'policy/approxkl_avg': 4.4187713926259475e-13, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.4844931683910545e-07, 'loss/value_avg': 0.022279977798461914, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 9.537868027109653e-05, 'val/ratio': 0.9999998807907104, 'val/ratio_var': 2.4276877369825388e-14, 'val/num_eos_tokens': 0, 'lr': 3.199412052915238e-05, 'episode': 2944, 'epoch': 0.36}



                                                                         
 36%|███▌      | 737/2041 [1:03:58<1:52:36,  5.18s/it]

{'eps': 0, 'objective/kl': 47.931373596191406, 'objective/entropy': 0.0008997917175292969, 'objective/non_score_reward': -2.396568775177002, 'objective/rlhf_reward': -1.9239344596862793, 'objective/scores': 0.47263437509536743, 'policy/approxkl_avg': 3.0223536347240287e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -9.025042686516827e-07, 'loss/value_avg': 0.024579061195254326, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00014958965766709298, 'val/ratio': 0.9999997615814209, 'val/ratio_var': 5.447494194556375e-14, 'val/num_eos_tokens': 0, 'lr': 3.196962273395395e-05, 'episode': 2948, 'epoch': 0.36}



                                                                         
 36%|███▌      | 738/2041 [1:04:03<1:52:21,  5.17s/it]

{'eps': 0, 'objective/kl': 44.759788513183594, 'objective/entropy': 0.0005631446838378906, 'objective/non_score_reward': -2.2379894256591797, 'objective/rlhf_reward': -1.8420913219451904, 'objective/scores': 0.39589816331863403, 'policy/approxkl_avg': 1.1239981384872366e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -4.6980832735243894e-07, 'loss/value_avg': 0.02079899236559868, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00010151683090953156, 'val/ratio': 0.9999998807907104, 'val/ratio_var': 2.1908400454581124e-14, 'val/num_eos_tokens': 0, 'lr': 3.1945124938755515e-05, 'episode': 2952, 'epoch': 0.36}



                                                                         
 36%|███▌      | 739/2041 [1:04:08<1:51:40,  5.15s/it]

{'eps': 0, 'objective/kl': 43.583641052246094, 'objective/entropy': 0.0006508827209472656, 'objective/non_score_reward': -2.1791820526123047, 'objective/rlhf_reward': -1.867422342300415, 'objective/scores': 0.31175968050956726, 'policy/approxkl_avg': 1.2977460902771631e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -5.645571832246787e-07, 'loss/value_avg': 0.01879478618502617, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00011534960503922775, 'val/ratio': 0.9999998211860657, 'val/ratio_var': 4.55931577485625e-14, 'val/num_eos_tokens': 0, 'lr': 3.192062714355708e-05, 'episode': 2956, 'epoch': 0.36}



                                                                         
 36%|███▋      | 740/2041 [1:04:13<1:51:05,  5.12s/it]

{'eps': 0, 'objective/kl': 45.97199249267578, 'objective/entropy': 0.0007495880126953125, 'objective/non_score_reward': -2.2985997200012207, 'objective/rlhf_reward': -2.068706750869751, 'objective/scores': 0.22989293932914734, 'policy/approxkl_avg': 2.653408092204157e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -7.506811243729317e-07, 'loss/value_avg': 0.02419392392039299, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0001289147330680862, 'val/ratio': 0.9999997615814209, 'val/ratio_var': 2.9605946193960245e-14, 'val/num_eos_tokens': 0, 'lr': 3.189612934835865e-05, 'episode': 2960, 'epoch': 0.36}



                                                                         
 36%|███▋      | 741/2041 [1:04:18<1:51:00,  5.12s/it]

{'eps': 0, 'objective/kl': 49.588645935058594, 'objective/entropy': 0.0006184577941894531, 'objective/non_score_reward': -2.4794321060180664, 'objective/rlhf_reward': -2.3239893913269043, 'objective/scores': 0.15544281899929047, 'policy/approxkl_avg': 1.7482033615873194e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -7.737357918813359e-07, 'loss/value_avg': 0.031070321798324585, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0001087188720703125, 'val/ratio': 0.9999997615814209, 'val/ratio_var': 3.256654318504852e-14, 'val/num_eos_tokens': 0, 'lr': 3.187163155316022e-05, 'episode': 2964, 'epoch': 0.36}



                                                                         
 36%|███▋      | 742/2041 [1:04:23<1:51:22,  5.14s/it]

{'eps': 0, 'objective/kl': 47.579803466796875, 'objective/entropy': 0.0009627342224121094, 'objective/non_score_reward': -2.3789901733398438, 'objective/rlhf_reward': -2.3252899646759033, 'objective/scores': 0.05370016023516655, 'policy/approxkl_avg': 5.4569682106375694e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.030710564009496e-06, 'loss/value_avg': 0.027659593150019646, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00015923887258395553, 'val/ratio': 0.9999996423721313, 'val/ratio_var': 9.592326932761353e-14, 'val/num_eos_tokens': 0, 'lr': 3.184713375796178e-05, 'episode': 2968, 'epoch': 0.36}



                                                                         
 36%|███▋      | 743/2041 [1:04:29<1:51:33,  5.16s/it]

{'eps': 0, 'objective/kl': 48.08108901977539, 'objective/entropy': 0.0009813308715820312, 'objective/non_score_reward': -2.404054641723633, 'objective/rlhf_reward': -2.4279229640960693, 'objective/scores': -0.023868262767791748, 'policy/approxkl_avg': 6.7053780063164314e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.4760585145268124e-06, 'loss/value_avg': 0.02954496443271637, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0001629411126486957, 'val/ratio': 0.9999996423721313, 'val/ratio_var': 1.2256862191861728e-13, 'val/num_eos_tokens': 0, 'lr': 3.1822635962763355e-05, 'episode': 2972, 'epoch': 0.36}



                                                                         
 36%|███▋      | 744/2041 [1:04:34<1:51:10,  5.14s/it]

{'eps': 0, 'objective/kl': 48.30866241455078, 'objective/entropy': 0.0008344650268554688, 'objective/non_score_reward': -2.415433168411255, 'objective/rlhf_reward': -2.5256850719451904, 'objective/scores': -0.11025190353393555, 'policy/approxkl_avg': 6.615287310995921e-12, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.7262855180888437e-06, 'loss/value_avg': 0.03418976441025734, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00014268001541495323, 'val/ratio': 0.9999996423721313, 'val/ratio_var': 1.2020014161524123e-13, 'val/num_eos_tokens': 0, 'lr': 3.179813816756492e-05, 'episode': 2976, 'epoch': 0.36}



                                                                         
 37%|███▋      | 745/2041 [1:04:39<1:50:21,  5.11s/it]

{'eps': 0, 'objective/kl': 46.16596984863281, 'objective/entropy': 0.00121307373046875, 'objective/non_score_reward': -2.308298349380493, 'objective/rlhf_reward': -2.499985694885254, 'objective/scores': -0.1916874200105667, 'policy/approxkl_avg': 1.8307872579059747e-11, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -2.9003845156694297e-06, 'loss/value_avg': 0.02955969050526619, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00019707545288838446, 'val/ratio': 0.9999993443489075, 'val/ratio_var': 3.090860900556436e-13, 'val/num_eos_tokens': 0, 'lr': 3.1773640372366485e-05, 'episode': 2980, 'epoch': 0.37}



                                                                         
 37%|███▋      | 746/2041 [1:04:44<1:50:24,  5.12s/it]

{'eps': 0, 'objective/kl': 44.695343017578125, 'objective/entropy': 0.0010848045349121094, 'objective/non_score_reward': -2.234767198562622, 'objective/rlhf_reward': -2.49780535697937, 'objective/scores': -0.2630380690097809, 'policy/approxkl_avg': 1.842799350615376e-11, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -2.2765038920624647e-06, 'loss/value_avg': 0.03018638864159584, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00018026470206677914, 'val/ratio': 0.9999992847442627, 'val/ratio_var': 3.5467925796860145e-13, 'val/num_eos_tokens': 0, 'lr': 3.174914257716806e-05, 'episode': 2984, 'epoch': 0.37}



                                                                         
 37%|███▋      | 747/2041 [1:04:49<1:49:59,  5.10s/it]

{'eps': 0, 'objective/kl': 46.26646423339844, 'objective/entropy': 0.0014257431030273438, 'objective/non_score_reward': -2.3133232593536377, 'objective/rlhf_reward': -2.70601487159729, 'objective/scores': -0.3926917016506195, 'policy/approxkl_avg': 4.029018785267624e-11, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -4.50633615400875e-06, 'loss/value_avg': 0.030900174751877785, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00022762002481613308, 'val/ratio': 0.9999989867210388, 'val/ratio_var': 7.448856527252079e-13, 'val/num_eos_tokens': 0, 'lr': 3.172464478196963e-05, 'episode': 2988, 'epoch': 0.37}



                                                                         
 37%|███▋      | 748/2041 [1:04:54<1:50:00,  5.10s/it]

{'eps': 0, 'objective/kl': 48.318729400634766, 'objective/entropy': 0.0012722015380859375, 'objective/non_score_reward': -2.4159364700317383, 'objective/rlhf_reward': -2.891169548034668, 'objective/scores': -0.47523295879364014, 'policy/approxkl_avg': 4.077067503049925e-11, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -4.230805188853992e-06, 'loss/value_avg': 0.0386081337928772, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0002088614273816347, 'val/ratio': 0.9999989867210388, 'val/ratio_var': 7.448856527252079e-13, 'val/num_eos_tokens': 0, 'lr': 3.170014698677119e-05, 'episode': 2992, 'epoch': 0.37}



                                                                         
 37%|███▋      | 749/2041 [1:04:59<1:49:34,  5.09s/it]

{'eps': 0, 'objective/kl': 45.82851791381836, 'objective/entropy': 0.0013995170593261719, 'objective/non_score_reward': -2.291425943374634, 'objective/rlhf_reward': -2.8548150062561035, 'objective/scores': -0.5633890628814697, 'policy/approxkl_avg': 5.310033315830687e-11, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -4.890955096925609e-06, 'loss/value_avg': 0.03356435149908066, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00022661911498289555, 'val/ratio': 0.9999988675117493, 'val/ratio_var': 1.0190367430093494e-12, 'val/num_eos_tokens': 0, 'lr': 3.167564919157276e-05, 'episode': 2996, 'epoch': 0.37}



                                                                         
 37%|███▋      | 750/2041 [1:05:04<1:50:20,  5.13s/it]

{'eps': 0, 'objective/kl': 46.42811965942383, 'objective/entropy': 0.0015835762023925781, 'objective/non_score_reward': -2.321406126022339, 'objective/rlhf_reward': -2.998058795928955, 'objective/scores': -0.6766525506973267, 'policy/approxkl_avg': 7.837956672585022e-11, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -5.051775133324554e-06, 'loss/value_avg': 0.038332030177116394, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0002498739049769938, 'val/ratio': 0.9999985694885254, 'val/ratio_var': 1.576812790581028e-12, 'val/num_eos_tokens': 0, 'lr': 3.165115139637433e-05, 'episode': 3000, 'epoch': 0.37}



                                                                         
 37%|███▋      | 751/2041 [1:05:09<1:49:55,  5.11s/it]

{'eps': 0, 'objective/kl': 48.594337463378906, 'objective/entropy': 0.0013971328735351562, 'objective/non_score_reward': -2.4297170639038086, 'objective/rlhf_reward': -3.203558921813965, 'objective/scores': -0.7738417983055115, 'policy/approxkl_avg': 7.037430360679053e-11, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -5.996451818646165e-06, 'loss/value_avg': 0.04336213693022728, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0002257666492369026, 'val/ratio': 0.9999986886978149, 'val/ratio_var': 1.2452261444195756e-12, 'val/num_eos_tokens': 0, 'lr': 3.162665360117589e-05, 'episode': 3004, 'epoch': 0.37}



                                                                         
 37%|███▋      | 752/2041 [1:05:15<1:49:42,  5.11s/it]

{'eps': 0, 'objective/kl': 45.05213165283203, 'objective/entropy': 0.0014786720275878906, 'objective/non_score_reward': -2.2526068687438965, 'objective/rlhf_reward': -3.102743148803711, 'objective/scores': -0.8501361608505249, 'policy/approxkl_avg': 8.809228346784437e-11, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -5.25757968716789e-06, 'loss/value_avg': 0.037161942571401596, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00023703306214883924, 'val/ratio': 0.9999985694885254, 'val/ratio_var': 1.5170087408478139e-12, 'val/num_eos_tokens': 0, 'lr': 3.160215580597746e-05, 'episode': 3008, 'epoch': 0.37}



                                                                         
 37%|███▋      | 753/2041 [1:05:20<1:49:54,  5.12s/it]

{'eps': 0, 'objective/kl': 46.3149528503418, 'objective/entropy': 0.0018138885498046875, 'objective/non_score_reward': -2.3157477378845215, 'objective/rlhf_reward': -3.2522130012512207, 'objective/scores': -0.9364651441574097, 'policy/approxkl_avg': 1.3853276858988295e-10, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -6.080798357288586e-06, 'loss/value_avg': 0.043745823204517365, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0002801238442771137, 'val/ratio': 0.9999982118606567, 'val/ratio_var': 2.301566272636113e-12, 'val/num_eos_tokens': 0, 'lr': 3.157765801077903e-05, 'episode': 3012, 'epoch': 0.37}



                                                                         
 37%|███▋      | 754/2041 [1:05:25<1:49:45,  5.12s/it]

{'eps': 0, 'objective/kl': 47.81233596801758, 'objective/entropy': 0.0028815269470214844, 'objective/non_score_reward': -2.3906166553497314, 'objective/rlhf_reward': -3.467799425125122, 'objective/scores': -1.0771827697753906, 'policy/approxkl_avg': 5.01155228427308e-10, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -1.4672864381282125e-05, 'loss/value_avg': 0.04354707524180412, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0004167264560237527, 'val/ratio': 0.9999967217445374, 'val/ratio_var': 8.569145393266808e-12, 'val/num_eos_tokens': 0, 'lr': 3.1553160215580604e-05, 'episode': 3016, 'epoch': 0.37}



                                                                         
 37%|███▋      | 755/2041 [1:05:30<1:49:36,  5.11s/it]

{'eps': 0, 'objective/kl': 45.35791778564453, 'objective/entropy': 0.003132343292236328, 'objective/non_score_reward': -2.2678959369659424, 'objective/rlhf_reward': -3.4520952701568604, 'objective/scores': -1.184199333190918, 'policy/approxkl_avg': 1.0950702566958626e-09, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -2.365517138969153e-05, 'loss/value_avg': 0.053345561027526855, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.00045729358680546284, 'val/ratio': 0.9999950528144836, 'val/ratio_var': 1.999467258428922e-11, 'val/num_eos_tokens': 0, 'lr': 3.1528662420382165e-05, 'episode': 3020, 'epoch': 0.37}



                                                                         
 37%|███▋      | 756/2041 [1:05:35<1:49:51,  5.13s/it]

{'eps': 0, 'objective/kl': 49.15114974975586, 'objective/entropy': 0.0049266815185546875, 'objective/non_score_reward': -2.4575576782226562, 'objective/rlhf_reward': -3.7140560150146484, 'objective/scores': -1.2564983367919922, 'policy/approxkl_avg': 5.9549791764368365e-09, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -5.811115261167288e-05, 'loss/value_avg': 0.05989709123969078, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0006839531706646085, 'val/ratio': 0.999988317489624, 'val/ratio_var': 1.1711698277849791e-10, 'val/num_eos_tokens': 0, 'lr': 3.150416462518373e-05, 'episode': 3024, 'epoch': 0.37}



                                                                         
 37%|███▋      | 757/2041 [1:05:40<1:49:56,  5.14s/it]

{'eps': 0, 'objective/kl': 44.25822830200195, 'objective/entropy': 0.0052127838134765625, 'objective/non_score_reward': -2.212911605834961, 'objective/rlhf_reward': -3.622138023376465, 'objective/scores': -1.409226417541504, 'policy/approxkl_avg': 1.0598510513659676e-08, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -6.578890315722674e-05, 'loss/value_avg': 0.04298456385731697, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0007418259046971798, 'val/ratio': 0.9999841451644897, 'val/ratio_var': 2.0540309730865403e-10, 'val/num_eos_tokens': 0, 'lr': 3.147966682998531e-05, 'episode': 3028, 'epoch': 0.37}



                                                                         
 37%|███▋      | 758/2041 [1:05:45<1:50:12,  5.15s/it]

{'eps': 0, 'objective/kl': 45.46543884277344, 'objective/entropy': 0.007869243621826172, 'objective/non_score_reward': -2.2732720375061035, 'objective/rlhf_reward': -3.72476863861084, 'objective/scores': -1.4514964818954468, 'policy/approxkl_avg': 4.39122764817057e-08, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.00012430886272341013, 'loss/value_avg': 0.05453304201364517, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0010603914270177484, 'val/ratio': 0.9999688267707825, 'val/ratio_var': 8.26798185471489e-10, 'val/num_eos_tokens': 0, 'lr': 3.145516903478687e-05, 'episode': 3032, 'epoch': 0.37}



                                                                         
 37%|███▋      | 759/2041 [1:05:51<1:50:27,  5.17s/it]

{'eps': 0, 'objective/kl': 44.60757827758789, 'objective/entropy': 0.011420249938964844, 'objective/non_score_reward': -2.2303788661956787, 'objective/rlhf_reward': -3.8394455909729004, 'objective/scores': -1.6090666055679321, 'policy/approxkl_avg': 1.9448289378942718e-07, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0002749101258814335, 'loss/value_avg': 0.05046145245432854, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0015233282465487719, 'val/ratio': 0.999934196472168, 'val/ratio_var': 3.889633504172707e-09, 'val/num_eos_tokens': 0, 'lr': 3.143067123958844e-05, 'episode': 3036, 'epoch': 0.37}



                                                                         
 37%|███▋      | 760/2041 [1:05:56<1:50:18,  5.17s/it]

{'eps': 0, 'objective/kl': 45.86531448364258, 'objective/entropy': 0.02747344970703125, 'objective/non_score_reward': -2.2932658195495605, 'objective/rlhf_reward': -4.0631537437438965, 'objective/scores': -1.7698878049850464, 'policy/approxkl_avg': 5.279558990878286e-06, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0013768835924565792, 'loss/value_avg': 0.05821498483419418, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0034807741176337004, 'val/ratio': 0.9996769428253174, 'val/ratio_var': 1.1285668932714543e-07, 'val/num_eos_tokens': 0, 'lr': 3.1406173444390005e-05, 'episode': 3040, 'epoch': 0.37}



                                                                         
 37%|███▋      | 761/2041 [1:06:01<1:50:30,  5.18s/it]

{'eps': 0, 'objective/kl': 41.872039794921875, 'objective/entropy': 1.9224472045898438, 'objective/non_score_reward': -2.093601703643799, 'objective/rlhf_reward': -3.712358236312866, 'objective/scores': -1.6187565326690674, 'policy/approxkl_avg': 0.0013026209780946374, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.0016453806310892105, 'loss/value_avg': 0.04974218085408211, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.032664284110069275, 'val/ratio': 1.0035371780395508, 'val/ratio_var': 1.0914423910435289e-05, 'val/num_eos_tokens': 0, 'lr': 3.138167564919157e-05, 'episode': 3044, 'epoch': 0.37}



                                                                         
 37%|███▋      | 762/2041 [1:06:06<1:50:02,  5.16s/it]

{'eps': 0, 'objective/kl': 41.770469665527344, 'objective/entropy': 3.133347511291504, 'objective/non_score_reward': -2.0885236263275146, 'objective/rlhf_reward': -3.603318452835083, 'objective/scores': -1.5147948265075684, 'policy/approxkl_avg': 0.008650514297187328, 'policy/clipfrac_avg': 0.015330188907682896, 'loss/policy_avg': -0.006300678011029959, 'loss/value_avg': 0.036113440990448, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.07624562829732895, 'val/ratio': 1.0016028881072998, 'val/ratio_var': 1.6463078509332263e-06, 'val/num_eos_tokens': 0, 'lr': 3.135717785399314e-05, 'episode': 3048, 'epoch': 0.37}



                                                                         
 37%|███▋      | 763/2041 [1:06:11<1:50:03,  5.17s/it]

{'eps': 0, 'objective/kl': 36.71460723876953, 'objective/entropy': 4.131661891937256, 'objective/non_score_reward': -1.8357303142547607, 'objective/rlhf_reward': -2.8418643474578857, 'objective/scores': -1.006134033203125, 'policy/approxkl_avg': 0.000713986053597182, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.003549008397385478, 'loss/value_avg': 0.01766935922205448, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13499803841114044, 'val/ratio': 0.9971287846565247, 'val/ratio_var': 1.0680550076358486e-05, 'val/num_eos_tokens': 0, 'lr': 3.133268005879471e-05, 'episode': 3052, 'epoch': 0.37}



                                                                         
 37%|███▋      | 764/2041 [1:06:17<1:50:18,  5.18s/it]

{'eps': 0, 'objective/kl': 35.67091751098633, 'objective/entropy': 5.20251989364624, 'objective/non_score_reward': -1.783545970916748, 'objective/rlhf_reward': -3.0370657444000244, 'objective/scores': -1.2535197734832764, 'policy/approxkl_avg': 0.004200374707579613, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.00565175898373127, 'loss/value_avg': 0.0309763103723526, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1068950816988945, 'val/ratio': 0.9944601058959961, 'val/ratio_var': 2.116710129485e-05, 'val/num_eos_tokens': 0, 'lr': 3.130818226359628e-05, 'episode': 3056, 'epoch': 0.37}



                                                                         
 37%|███▋      | 765/2041 [1:06:22<1:50:12,  5.18s/it]

{'eps': 0, 'objective/kl': 44.323280334472656, 'objective/entropy': 5.570587158203125, 'objective/non_score_reward': -2.2161643505096436, 'objective/rlhf_reward': -3.863093137741089, 'objective/scores': -1.6469287872314453, 'policy/approxkl_avg': 0.0005981653230264783, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.004424832761287689, 'loss/value_avg': 0.0681089237332344, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.08519400656223297, 'val/ratio': 1.000594139099121, 'val/ratio_var': 5.298014684740338e-07, 'val/num_eos_tokens': 0, 'lr': 3.1283684468397845e-05, 'episode': 3060, 'epoch': 0.37}



                                                                         
 38%|███▊      | 766/2041 [1:06:27<1:50:02,  5.18s/it]

{'eps': 0, 'objective/kl': 41.27803039550781, 'objective/entropy': 6.352334022521973, 'objective/non_score_reward': -2.063901424407959, 'objective/rlhf_reward': -3.4848756790161133, 'objective/scores': -1.4209742546081543, 'policy/approxkl_avg': 0.0009210490970872343, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.0037192523013800383, 'loss/value_avg': 0.06567277759313583, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13377077877521515, 'val/ratio': 0.9991397261619568, 'val/ratio_var': 1.0524794333832688e-06, 'val/num_eos_tokens': 0, 'lr': 3.1259186673199413e-05, 'episode': 3064, 'epoch': 0.38}



                                                                         
 38%|███▊      | 767/2041 [1:06:32<1:49:40,  5.17s/it]

{'eps': 0, 'objective/kl': 33.67325210571289, 'objective/entropy': 7.073668956756592, 'objective/non_score_reward': -1.6836626529693604, 'objective/rlhf_reward': -2.765296459197998, 'objective/scores': -1.0816338062286377, 'policy/approxkl_avg': 0.001051546772941947, 'policy/clipfrac_avg': 0.008254717104136944, 'loss/policy_avg': -0.0047739846631884575, 'loss/value_avg': 0.012758105993270874, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16012905538082123, 'val/ratio': 0.9983845949172974, 'val/ratio_var': 3.902885964635061e-06, 'val/num_eos_tokens': 0, 'lr': 3.123468887800098e-05, 'episode': 3068, 'epoch': 0.38}



                                                                         
 38%|███▊      | 768/2041 [1:06:37<1:50:07,  5.19s/it]

{'eps': 0, 'objective/kl': 51.426109313964844, 'objective/entropy': 0.8123202323913574, 'objective/non_score_reward': -2.571305274963379, 'objective/rlhf_reward': -4.956717491149902, 'objective/scores': -2.3854119777679443, 'policy/approxkl_avg': 0.004289763513952494, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.01051376573741436, 'loss/value_avg': 0.13463766872882843, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.019513752311468124, 'val/ratio': 0.9934746026992798, 'val/ratio_var': 3.422122972551733e-05, 'val/num_eos_tokens': 0, 'lr': 3.121019108280255e-05, 'episode': 3072, 'epoch': 0.38}



                                                                         
 38%|███▊      | 769/2041 [1:06:42<1:49:43,  5.18s/it]

{'eps': 0, 'objective/kl': 41.854949951171875, 'objective/entropy': 4.70997953414917, 'objective/non_score_reward': -2.092747449874878, 'objective/rlhf_reward': -3.8972620964050293, 'objective/scores': -1.804514765739441, 'policy/approxkl_avg': 0.027592269703745842, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.011378415860235691, 'loss/value_avg': 0.06262495368719101, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.07255580276250839, 'val/ratio': 0.9926134347915649, 'val/ratio_var': 3.2303669286193326e-05, 'val/num_eos_tokens': 0, 'lr': 3.118569328760412e-05, 'episode': 3076, 'epoch': 0.38}



                                                                         
 38%|███▊      | 770/2041 [1:06:48<1:49:47,  5.18s/it]

{'eps': 0, 'objective/kl': 33.94172668457031, 'objective/entropy': 3.310338020324707, 'objective/non_score_reward': -1.697086215019226, 'objective/rlhf_reward': -3.081714153289795, 'objective/scores': -1.3846278190612793, 'policy/approxkl_avg': 0.00014918237866368145, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.002145820064470172, 'loss/value_avg': 0.015798427164554596, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14004108309745789, 'val/ratio': 0.9983845949172974, 'val/ratio_var': 5.365833203541115e-06, 'val/num_eos_tokens': 0, 'lr': 3.1161195492405686e-05, 'episode': 3080, 'epoch': 0.38}



                                                                         
 38%|███▊      | 771/2041 [1:06:53<1:49:56,  5.19s/it]

{'eps': 0, 'objective/kl': 35.74781799316406, 'objective/entropy': 6.43714714050293, 'objective/non_score_reward': -1.787390947341919, 'objective/rlhf_reward': -3.116821765899658, 'objective/scores': -1.3294306993484497, 'policy/approxkl_avg': 0.015003201551735401, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.005269842222332954, 'loss/value_avg': 0.010196187533438206, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13382336497306824, 'val/ratio': 0.9903714060783386, 'val/ratio_var': 5.821784361614846e-05, 'val/num_eos_tokens': 0, 'lr': 3.1136697697207254e-05, 'episode': 3084, 'epoch': 0.38}



                                                                         
 38%|███▊      | 772/2041 [1:06:58<1:50:06,  5.21s/it]

{'eps': 0, 'objective/kl': 35.316890716552734, 'objective/entropy': 6.414262294769287, 'objective/non_score_reward': -1.7658445835113525, 'objective/rlhf_reward': -3.060555934906006, 'objective/scores': -1.2947114706039429, 'policy/approxkl_avg': 0.000891436415258795, 'policy/clipfrac_avg': 0.02004716917872429, 'loss/policy_avg': -0.005439882166683674, 'loss/value_avg': 0.014326701872050762, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1378623992204666, 'val/ratio': 1.000198245048523, 'val/ratio_var': 3.1123855137593637e-07, 'val/num_eos_tokens': 0, 'lr': 3.111219990200882e-05, 'episode': 3088, 'epoch': 0.38}



                                                                         
 38%|███▊      | 773/2041 [1:07:03<1:49:15,  5.17s/it]

{'eps': 0, 'objective/kl': 35.62492370605469, 'objective/entropy': 6.892995834350586, 'objective/non_score_reward': -1.7812460660934448, 'objective/rlhf_reward': -3.2345705032348633, 'objective/scores': -1.453324317932129, 'policy/approxkl_avg': 0.0012157338205724955, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.0002772414591163397, 'loss/value_avg': 0.008324580267071724, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13271157443523407, 'val/ratio': 0.9991223812103271, 'val/ratio_var': 3.9015267816466803e-07, 'val/num_eos_tokens': 0, 'lr': 3.108770210681039e-05, 'episode': 3092, 'epoch': 0.38}



                                                                         
 38%|███▊      | 774/2041 [1:07:08<1:49:15,  5.17s/it]

{'eps': 0, 'objective/kl': 42.69721984863281, 'objective/entropy': 8.136210441589355, 'objective/non_score_reward': -2.1348609924316406, 'objective/rlhf_reward': -3.6059770584106445, 'objective/scores': -1.471116065979004, 'policy/approxkl_avg': 0.0018242717487737536, 'policy/clipfrac_avg': 0.017688680440187454, 'loss/policy_avg': -0.0035460242070257664, 'loss/value_avg': 0.028852807357907295, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16733910143375397, 'val/ratio': 1.0005888938903809, 'val/ratio_var': 3.123045360098331e-07, 'val/num_eos_tokens': 0, 'lr': 3.106320431161195e-05, 'episode': 3096, 'epoch': 0.38}



                                                                         
 38%|███▊      | 775/2041 [1:07:14<1:49:26,  5.19s/it]

{'eps': 0, 'objective/kl': 35.13053894042969, 'objective/entropy': 14.184463500976562, 'objective/non_score_reward': -1.7565269470214844, 'objective/rlhf_reward': -3.3638947010040283, 'objective/scores': -1.607367753982544, 'policy/approxkl_avg': 0.009248318150639534, 'policy/clipfrac_avg': 0.022405659779906273, 'loss/policy_avg': -0.01113019697368145, 'loss/value_avg': 0.016134042292833328, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.18483050167560577, 'val/ratio': 0.9901663661003113, 'val/ratio_var': 9.129889076575637e-05, 'val/num_eos_tokens': 0, 'lr': 3.1038706516413526e-05, 'episode': 3100, 'epoch': 0.38}



                                                                         
 38%|███▊      | 776/2041 [1:07:19<1:48:42,  5.16s/it]

{'eps': 0, 'objective/kl': 43.70615005493164, 'objective/entropy': 17.00438690185547, 'objective/non_score_reward': -2.185307502746582, 'objective/rlhf_reward': -3.8365583419799805, 'objective/scores': -1.6512507200241089, 'policy/approxkl_avg': 0.1888217329978943, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.02061537280678749, 'loss/value_avg': 0.10618549585342407, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.32572686672210693, 'val/ratio': 0.9734976291656494, 'val/ratio_var': 0.00033196969889104366, 'val/num_eos_tokens': 0, 'lr': 3.1014208721215094e-05, 'episode': 3104, 'epoch': 0.38}



                                                                         
 38%|███▊      | 777/2041 [1:07:24<1:49:00,  5.17s/it]

{'eps': 0, 'objective/kl': 59.152530670166016, 'objective/entropy': 42.57718276977539, 'objective/non_score_reward': -2.9576265811920166, 'objective/rlhf_reward': -3.726118564605713, 'objective/scores': -0.7684918642044067, 'policy/approxkl_avg': 0.39127224683761597, 'policy/clipfrac_avg': 0.1745283007621765, 'loss/policy_avg': -0.05028736963868141, 'loss/value_avg': 0.3600994348526001, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8671016693115234, 'val/ratio': 1.0084164142608643, 'val/ratio_var': 0.0005939971306361258, 'val/num_eos_tokens': 0, 'lr': 3.0989710926016655e-05, 'episode': 3108, 'epoch': 0.38}



                                                                         
 38%|███▊      | 778/2041 [1:07:29<1:48:43,  5.17s/it]

{'eps': 0, 'objective/kl': 66.92967224121094, 'objective/entropy': 21.388671875, 'objective/non_score_reward': -3.3464837074279785, 'objective/rlhf_reward': -4.977664947509766, 'objective/scores': -1.631181240081787, 'policy/approxkl_avg': 0.2247437685728073, 'policy/clipfrac_avg': 0.061320751905441284, 'loss/policy_avg': -0.036106646060943604, 'loss/value_avg': 0.22279734909534454, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.39732611179351807, 'val/ratio': 0.962883710861206, 'val/ratio_var': 0.0007299924618564546, 'val/num_eos_tokens': 0, 'lr': 3.096521313081823e-05, 'episode': 3112, 'epoch': 0.38}



                                                                         
 38%|███▊      | 779/2041 [1:07:34<1:48:19,  5.15s/it]

{'eps': 0, 'objective/kl': 76.27560424804688, 'objective/entropy': 39.4737548828125, 'objective/non_score_reward': -3.813779830932617, 'objective/rlhf_reward': -4.708642482757568, 'objective/scores': -0.8948628306388855, 'policy/approxkl_avg': 0.830345094203949, 'policy/clipfrac_avg': 0.13325470685958862, 'loss/policy_avg': -0.03978070616722107, 'loss/value_avg': 0.4987400770187378, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6495307087898254, 'val/ratio': 1.0895636081695557, 'val/ratio_var': 0.008998502977192402, 'val/num_eos_tokens': 0, 'lr': 3.09407153356198e-05, 'episode': 3116, 'epoch': 0.38}



                                                                         
 38%|███▊      | 780/2041 [1:07:39<1:48:02,  5.14s/it]

{'eps': 0, 'objective/kl': 84.89401245117188, 'objective/entropy': 75.5639877319336, 'objective/non_score_reward': -4.244700908660889, 'objective/rlhf_reward': -3.7997424602508545, 'objective/scores': 0.44495850801467896, 'policy/approxkl_avg': 0.689315676689148, 'policy/clipfrac_avg': 0.3266509473323822, 'loss/policy_avg': -0.06707976013422012, 'loss/value_avg': 0.4496641159057617, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4760003089904785, 'val/ratio': 1.096989393234253, 'val/ratio_var': 0.012251739390194416, 'val/num_eos_tokens': 0, 'lr': 3.0916217540421366e-05, 'episode': 3120, 'epoch': 0.38}



                                                                         
 38%|███▊      | 781/2041 [1:07:44<1:47:52,  5.14s/it]

{'eps': 0, 'objective/kl': 85.57218933105469, 'objective/entropy': 50.88231658935547, 'objective/non_score_reward': -4.278609275817871, 'objective/rlhf_reward': -4.477572441101074, 'objective/scores': -0.1989632397890091, 'policy/approxkl_avg': 0.2340967357158661, 'policy/clipfrac_avg': 0.17924529314041138, 'loss/policy_avg': -0.04261203110218048, 'loss/value_avg': 0.5348557233810425, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8754333257675171, 'val/ratio': 0.9885272979736328, 'val/ratio_var': 0.0003697862848639488, 'val/num_eos_tokens': 0, 'lr': 3.089171974522293e-05, 'episode': 3124, 'epoch': 0.38}



                                                                         
 38%|███▊      | 782/2041 [1:07:49<1:47:58,  5.15s/it]

{'eps': 0, 'objective/kl': 52.41498565673828, 'objective/entropy': 49.73700714111328, 'objective/non_score_reward': -2.6207492351531982, 'objective/rlhf_reward': -3.4588451385498047, 'objective/scores': -0.8380958437919617, 'policy/approxkl_avg': 0.18910299241542816, 'policy/clipfrac_avg': 0.22523584961891174, 'loss/policy_avg': -0.05373597890138626, 'loss/value_avg': 0.44323334097862244, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9327592253684998, 'val/ratio': 1.5619210004806519, 'val/ratio_var': 0.5282748341560364, 'val/num_eos_tokens': 0, 'lr': 3.08672219500245e-05, 'episode': 3128, 'epoch': 0.38}



                                                                         
 38%|███▊      | 783/2041 [1:07:55<1:48:03,  5.15s/it]

{'eps': 0, 'objective/kl': 73.01811218261719, 'objective/entropy': 66.60391235351562, 'objective/non_score_reward': -3.6509058475494385, 'objective/rlhf_reward': -3.7312068939208984, 'objective/scores': -0.08030098676681519, 'policy/approxkl_avg': 0.24879315495491028, 'policy/clipfrac_avg': 0.2653301954269409, 'loss/policy_avg': -0.05226815119385719, 'loss/value_avg': 0.631435751914978, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2374597787857056, 'val/ratio': 1.0213478803634644, 'val/ratio_var': 0.0005310048582032323, 'val/num_eos_tokens': 0, 'lr': 3.084272415482607e-05, 'episode': 3132, 'epoch': 0.38}



                                                                         
 38%|███▊      | 784/2041 [1:08:00<1:48:06,  5.16s/it]

{'eps': 0, 'objective/kl': 63.242637634277344, 'objective/entropy': 54.09663772583008, 'objective/non_score_reward': -3.1621317863464355, 'objective/rlhf_reward': -4.903270721435547, 'objective/scores': -1.7411390542984009, 'policy/approxkl_avg': 0.2932193875312805, 'policy/clipfrac_avg': 0.24646225571632385, 'loss/policy_avg': -0.05240402743220329, 'loss/value_avg': 0.891262948513031, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8888623714447021, 'val/ratio': 1.1374695301055908, 'val/ratio_var': 0.018050577491521835, 'val/num_eos_tokens': 0, 'lr': 3.081822635962763e-05, 'episode': 3136, 'epoch': 0.38}



                                                                         
 38%|███▊      | 785/2041 [1:08:05<1:47:57,  5.16s/it]

{'eps': 0, 'objective/kl': 68.6426010131836, 'objective/entropy': 64.5417251586914, 'objective/non_score_reward': -3.4321303367614746, 'objective/rlhf_reward': -3.769230842590332, 'objective/scores': -0.3371005058288574, 'policy/approxkl_avg': 0.37208470702171326, 'policy/clipfrac_avg': 0.2794811427593231, 'loss/policy_avg': -0.06024770438671112, 'loss/value_avg': 0.5793951749801636, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.25730299949646, 'val/ratio': 0.9973323345184326, 'val/ratio_var': 0.00030101448646746576, 'val/num_eos_tokens': 0, 'lr': 3.07937285644292e-05, 'episode': 3140, 'epoch': 0.38}



                                                                         
 39%|███▊      | 786/2041 [1:08:10<1:48:05,  5.17s/it]

{'eps': 0, 'objective/kl': 72.7041015625, 'objective/entropy': 54.44043731689453, 'objective/non_score_reward': -3.6352052688598633, 'objective/rlhf_reward': -3.947373628616333, 'objective/scores': -0.3121684491634369, 'policy/approxkl_avg': 0.3876117467880249, 'policy/clipfrac_avg': 0.2158018797636032, 'loss/policy_avg': -0.05562640726566315, 'loss/value_avg': 0.38437867164611816, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9054545760154724, 'val/ratio': 1.0038220882415771, 'val/ratio_var': 0.0006521178293041885, 'val/num_eos_tokens': 0, 'lr': 3.0769230769230774e-05, 'episode': 3144, 'epoch': 0.39}



                                                                         
 39%|███▊      | 787/2041 [1:08:15<1:47:33,  5.15s/it]

{'eps': 0, 'objective/kl': 80.2230224609375, 'objective/entropy': 75.78491973876953, 'objective/non_score_reward': -4.011151313781738, 'objective/rlhf_reward': -4.9217705726623535, 'objective/scores': -0.9106194376945496, 'policy/approxkl_avg': 0.25367844104766846, 'policy/clipfrac_avg': 0.3207547068595886, 'loss/policy_avg': -0.06952871382236481, 'loss/value_avg': 0.5797519683837891, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3687148094177246, 'val/ratio': 1.0630537271499634, 'val/ratio_var': 0.00812696572393179, 'val/num_eos_tokens': 0, 'lr': 3.0744732974032336e-05, 'episode': 3148, 'epoch': 0.39}



                                                                         
 39%|███▊      | 788/2041 [1:08:20<1:47:43,  5.16s/it]

{'eps': 0, 'objective/kl': 76.57804870605469, 'objective/entropy': 65.77455139160156, 'objective/non_score_reward': -3.828902244567871, 'objective/rlhf_reward': -3.696101665496826, 'objective/scores': 0.13280054926872253, 'policy/approxkl_avg': 0.4386020302772522, 'policy/clipfrac_avg': 0.2853773534297943, 'loss/policy_avg': -0.060982443392276764, 'loss/value_avg': 0.5418086647987366, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1577669382095337, 'val/ratio': 1.1407400369644165, 'val/ratio_var': 0.027958745136857033, 'val/num_eos_tokens': 0, 'lr': 3.0720235178833904e-05, 'episode': 3152, 'epoch': 0.39}



                                                                         
 39%|███▊      | 789/2041 [1:08:26<1:47:31,  5.15s/it]

{'eps': 0, 'objective/kl': 53.27373504638672, 'objective/entropy': 43.4315299987793, 'objective/non_score_reward': -2.663686752319336, 'objective/rlhf_reward': -3.4493842124938965, 'objective/scores': -0.7856974005699158, 'policy/approxkl_avg': 0.13400320708751678, 'policy/clipfrac_avg': 0.19575472176074982, 'loss/policy_avg': -0.04051150381565094, 'loss/value_avg': 0.4659181237220764, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8562411069869995, 'val/ratio': 1.129997968673706, 'val/ratio_var': 0.025238102301955223, 'val/num_eos_tokens': 0, 'lr': 3.069573738363548e-05, 'episode': 3156, 'epoch': 0.39}



                                                                         
 39%|███▊      | 790/2041 [1:08:31<1:47:46,  5.17s/it]

{'eps': 0, 'objective/kl': 65.68070220947266, 'objective/entropy': 48.80928039550781, 'objective/non_score_reward': -3.2840352058410645, 'objective/rlhf_reward': -4.128307819366455, 'objective/scores': -0.844272792339325, 'policy/approxkl_avg': 0.151193767786026, 'policy/clipfrac_avg': 0.19693396985530853, 'loss/policy_avg': -0.04677519202232361, 'loss/value_avg': 0.5212026238441467, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9867188930511475, 'val/ratio': 1.0331509113311768, 'val/ratio_var': 0.0012114773271605372, 'val/num_eos_tokens': 0, 'lr': 3.067123958843704e-05, 'episode': 3160, 'epoch': 0.39}



                                                                         
 39%|███▉      | 791/2041 [1:08:36<1:47:44,  5.17s/it]

{'eps': 0, 'objective/kl': 61.032562255859375, 'objective/entropy': 43.113521575927734, 'objective/non_score_reward': -3.0516281127929688, 'objective/rlhf_reward': -3.914557933807373, 'objective/scores': -0.8629298210144043, 'policy/approxkl_avg': 0.14734002947807312, 'policy/clipfrac_avg': 0.23113207519054413, 'loss/policy_avg': -0.05177155137062073, 'loss/value_avg': 0.3455878496170044, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0196874141693115, 'val/ratio': 0.988861083984375, 'val/ratio_var': 0.0003124656213913113, 'val/num_eos_tokens': 0, 'lr': 3.064674179323861e-05, 'episode': 3164, 'epoch': 0.39}



                                                                         
 39%|███▉      | 792/2041 [1:08:41<1:47:53,  5.18s/it]

{'eps': 0, 'objective/kl': 80.55644989013672, 'objective/entropy': 86.45478820800781, 'objective/non_score_reward': -4.027822494506836, 'objective/rlhf_reward': -4.7150492668151855, 'objective/scores': -0.6872267127037048, 'policy/approxkl_avg': 0.2493768334388733, 'policy/clipfrac_avg': 0.2629716992378235, 'loss/policy_avg': -0.05482449755072594, 'loss/value_avg': 0.5520665049552917, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5158876180648804, 'val/ratio': 1.1666476726531982, 'val/ratio_var': 0.026923058554530144, 'val/num_eos_tokens': 0, 'lr': 3.0622243998040176e-05, 'episode': 3168, 'epoch': 0.39}



                                                                         
 39%|███▉      | 793/2041 [1:08:46<1:47:28,  5.17s/it]

{'eps': 0, 'objective/kl': 68.77751159667969, 'objective/entropy': 64.28856658935547, 'objective/non_score_reward': -3.438875198364258, 'objective/rlhf_reward': -4.415229320526123, 'objective/scores': -0.9763541221618652, 'policy/approxkl_avg': 0.2530648708343506, 'policy/clipfrac_avg': 0.26061320304870605, 'loss/policy_avg': -0.061255306005477905, 'loss/value_avg': 0.4780775010585785, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2168917655944824, 'val/ratio': 1.0309009552001953, 'val/ratio_var': 0.002447314327582717, 'val/num_eos_tokens': 0, 'lr': 3.059774620284175e-05, 'episode': 3172, 'epoch': 0.39}



                                                                         
 39%|███▉      | 794/2041 [1:08:51<1:47:11,  5.16s/it]

{'eps': 0, 'objective/kl': 94.88341522216797, 'objective/entropy': 90.88058471679688, 'objective/non_score_reward': -4.744171142578125, 'objective/rlhf_reward': -4.900207996368408, 'objective/scores': -0.1560368537902832, 'policy/approxkl_avg': 0.40076255798339844, 'policy/clipfrac_avg': 0.3125, 'loss/policy_avg': -0.06826721876859665, 'loss/value_avg': 0.7121110558509827, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5148355960845947, 'val/ratio': 1.033568024635315, 'val/ratio_var': 0.0028135711327195168, 'val/num_eos_tokens': 0, 'lr': 3.057324840764331e-05, 'episode': 3176, 'epoch': 0.39}



                                                                         
 39%|███▉      | 795/2041 [1:08:57<1:47:03,  5.16s/it]

{'eps': 0, 'objective/kl': 71.62744140625, 'objective/entropy': 86.49563598632812, 'objective/non_score_reward': -3.5813722610473633, 'objective/rlhf_reward': -4.057734966278076, 'objective/scores': -0.47636255621910095, 'policy/approxkl_avg': 0.36155304312705994, 'policy/clipfrac_avg': 0.31721699237823486, 'loss/policy_avg': -0.06907367706298828, 'loss/value_avg': 0.535418689250946, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5479812622070312, 'val/ratio': 1.1748765707015991, 'val/ratio_var': 0.020514890551567078, 'val/num_eos_tokens': 0, 'lr': 3.054875061244488e-05, 'episode': 3180, 'epoch': 0.39}



                                                                         
 39%|███▉      | 796/2041 [1:09:02<1:46:39,  5.14s/it]

{'eps': 0, 'objective/kl': 78.1891098022461, 'objective/entropy': 66.12026977539062, 'objective/non_score_reward': -3.9094552993774414, 'objective/rlhf_reward': -4.16996431350708, 'objective/scores': -0.2605092227458954, 'policy/approxkl_avg': 0.37912532687187195, 'policy/clipfrac_avg': 0.23938678205013275, 'loss/policy_avg': -0.05592763423919678, 'loss/value_avg': 0.4532592296600342, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.306236743927002, 'val/ratio': 1.5513300895690918, 'val/ratio_var': 0.15019337832927704, 'val/num_eos_tokens': 0, 'lr': 3.0524252817246455e-05, 'episode': 3184, 'epoch': 0.39}



                                                                         
 39%|███▉      | 797/2041 [1:09:07<1:46:29,  5.14s/it]

{'eps': 0, 'objective/kl': 74.1856460571289, 'objective/entropy': 73.7789535522461, 'objective/non_score_reward': -3.709282398223877, 'objective/rlhf_reward': -4.357713222503662, 'objective/scores': -0.6484310030937195, 'policy/approxkl_avg': 0.17867492139339447, 'policy/clipfrac_avg': 0.2641509473323822, 'loss/policy_avg': -0.06380254030227661, 'loss/value_avg': 0.537732720375061, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2568063735961914, 'val/ratio': 1.0123121738433838, 'val/ratio_var': 0.0006668965215794742, 'val/num_eos_tokens': 0, 'lr': 3.0499755022048016e-05, 'episode': 3188, 'epoch': 0.39}



                                                                         
 39%|███▉      | 798/2041 [1:09:12<1:46:40,  5.15s/it]

{'eps': 0, 'objective/kl': 76.16271209716797, 'objective/entropy': 76.77450561523438, 'objective/non_score_reward': -3.808135509490967, 'objective/rlhf_reward': -3.9592461585998535, 'objective/scores': -0.15111064910888672, 'policy/approxkl_avg': 0.2340807020664215, 'policy/clipfrac_avg': 0.2900943160057068, 'loss/policy_avg': -0.06482139229774475, 'loss/value_avg': 0.4724278450012207, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3937067985534668, 'val/ratio': 1.0244319438934326, 'val/ratio_var': 0.001574984285980463, 'val/num_eos_tokens': 0, 'lr': 3.0475257226849584e-05, 'episode': 3192, 'epoch': 0.39}



                                                                         
 39%|███▉      | 799/2041 [1:09:17<1:46:54,  5.16s/it]

{'eps': 0, 'objective/kl': 74.55929565429688, 'objective/entropy': 73.29814910888672, 'objective/non_score_reward': -3.7279648780822754, 'objective/rlhf_reward': -4.008585453033447, 'objective/scores': -0.28062039613723755, 'policy/approxkl_avg': 0.30448034405708313, 'policy/clipfrac_avg': 0.26886793971061707, 'loss/policy_avg': -0.0668007880449295, 'loss/value_avg': 0.4160647988319397, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3299617767333984, 'val/ratio': 1.3036779165267944, 'val/ratio_var': 0.10335554927587509, 'val/num_eos_tokens': 0, 'lr': 3.0450759431651156e-05, 'episode': 3196, 'epoch': 0.39}



                                                                         
 39%|███▉      | 800/2041 [1:09:22<1:46:14,  5.14s/it]

{'eps': 0, 'objective/kl': 83.7044677734375, 'objective/entropy': 66.6588134765625, 'objective/non_score_reward': -4.185223579406738, 'objective/rlhf_reward': -3.8255910873413086, 'objective/scores': 0.3596324324607849, 'policy/approxkl_avg': 0.8771646022796631, 'policy/clipfrac_avg': 0.2783018946647644, 'loss/policy_avg': -0.06537733972072601, 'loss/value_avg': 0.45709991455078125, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.377619981765747, 'val/ratio': 1.0781959295272827, 'val/ratio_var': 0.011335738934576511, 'val/num_eos_tokens': 0, 'lr': 3.042626163645272e-05, 'episode': 3200, 'epoch': 0.39}



                                                                         
 39%|███▉      | 801/2041 [1:09:27<1:46:06,  5.13s/it]

{'eps': 0, 'objective/kl': 77.53826904296875, 'objective/entropy': 62.076053619384766, 'objective/non_score_reward': -3.876913547515869, 'objective/rlhf_reward': -3.9622652530670166, 'objective/scores': -0.08535169064998627, 'policy/approxkl_avg': 0.33980557322502136, 'policy/clipfrac_avg': 0.26886793971061707, 'loss/policy_avg': -0.06025945395231247, 'loss/value_avg': 0.607281506061554, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2965831756591797, 'val/ratio': 0.9943289756774902, 'val/ratio_var': 0.00035072446917183697, 'val/num_eos_tokens': 0, 'lr': 3.0401763841254288e-05, 'episode': 3204, 'epoch': 0.39}



                                                                         
 39%|███▉      | 802/2041 [1:09:33<1:45:42,  5.12s/it]

{'eps': 0, 'objective/kl': 72.58092498779297, 'objective/entropy': 74.26543426513672, 'objective/non_score_reward': -3.6290462017059326, 'objective/rlhf_reward': -4.413593769073486, 'objective/scores': -0.7845474481582642, 'policy/approxkl_avg': 0.11101672053337097, 'policy/clipfrac_avg': 0.27476415038108826, 'loss/policy_avg': -0.061692770570516586, 'loss/value_avg': 0.5251189470291138, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2570910453796387, 'val/ratio': 1.045262336730957, 'val/ratio_var': 0.003142339875921607, 'val/num_eos_tokens': 0, 'lr': 3.0377266046055856e-05, 'episode': 3208, 'epoch': 0.39}



                                                                         
 39%|███▉      | 803/2041 [1:09:38<1:45:51,  5.13s/it]

{'eps': 0, 'objective/kl': 70.7513427734375, 'objective/entropy': 61.68125534057617, 'objective/non_score_reward': -3.537567615509033, 'objective/rlhf_reward': -4.774061679840088, 'objective/scores': -1.2364941835403442, 'policy/approxkl_avg': 0.21838581562042236, 'policy/clipfrac_avg': 0.2429245263338089, 'loss/policy_avg': -0.05697331577539444, 'loss/value_avg': 0.4783031940460205, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1405134201049805, 'val/ratio': 1.0253591537475586, 'val/ratio_var': 0.001249908935278654, 'val/num_eos_tokens': 0, 'lr': 3.035276825085742e-05, 'episode': 3212, 'epoch': 0.39}



                                                                         
 39%|███▉      | 804/2041 [1:09:43<1:45:25,  5.11s/it]

{'eps': 0, 'objective/kl': 84.97184753417969, 'objective/entropy': 83.44816589355469, 'objective/non_score_reward': -4.248592376708984, 'objective/rlhf_reward': -4.710657119750977, 'objective/scores': -0.46206456422805786, 'policy/approxkl_avg': 0.15203633904457092, 'policy/clipfrac_avg': 0.2629716992378235, 'loss/policy_avg': -0.06239067763090134, 'loss/value_avg': 0.5715178847312927, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5593245029449463, 'val/ratio': 1.0362576246261597, 'val/ratio_var': 0.0016815923154354095, 'val/num_eos_tokens': 0, 'lr': 3.0328270455658992e-05, 'episode': 3216, 'epoch': 0.39}



                                                                         
 39%|███▉      | 805/2041 [1:09:48<1:45:44,  5.13s/it]

{'eps': 0, 'objective/kl': 73.85403442382812, 'objective/entropy': 74.768798828125, 'objective/non_score_reward': -3.692701816558838, 'objective/rlhf_reward': -5.661261558532715, 'objective/scores': -1.9685595035552979, 'policy/approxkl_avg': 0.26326167583465576, 'policy/clipfrac_avg': 0.28066039085388184, 'loss/policy_avg': -0.0690688043832779, 'loss/value_avg': 0.8278393149375916, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3544080257415771, 'val/ratio': 1.1654026508331299, 'val/ratio_var': 0.04252861067652702, 'val/num_eos_tokens': 0, 'lr': 3.030377266046056e-05, 'episode': 3220, 'epoch': 0.39}



                                                                         
 39%|███▉      | 806/2041 [1:09:53<1:45:58,  5.15s/it]

{'eps': 0, 'objective/kl': 86.12577819824219, 'objective/entropy': 72.84980773925781, 'objective/non_score_reward': -4.306288719177246, 'objective/rlhf_reward': -4.471623420715332, 'objective/scores': -0.16533470153808594, 'policy/approxkl_avg': 0.14481839537620544, 'policy/clipfrac_avg': 0.26179245114326477, 'loss/policy_avg': -0.05601944401860237, 'loss/value_avg': 0.4533086121082306, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3779114484786987, 'val/ratio': 0.9742718935012817, 'val/ratio_var': 0.0003253474133089185, 'val/num_eos_tokens': 0, 'lr': 3.0279274865262132e-05, 'episode': 3224, 'epoch': 0.39}



                                                                         
 40%|███▉      | 807/2041 [1:09:58<1:46:05,  5.16s/it]

{'eps': 0, 'objective/kl': 92.95794677734375, 'objective/entropy': 55.68023681640625, 'objective/non_score_reward': -4.647897243499756, 'objective/rlhf_reward': -6.03317928314209, 'objective/scores': -1.385282278060913, 'policy/approxkl_avg': 0.10205166786909103, 'policy/clipfrac_avg': 0.2146226465702057, 'loss/policy_avg': -0.05265168473124504, 'loss/value_avg': 0.756371259689331, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1130001544952393, 'val/ratio': 1.029572606086731, 'val/ratio_var': 0.0013582106912508607, 'val/num_eos_tokens': 0, 'lr': 3.0254777070063693e-05, 'episode': 3228, 'epoch': 0.4}



                                                                         
 40%|███▉      | 808/2041 [1:10:03<1:45:59,  5.16s/it]

{'eps': 0, 'objective/kl': 66.22673034667969, 'objective/entropy': 74.22259521484375, 'objective/non_score_reward': -3.3113362789154053, 'objective/rlhf_reward': -4.220312118530273, 'objective/scores': -0.9089758396148682, 'policy/approxkl_avg': 0.1680523008108139, 'policy/clipfrac_avg': 0.26650941371917725, 'loss/policy_avg': -0.05912516266107559, 'loss/value_avg': 0.41235634684562683, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3400495052337646, 'val/ratio': 1.0820157527923584, 'val/ratio_var': 0.006892547011375427, 'val/num_eos_tokens': 0, 'lr': 3.0230279274865265e-05, 'episode': 3232, 'epoch': 0.4}



                                                                         
 40%|███▉      | 809/2041 [1:10:09<1:45:45,  5.15s/it]

{'eps': 0, 'objective/kl': 93.41162109375, 'objective/entropy': 76.54003143310547, 'objective/non_score_reward': -4.670581340789795, 'objective/rlhf_reward': -5.817376136779785, 'objective/scores': -1.1467945575714111, 'policy/approxkl_avg': 0.1401030421257019, 'policy/clipfrac_avg': 0.26179245114326477, 'loss/policy_avg': -0.0631541907787323, 'loss/value_avg': 0.8083357214927673, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.484363317489624, 'val/ratio': 1.0683045387268066, 'val/ratio_var': 0.00814807415008545, 'val/num_eos_tokens': 0, 'lr': 3.0205781479666833e-05, 'episode': 3236, 'epoch': 0.4}



                                                                         
 40%|███▉      | 810/2041 [1:10:14<1:45:10,  5.13s/it]

{'eps': 0, 'objective/kl': 96.89207458496094, 'objective/entropy': 68.87379455566406, 'objective/non_score_reward': -4.8446044921875, 'objective/rlhf_reward': -5.660345077514648, 'objective/scores': -0.815740704536438, 'policy/approxkl_avg': 0.19081942737102509, 'policy/clipfrac_avg': 0.2582547068595886, 'loss/policy_avg': -0.059979408979415894, 'loss/value_avg': 0.6571766138076782, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3565106391906738, 'val/ratio': 0.9686511754989624, 'val/ratio_var': 0.0004464668163564056, 'val/num_eos_tokens': 0, 'lr': 3.0181283684468397e-05, 'episode': 3240, 'epoch': 0.4}



                                                                         
 40%|███▉      | 811/2041 [1:10:19<1:45:04,  5.13s/it]

{'eps': 0, 'objective/kl': 91.28510284423828, 'objective/entropy': 62.823577880859375, 'objective/non_score_reward': -4.564255237579346, 'objective/rlhf_reward': -4.521342754364014, 'objective/scores': 0.04291271045804024, 'policy/approxkl_avg': 0.21765798330307007, 'policy/clipfrac_avg': 0.2853773534297943, 'loss/policy_avg': -0.06075200438499451, 'loss/value_avg': 0.615041971206665, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2745888233184814, 'val/ratio': 1.0605592727661133, 'val/ratio_var': 0.003831444540992379, 'val/num_eos_tokens': 0, 'lr': 3.015678588926997e-05, 'episode': 3244, 'epoch': 0.4}



                                                                         
 40%|███▉      | 812/2041 [1:10:24<1:45:36,  5.16s/it]

{'eps': 0, 'objective/kl': 76.67925262451172, 'objective/entropy': 89.8624267578125, 'objective/non_score_reward': -3.8339626789093018, 'objective/rlhf_reward': -4.825754642486572, 'objective/scores': -0.9917920231819153, 'policy/approxkl_avg': 0.26152634620666504, 'policy/clipfrac_avg': 0.3195754885673523, 'loss/policy_avg': -0.07271730899810791, 'loss/value_avg': 0.5487962961196899, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5252059698104858, 'val/ratio': 1.0178805589675903, 'val/ratio_var': 0.0016912597930058837, 'val/num_eos_tokens': 0, 'lr': 3.0132288094071537e-05, 'episode': 3248, 'epoch': 0.4}



                                                                         
 40%|███▉      | 813/2041 [1:10:29<1:45:12,  5.14s/it]

{'eps': 0, 'objective/kl': 89.25315856933594, 'objective/entropy': 62.618202209472656, 'objective/non_score_reward': -4.462657928466797, 'objective/rlhf_reward': -5.019488334655762, 'objective/scores': -0.5568305253982544, 'policy/approxkl_avg': 0.5865646600723267, 'policy/clipfrac_avg': 0.24764151871204376, 'loss/policy_avg': -0.05364497750997543, 'loss/value_avg': 0.5537370443344116, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2788712978363037, 'val/ratio': 1.0999983549118042, 'val/ratio_var': 0.014021173119544983, 'val/num_eos_tokens': 0, 'lr': 3.01077902988731e-05, 'episode': 3252, 'epoch': 0.4}



                                                                         
 40%|███▉      | 814/2041 [1:10:34<1:45:36,  5.16s/it]

{'eps': 0, 'objective/kl': 96.2325668334961, 'objective/entropy': 60.23713684082031, 'objective/non_score_reward': -4.811628341674805, 'objective/rlhf_reward': -5.593686103820801, 'objective/scores': -0.782057523727417, 'policy/approxkl_avg': 0.20608994364738464, 'policy/clipfrac_avg': 0.22523584961891174, 'loss/policy_avg': -0.04925128072500229, 'loss/value_avg': 0.6431204080581665, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1963437795639038, 'val/ratio': 1.8993531465530396, 'val/ratio_var': 0.6975904107093811, 'val/num_eos_tokens': 0, 'lr': 3.008329250367467e-05, 'episode': 3256, 'epoch': 0.4}



                                                                         
 40%|███▉      | 815/2041 [1:10:40<1:45:41,  5.17s/it]

{'eps': 0, 'objective/kl': 98.9499740600586, 'objective/entropy': 70.41948699951172, 'objective/non_score_reward': -4.947498321533203, 'objective/rlhf_reward': -6.178953647613525, 'objective/scores': -1.2314553260803223, 'policy/approxkl_avg': 0.10507598519325256, 'policy/clipfrac_avg': 0.2511792480945587, 'loss/policy_avg': -0.05931885167956352, 'loss/value_avg': 0.7145767211914062, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.25617253780365, 'val/ratio': 1.0037312507629395, 'val/ratio_var': 0.0005046918522566557, 'val/num_eos_tokens': 0, 'lr': 3.005879470847624e-05, 'episode': 3260, 'epoch': 0.4}



                                                                         
 40%|███▉      | 816/2041 [1:10:45<1:45:30,  5.17s/it]

{'eps': 0, 'objective/kl': 88.74874877929688, 'objective/entropy': 86.73954772949219, 'objective/non_score_reward': -4.437438011169434, 'objective/rlhf_reward': -4.427677631378174, 'objective/scores': 0.009760499000549316, 'policy/approxkl_avg': 0.12646609544754028, 'policy/clipfrac_avg': 0.2771226465702057, 'loss/policy_avg': -0.06702559441328049, 'loss/value_avg': 0.517530083656311, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3948838710784912, 'val/ratio': 1.0258009433746338, 'val/ratio_var': 0.0008812358719296753, 'val/num_eos_tokens': 0, 'lr': 3.0034296913277805e-05, 'episode': 3264, 'epoch': 0.4}



                                                                         
 40%|████      | 817/2041 [1:10:50<1:45:01,  5.15s/it]

{'eps': 0, 'objective/kl': 85.32952880859375, 'objective/entropy': 58.03849411010742, 'objective/non_score_reward': -4.266477108001709, 'objective/rlhf_reward': -6.05696964263916, 'objective/scores': -1.7904927730560303, 'policy/approxkl_avg': 0.12801803648471832, 'policy/clipfrac_avg': 0.2158018946647644, 'loss/policy_avg': -0.051422201097011566, 'loss/value_avg': 0.7200466394424438, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1561698913574219, 'val/ratio': 1.0903704166412354, 'val/ratio_var': 0.008688567206263542, 'val/num_eos_tokens': 0, 'lr': 3.0009799118079374e-05, 'episode': 3268, 'epoch': 0.4}



                                                                         
 40%|████      | 818/2041 [1:10:58<2:04:22,  6.10s/it]

{'eps': 0, 'objective/kl': 64.57634735107422, 'objective/entropy': 60.47710418701172, 'objective/non_score_reward': -3.2288174629211426, 'objective/rlhf_reward': -3.7330429553985596, 'objective/scores': -0.5042255520820618, 'policy/approxkl_avg': 0.13970598578453064, 'policy/clipfrac_avg': 0.24056604504585266, 'loss/policy_avg': -0.05761034041643143, 'loss/value_avg': 0.36072200536727905, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2162271738052368, 'val/ratio': 1.025944471359253, 'val/ratio_var': 0.0010634387144818902, 'val/num_eos_tokens': 0, 'lr': 2.998530132288094e-05, 'episode': 3272, 'epoch': 0.4}



                                                                         
 40%|████      | 819/2041 [1:11:03<1:58:57,  5.84s/it]

{'eps': 0, 'objective/kl': 79.52359008789062, 'objective/entropy': 51.398685455322266, 'objective/non_score_reward': -3.976179599761963, 'objective/rlhf_reward': -4.993542194366455, 'objective/scores': -1.0173624753952026, 'policy/approxkl_avg': 0.3011089265346527, 'policy/clipfrac_avg': 0.17924529314041138, 'loss/policy_avg': -0.04675275459885597, 'loss/value_avg': 0.5808147192001343, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8324828743934631, 'val/ratio': 0.9757596254348755, 'val/ratio_var': 0.0002687652304302901, 'val/num_eos_tokens': 0, 'lr': 2.9960803527682513e-05, 'episode': 3276, 'epoch': 0.4}



                                                                         
 40%|████      | 820/2041 [1:11:09<1:55:08,  5.66s/it]

{'eps': 0, 'objective/kl': 90.36852264404297, 'objective/entropy': 54.446563720703125, 'objective/non_score_reward': -4.518425941467285, 'objective/rlhf_reward': -5.785759925842285, 'objective/scores': -1.267333745956421, 'policy/approxkl_avg': 0.09988346695899963, 'policy/clipfrac_avg': 0.2299528270959854, 'loss/policy_avg': -0.055551446974277496, 'loss/value_avg': 0.7317770719528198, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9961308240890503, 'val/ratio': 1.0623903274536133, 'val/ratio_var': 0.0034746192395687103, 'val/num_eos_tokens': 0, 'lr': 2.9936305732484078e-05, 'episode': 3280, 'epoch': 0.4}



                                                                         
 40%|████      | 821/2041 [1:11:14<1:51:41,  5.49s/it]

{'eps': 0, 'objective/kl': 96.04332733154297, 'objective/entropy': 59.468666076660156, 'objective/non_score_reward': -4.802166938781738, 'objective/rlhf_reward': -5.575803279876709, 'objective/scores': -0.7736363410949707, 'policy/approxkl_avg': 0.14624156057834625, 'policy/clipfrac_avg': 0.23349055647850037, 'loss/policy_avg': -0.05903922766447067, 'loss/value_avg': 0.6990334391593933, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0652275085449219, 'val/ratio': 1.0063295364379883, 'val/ratio_var': 0.00019290833733975887, 'val/num_eos_tokens': 0, 'lr': 2.9911807937285646e-05, 'episode': 3284, 'epoch': 0.4}



                                                                         
 40%|████      | 822/2041 [1:11:19<1:49:32,  5.39s/it]

{'eps': 0, 'objective/kl': 102.50093078613281, 'objective/entropy': 72.01042175292969, 'objective/non_score_reward': -5.125046730041504, 'objective/rlhf_reward': -6.375353813171387, 'objective/scores': -1.250307321548462, 'policy/approxkl_avg': 0.20354613661766052, 'policy/clipfrac_avg': 0.2641509473323822, 'loss/policy_avg': -0.0603601410984993, 'loss/value_avg': 0.823577344417572, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3152539730072021, 'val/ratio': 1.064763069152832, 'val/ratio_var': 0.0030846029985696077, 'val/num_eos_tokens': 0, 'lr': 2.9887310142087217e-05, 'episode': 3288, 'epoch': 0.4}



                                                                         
 40%|████      | 823/2041 [1:11:24<1:48:07,  5.33s/it]

{'eps': 0, 'objective/kl': 89.51361083984375, 'objective/entropy': 83.29997253417969, 'objective/non_score_reward': -4.475680828094482, 'objective/rlhf_reward': -4.8955559730529785, 'objective/scores': -0.4198749363422394, 'policy/approxkl_avg': 0.4002401530742645, 'policy/clipfrac_avg': 0.28891509771347046, 'loss/policy_avg': -0.06704462319612503, 'loss/value_avg': 0.6321967840194702, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.417056679725647, 'val/ratio': 1.1331734657287598, 'val/ratio_var': 0.024821609258651733, 'val/num_eos_tokens': 0, 'lr': 2.986281234688878e-05, 'episode': 3292, 'epoch': 0.4}



                                                                         
 40%|████      | 824/2041 [1:11:29<1:47:31,  5.30s/it]

{'eps': 0, 'objective/kl': 78.15086364746094, 'objective/entropy': 71.12092590332031, 'objective/non_score_reward': -3.907543182373047, 'objective/rlhf_reward': -4.6435651779174805, 'objective/scores': -0.7360221147537231, 'policy/approxkl_avg': 0.36947035789489746, 'policy/clipfrac_avg': 0.24764151871204376, 'loss/policy_avg': -0.058998748660087585, 'loss/value_avg': 0.5018801689147949, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.107024908065796, 'val/ratio': 1.113619089126587, 'val/ratio_var': 0.018001968041062355, 'val/num_eos_tokens': 0, 'lr': 2.983831455169035e-05, 'episode': 3296, 'epoch': 0.4}



                                                                         
 40%|████      | 825/2041 [1:11:34<1:46:40,  5.26s/it]

{'eps': 0, 'objective/kl': 84.35365295410156, 'objective/entropy': 61.90178298950195, 'objective/non_score_reward': -4.217682838439941, 'objective/rlhf_reward': -5.618304252624512, 'objective/scores': -1.4006211757659912, 'policy/approxkl_avg': 0.19653981924057007, 'policy/clipfrac_avg': 0.20518867671489716, 'loss/policy_avg': -0.0521266870200634, 'loss/value_avg': 0.4953901171684265, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0789400339126587, 'val/ratio': 1.0413239002227783, 'val/ratio_var': 0.0034575306344777346, 'val/num_eos_tokens': 0, 'lr': 2.9813816756491918e-05, 'episode': 3300, 'epoch': 0.4}



                                                                         
 40%|████      | 826/2041 [1:11:40<1:45:34,  5.21s/it]

{'eps': 0, 'objective/kl': 101.33354949951172, 'objective/entropy': 68.95584869384766, 'objective/non_score_reward': -5.066678047180176, 'objective/rlhf_reward': -6.244141578674316, 'objective/scores': -1.1774636507034302, 'policy/approxkl_avg': 0.17669586837291718, 'policy/clipfrac_avg': 0.2570754885673523, 'loss/policy_avg': -0.06080794334411621, 'loss/value_avg': 0.7716628313064575, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1807000637054443, 'val/ratio': 1.0473616123199463, 'val/ratio_var': 0.0019535229075700045, 'val/num_eos_tokens': 0, 'lr': 2.9789318961293483e-05, 'episode': 3304, 'epoch': 0.4}



                                                                         
 41%|████      | 827/2041 [1:11:45<1:45:11,  5.20s/it]

{'eps': 0, 'objective/kl': 86.50177764892578, 'objective/entropy': 50.74269104003906, 'objective/non_score_reward': -4.3250885009765625, 'objective/rlhf_reward': -6.580050468444824, 'objective/scores': -2.2549617290496826, 'policy/approxkl_avg': 0.2060161679983139, 'policy/clipfrac_avg': 0.2216981053352356, 'loss/policy_avg': -0.05285076051950455, 'loss/value_avg': 0.8559454679489136, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9308449625968933, 'val/ratio': 0.9799472689628601, 'val/ratio_var': 0.00029460605583153665, 'val/num_eos_tokens': 0, 'lr': 2.9764821166095054e-05, 'episode': 3308, 'epoch': 0.41}



                                                                         
 41%|████      | 828/2041 [1:11:50<1:44:51,  5.19s/it]

{'eps': 0, 'objective/kl': 95.2051773071289, 'objective/entropy': 71.87389373779297, 'objective/non_score_reward': -4.760258674621582, 'objective/rlhf_reward': -6.067485809326172, 'objective/scores': -1.3072271347045898, 'policy/approxkl_avg': 0.08916034549474716, 'policy/clipfrac_avg': 0.25353774428367615, 'loss/policy_avg': -0.05690542608499527, 'loss/value_avg': 0.6279977560043335, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.275834560394287, 'val/ratio': 1.0248963832855225, 'val/ratio_var': 0.0014309771358966827, 'val/num_eos_tokens': 0, 'lr': 2.9740323370896622e-05, 'episode': 3312, 'epoch': 0.41}



                                                                         
 41%|████      | 829/2041 [1:11:55<1:45:00,  5.20s/it]

{'eps': 0, 'objective/kl': 90.27806091308594, 'objective/entropy': 62.091041564941406, 'objective/non_score_reward': -4.5139031410217285, 'objective/rlhf_reward': -6.079469680786133, 'objective/scores': -1.5655665397644043, 'policy/approxkl_avg': 0.2834397852420807, 'policy/clipfrac_avg': 0.2358490526676178, 'loss/policy_avg': -0.05356719717383385, 'loss/value_avg': 0.5995082855224609, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.285740852355957, 'val/ratio': 0.9940943717956543, 'val/ratio_var': 0.00014694922720082104, 'val/num_eos_tokens': 0, 'lr': 2.9715825575698187e-05, 'episode': 3316, 'epoch': 0.41}



                                                                         
 41%|████      | 830/2041 [1:12:00<1:45:16,  5.22s/it]

{'eps': 0, 'objective/kl': 73.13494873046875, 'objective/entropy': 76.2987289428711, 'objective/non_score_reward': -3.656747341156006, 'objective/rlhf_reward': -5.446554660797119, 'objective/scores': -1.7898073196411133, 'policy/approxkl_avg': 0.2534569501876831, 'policy/clipfrac_avg': 0.2688679099082947, 'loss/policy_avg': -0.06624308228492737, 'loss/value_avg': 0.5731728672981262, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4895203113555908, 'val/ratio': 1.027340054512024, 'val/ratio_var': 0.002255150815472007, 'val/num_eos_tokens': 0, 'lr': 2.9691327780499755e-05, 'episode': 3320, 'epoch': 0.41}



                                                                         
 41%|████      | 831/2041 [1:12:05<1:44:42,  5.19s/it]

{'eps': 0, 'objective/kl': 97.19505310058594, 'objective/entropy': 95.62583923339844, 'objective/non_score_reward': -4.859752655029297, 'objective/rlhf_reward': -5.833573818206787, 'objective/scores': -0.9738213419914246, 'policy/approxkl_avg': 0.19028203189373016, 'policy/clipfrac_avg': 0.2712264060974121, 'loss/policy_avg': -0.06947748363018036, 'loss/value_avg': 0.6545954942703247, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.736363172531128, 'val/ratio': 1.1070210933685303, 'val/ratio_var': 0.009976574219763279, 'val/num_eos_tokens': 0, 'lr': 2.9666829985301326e-05, 'episode': 3324, 'epoch': 0.41}



                                                                         
 41%|████      | 832/2041 [1:12:11<1:44:21,  5.18s/it]

{'eps': 0, 'objective/kl': 102.16153717041016, 'objective/entropy': 65.34529113769531, 'objective/non_score_reward': -5.108077049255371, 'objective/rlhf_reward': -6.528682708740234, 'objective/scores': -1.4206054210662842, 'policy/approxkl_avg': 0.1616201549768448, 'policy/clipfrac_avg': 0.25471699237823486, 'loss/policy_avg': -0.06475032866001129, 'loss/value_avg': 0.7760459184646606, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4519121646881104, 'val/ratio': 0.9949338436126709, 'val/ratio_var': 0.0002944363804999739, 'val/num_eos_tokens': 0, 'lr': 2.964233219010289e-05, 'episode': 3328, 'epoch': 0.41}



                                                                         
 41%|████      | 833/2041 [1:12:16<1:44:29,  5.19s/it]

{'eps': 0, 'objective/kl': 98.60504150390625, 'objective/entropy': 64.16767883300781, 'objective/non_score_reward': -4.9302520751953125, 'objective/rlhf_reward': -6.766784191131592, 'objective/scores': -1.8365321159362793, 'policy/approxkl_avg': 0.367463082075119, 'policy/clipfrac_avg': 0.2287735939025879, 'loss/policy_avg': -0.060362935066223145, 'loss/value_avg': 0.7332738637924194, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.249225378036499, 'val/ratio': 0.9856066703796387, 'val/ratio_var': 0.0001655011874390766, 'val/num_eos_tokens': 0, 'lr': 2.961783439490446e-05, 'episode': 3332, 'epoch': 0.41}



                                                                         
 41%|████      | 834/2041 [1:12:21<1:44:36,  5.20s/it]

{'eps': 0, 'objective/kl': 111.47904968261719, 'objective/entropy': 61.109737396240234, 'objective/non_score_reward': -5.573951721191406, 'objective/rlhf_reward': -6.518811225891113, 'objective/scores': -0.9448596239089966, 'policy/approxkl_avg': 0.14666904509067535, 'policy/clipfrac_avg': 0.23113209009170532, 'loss/policy_avg': -0.056863002479076385, 'loss/value_avg': 0.7810265421867371, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1714389324188232, 'val/ratio': 0.9877110123634338, 'val/ratio_var': 9.414037776878104e-05, 'val/num_eos_tokens': 0, 'lr': 2.9593336599706027e-05, 'episode': 3336, 'epoch': 0.41}



                                                                         
 41%|████      | 835/2041 [1:12:26<1:44:15,  5.19s/it]

{'eps': 0, 'objective/kl': 82.7589340209961, 'objective/entropy': 59.40056610107422, 'objective/non_score_reward': -4.137946605682373, 'objective/rlhf_reward': -5.988390922546387, 'objective/scores': -1.8504443168640137, 'policy/approxkl_avg': 0.10434263199567795, 'policy/clipfrac_avg': 0.2158018797636032, 'loss/policy_avg': -0.05510334670543671, 'loss/value_avg': 0.7234671115875244, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1735424995422363, 'val/ratio': 1.054008960723877, 'val/ratio_var': 0.003047322854399681, 'val/num_eos_tokens': 0, 'lr': 2.95688388045076e-05, 'episode': 3340, 'epoch': 0.41}



                                                                         
 41%|████      | 836/2041 [1:12:31<1:44:24,  5.20s/it]

{'eps': 0, 'objective/kl': 84.2261962890625, 'objective/entropy': 70.52214050292969, 'objective/non_score_reward': -4.211310386657715, 'objective/rlhf_reward': -6.138148784637451, 'objective/scores': -1.9268383979797363, 'policy/approxkl_avg': 0.11153151094913483, 'policy/clipfrac_avg': 0.25235849618911743, 'loss/policy_avg': -0.056665048003196716, 'loss/value_avg': 0.7140794992446899, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3014605045318604, 'val/ratio': 1.0485379695892334, 'val/ratio_var': 0.0037139200139790773, 'val/num_eos_tokens': 0, 'lr': 2.9544341009309163e-05, 'episode': 3344, 'epoch': 0.41}



                                                                         
 41%|████      | 837/2041 [1:12:37<1:44:21,  5.20s/it]

{'eps': 0, 'objective/kl': 94.02763366699219, 'objective/entropy': 68.59683227539062, 'objective/non_score_reward': -4.701381683349609, 'objective/rlhf_reward': -6.735983371734619, 'objective/scores': -2.0346016883850098, 'policy/approxkl_avg': 0.07599815726280212, 'policy/clipfrac_avg': 0.18632075190544128, 'loss/policy_avg': -0.04356811195611954, 'loss/value_avg': 0.7734655141830444, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.321062445640564, 'val/ratio': 0.994707465171814, 'val/ratio_var': 6.170782580738887e-05, 'val/num_eos_tokens': 0, 'lr': 2.951984321411073e-05, 'episode': 3348, 'epoch': 0.41}



                                                                         
 41%|████      | 838/2041 [1:12:42<1:43:47,  5.18s/it]

{'eps': 0, 'objective/kl': 92.88246154785156, 'objective/entropy': 88.63551330566406, 'objective/non_score_reward': -4.644123077392578, 'objective/rlhf_reward': -6.0914306640625, 'objective/scores': -1.4473074674606323, 'policy/approxkl_avg': 0.08745285868644714, 'policy/clipfrac_avg': 0.24764150381088257, 'loss/policy_avg': -0.054438233375549316, 'loss/value_avg': 0.5168647766113281, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5142765045166016, 'val/ratio': 1.0055197477340698, 'val/ratio_var': 0.00015356864605564624, 'val/num_eos_tokens': 0, 'lr': 2.9495345418912302e-05, 'episode': 3352, 'epoch': 0.41}



                                                                         
 41%|████      | 839/2041 [1:12:47<1:43:14,  5.15s/it]

{'eps': 0, 'objective/kl': 92.22538757324219, 'objective/entropy': 63.931785583496094, 'objective/non_score_reward': -4.611268997192383, 'objective/rlhf_reward': -6.0876359939575195, 'objective/scores': -1.4763669967651367, 'policy/approxkl_avg': 0.0977235808968544, 'policy/clipfrac_avg': 0.2075471580028534, 'loss/policy_avg': -0.04574932903051376, 'loss/value_avg': 0.605424165725708, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3336591720581055, 'val/ratio': 0.9751371741294861, 'val/ratio_var': 0.00029056789935566485, 'val/num_eos_tokens': 0, 'lr': 2.9470847623713864e-05, 'episode': 3356, 'epoch': 0.41}



                                                                         
 41%|████      | 840/2041 [1:12:52<1:43:22,  5.16s/it]

{'eps': 0, 'objective/kl': 101.30387878417969, 'objective/entropy': 78.36870574951172, 'objective/non_score_reward': -5.065194129943848, 'objective/rlhf_reward': -7.561032772064209, 'objective/scores': -2.4958386421203613, 'policy/approxkl_avg': 0.241439551115036, 'policy/clipfrac_avg': 0.26886793971061707, 'loss/policy_avg': -0.05828166380524635, 'loss/value_avg': 0.9532016515731812, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5882441997528076, 'val/ratio': 1.2384824752807617, 'val/ratio_var': 0.08220091462135315, 'val/num_eos_tokens': 0, 'lr': 2.9446349828515435e-05, 'episode': 3360, 'epoch': 0.41}



                                                                         
 41%|████      | 841/2041 [1:12:57<1:42:57,  5.15s/it]

{'eps': 0, 'objective/kl': 83.16340637207031, 'objective/entropy': 48.782649993896484, 'objective/non_score_reward': -4.158170700073242, 'objective/rlhf_reward': -6.158926963806152, 'objective/scores': -2.000756025314331, 'policy/approxkl_avg': 0.11690722405910492, 'policy/clipfrac_avg': 0.19929245114326477, 'loss/policy_avg': -0.04681646078824997, 'loss/value_avg': 0.7046549320220947, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9981091618537903, 'val/ratio': 0.9483765363693237, 'val/ratio_var': 0.0013796425191685557, 'val/num_eos_tokens': 0, 'lr': 2.9421852033317003e-05, 'episode': 3364, 'epoch': 0.41}



                                                                         
 41%|████▏     | 842/2041 [1:13:02<1:43:00,  5.15s/it]

{'eps': 0, 'objective/kl': 80.41952514648438, 'objective/entropy': 59.586326599121094, 'objective/non_score_reward': -4.0209760665893555, 'objective/rlhf_reward': -5.463229656219482, 'objective/scores': -1.442253589630127, 'policy/approxkl_avg': 0.18816131353378296, 'policy/clipfrac_avg': 0.18396225571632385, 'loss/policy_avg': -0.051556624472141266, 'loss/value_avg': 0.5795612931251526, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2197835445404053, 'val/ratio': 0.9925364255905151, 'val/ratio_var': 0.00022112733859103173, 'val/num_eos_tokens': 0, 'lr': 2.9397354238118568e-05, 'episode': 3368, 'epoch': 0.41}



                                                                         
 41%|████▏     | 843/2041 [1:13:08<1:43:27,  5.18s/it]

{'eps': 0, 'objective/kl': 84.31170654296875, 'objective/entropy': 65.35660552978516, 'objective/non_score_reward': -4.215585231781006, 'objective/rlhf_reward': -6.012731552124023, 'objective/scores': -1.7971463203430176, 'policy/approxkl_avg': 0.07641755789518356, 'policy/clipfrac_avg': 0.2158018946647644, 'loss/policy_avg': -0.0593215674161911, 'loss/value_avg': 0.5767824053764343, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3295750617980957, 'val/ratio': 0.9826477766036987, 'val/ratio_var': 0.00018618338799569756, 'val/num_eos_tokens': 0, 'lr': 2.937285644292014e-05, 'episode': 3372, 'epoch': 0.41}



                                                                         
 41%|████▏     | 844/2041 [1:13:13<1:43:21,  5.18s/it]

{'eps': 0, 'objective/kl': 77.20376586914062, 'objective/entropy': 67.15552520751953, 'objective/non_score_reward': -3.8601880073547363, 'objective/rlhf_reward': -5.409738063812256, 'objective/scores': -1.549550175666809, 'policy/approxkl_avg': 0.23363247513771057, 'policy/clipfrac_avg': 0.18278302252292633, 'loss/policy_avg': -0.04223332181572914, 'loss/value_avg': 0.6657858490943909, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1949516534805298, 'val/ratio': 0.9508863687515259, 'val/ratio_var': 0.001193352392874658, 'val/num_eos_tokens': 0, 'lr': 2.9348358647721707e-05, 'episode': 3376, 'epoch': 0.41}



                                                                         
 41%|████▏     | 845/2041 [1:13:18<1:43:09,  5.18s/it]

{'eps': 0, 'objective/kl': 98.53673553466797, 'objective/entropy': 93.71036529541016, 'objective/non_score_reward': -4.926836967468262, 'objective/rlhf_reward': -6.118584632873535, 'objective/scores': -1.1917479038238525, 'policy/approxkl_avg': 0.22599387168884277, 'policy/clipfrac_avg': 0.21108490228652954, 'loss/policy_avg': -0.05583071708679199, 'loss/value_avg': 0.871695876121521, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.7024699449539185, 'val/ratio': 0.9410092830657959, 'val/ratio_var': 0.0017234679544344544, 'val/num_eos_tokens': 0, 'lr': 2.9323860852523272e-05, 'episode': 3380, 'epoch': 0.41}



                                                                         
 41%|████▏     | 846/2041 [1:13:23<1:43:02,  5.17s/it]

{'eps': 0, 'objective/kl': 79.42496490478516, 'objective/entropy': 35.366851806640625, 'objective/non_score_reward': -3.971248149871826, 'objective/rlhf_reward': -5.820622444152832, 'objective/scores': -1.8493742942810059, 'policy/approxkl_avg': 0.129366934299469, 'policy/clipfrac_avg': 0.12971697747707367, 'loss/policy_avg': -0.041579291224479675, 'loss/value_avg': 0.4645931124687195, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7959328889846802, 'val/ratio': 0.9458897113800049, 'val/ratio_var': 0.0017385765677317977, 'val/num_eos_tokens': 0, 'lr': 2.929936305732484e-05, 'episode': 3384, 'epoch': 0.41}



                                                                         
 41%|████▏     | 847/2041 [1:13:28<1:43:00,  5.18s/it]

{'eps': 0, 'objective/kl': 56.306915283203125, 'objective/entropy': 27.202335357666016, 'objective/non_score_reward': -2.8153457641601562, 'objective/rlhf_reward': -4.792141914367676, 'objective/scores': -1.9767961502075195, 'policy/approxkl_avg': 0.12057793140411377, 'policy/clipfrac_avg': 0.09905660152435303, 'loss/policy_avg': -0.03304816782474518, 'loss/value_avg': 0.38617631793022156, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.46096938848495483, 'val/ratio': 0.9602900743484497, 'val/ratio_var': 0.00082819489762187, 'val/num_eos_tokens': 0, 'lr': 2.927486526212641e-05, 'episode': 3388, 'epoch': 0.42}



                                                                         
 42%|████▏     | 848/2041 [1:13:33<1:42:56,  5.18s/it]

{'eps': 0, 'objective/kl': 72.0476303100586, 'objective/entropy': 56.553565979003906, 'objective/non_score_reward': -3.602381467819214, 'objective/rlhf_reward': -5.113638401031494, 'objective/scores': -1.5112569332122803, 'policy/approxkl_avg': 0.15981194376945496, 'policy/clipfrac_avg': 0.18632075190544128, 'loss/policy_avg': -0.04474511370062828, 'loss/value_avg': 0.46957090497016907, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9950958490371704, 'val/ratio': 1.0556398630142212, 'val/ratio_var': 0.004550186451524496, 'val/num_eos_tokens': 0, 'lr': 2.925036746692798e-05, 'episode': 3392, 'epoch': 0.42}



                                                                         
 42%|████▏     | 849/2041 [1:13:39<1:42:46,  5.17s/it]

{'eps': 0, 'objective/kl': 56.072391510009766, 'objective/entropy': 12.896045684814453, 'objective/non_score_reward': -2.803619861602783, 'objective/rlhf_reward': -4.064687728881836, 'objective/scores': -1.2610676288604736, 'policy/approxkl_avg': 0.02316313423216343, 'policy/clipfrac_avg': 0.0766509473323822, 'loss/policy_avg': -0.03274144232273102, 'loss/value_avg': 0.1893455684185028, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.3530080318450928, 'val/ratio': 0.963283121585846, 'val/ratio_var': 0.0009300554520450532, 'val/num_eos_tokens': 0, 'lr': 2.9225869671729544e-05, 'episode': 3396, 'epoch': 0.42}



                                                                         
 42%|████▏     | 850/2041 [1:13:44<1:42:13,  5.15s/it]

{'eps': 0, 'objective/kl': 70.34944152832031, 'objective/entropy': 47.0030517578125, 'objective/non_score_reward': -3.517472267150879, 'objective/rlhf_reward': -5.183885097503662, 'objective/scores': -1.6664129495620728, 'policy/approxkl_avg': 0.20648807287216187, 'policy/clipfrac_avg': 0.1462264209985733, 'loss/policy_avg': -0.04804505780339241, 'loss/value_avg': 0.4092140197753906, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8090341687202454, 'val/ratio': 0.9317200779914856, 'val/ratio_var': 0.0026258572470396757, 'val/num_eos_tokens': 0, 'lr': 2.9201371876531116e-05, 'episode': 3400, 'epoch': 0.42}



                                                                         
 42%|████▏     | 851/2041 [1:13:49<1:42:09,  5.15s/it]

{'eps': 0, 'objective/kl': 101.99989318847656, 'objective/entropy': 70.29977416992188, 'objective/non_score_reward': -5.099995136260986, 'objective/rlhf_reward': -6.472441673278809, 'objective/scores': -1.3724467754364014, 'policy/approxkl_avg': 0.09099198877811432, 'policy/clipfrac_avg': 0.20636790990829468, 'loss/policy_avg': -0.04937238618731499, 'loss/value_avg': 0.7616170048713684, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3763697147369385, 'val/ratio': 1.0749359130859375, 'val/ratio_var': 0.0071669952012598515, 'val/num_eos_tokens': 0, 'lr': 2.9176874081332684e-05, 'episode': 3404, 'epoch': 0.42}



                                                                         
 42%|████▏     | 852/2041 [1:13:54<1:42:41,  5.18s/it]

{'eps': 0, 'objective/kl': 87.79878234863281, 'objective/entropy': 65.2417984008789, 'objective/non_score_reward': -4.389938831329346, 'objective/rlhf_reward': -6.00685977935791, 'objective/scores': -1.6169207096099854, 'policy/approxkl_avg': 0.2593609690666199, 'policy/clipfrac_avg': 0.24646225571632385, 'loss/policy_avg': -0.06059233844280243, 'loss/value_avg': 0.7492151260375977, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1962788105010986, 'val/ratio': 1.1334304809570312, 'val/ratio_var': 0.022247537970542908, 'val/num_eos_tokens': 0, 'lr': 2.9152376286134248e-05, 'episode': 3408, 'epoch': 0.42}



                                                                         
 42%|████▏     | 853/2041 [1:13:59<1:42:52,  5.20s/it]

{'eps': 0, 'objective/kl': 95.93756103515625, 'objective/entropy': 83.08499145507812, 'objective/non_score_reward': -4.796877861022949, 'objective/rlhf_reward': -5.647501468658447, 'objective/scores': -0.8506234288215637, 'policy/approxkl_avg': 0.07431206107139587, 'policy/clipfrac_avg': 0.21933962404727936, 'loss/policy_avg': -0.05186781287193298, 'loss/value_avg': 0.53849196434021, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4915621280670166, 'val/ratio': 1.028031349182129, 'val/ratio_var': 0.001162477070465684, 'val/num_eos_tokens': 0, 'lr': 2.9127878490935816e-05, 'episode': 3412, 'epoch': 0.42}



                                                                         
 42%|████▏     | 854/2041 [1:14:04<1:42:20,  5.17s/it]

{'eps': 0, 'objective/kl': 88.58492279052734, 'objective/entropy': 40.20914840698242, 'objective/non_score_reward': -4.429245948791504, 'objective/rlhf_reward': -6.325857639312744, 'objective/scores': -1.8966118097305298, 'policy/approxkl_avg': 0.21425481140613556, 'policy/clipfrac_avg': 0.1320754736661911, 'loss/policy_avg': -0.03865007683634758, 'loss/value_avg': 0.7421281337738037, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7240007519721985, 'val/ratio': 0.9462583065032959, 'val/ratio_var': 0.001697632484138012, 'val/num_eos_tokens': 0, 'lr': 2.9103380695737388e-05, 'episode': 3416, 'epoch': 0.42}



                                                                         
 42%|████▏     | 855/2041 [1:14:10<1:42:17,  5.18s/it]

{'eps': 0, 'objective/kl': 74.8255844116211, 'objective/entropy': 85.05500030517578, 'objective/non_score_reward': -3.741279363632202, 'objective/rlhf_reward': -4.821776390075684, 'objective/scores': -1.0804970264434814, 'policy/approxkl_avg': 0.4246748089790344, 'policy/clipfrac_avg': 0.2511792480945587, 'loss/policy_avg': -0.05753964185714722, 'loss/value_avg': 0.6126350164413452, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4608306884765625, 'val/ratio': 1.0221986770629883, 'val/ratio_var': 0.00035817455500364304, 'val/num_eos_tokens': 0, 'lr': 2.907888290053895e-05, 'episode': 3420, 'epoch': 0.42}



                                                                         
 42%|████▏     | 856/2041 [1:14:15<1:42:19,  5.18s/it]

{'eps': 0, 'objective/kl': 79.68585968017578, 'objective/entropy': 94.50161743164062, 'objective/non_score_reward': -3.984292984008789, 'objective/rlhf_reward': -5.381961345672607, 'objective/scores': -1.3976683616638184, 'policy/approxkl_avg': 0.0950658917427063, 'policy/clipfrac_avg': 0.24056604504585266, 'loss/policy_avg': -0.056596539914608, 'loss/value_avg': 0.705597996711731, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.7303504943847656, 'val/ratio': 1.0594980716705322, 'val/ratio_var': 0.003141748486086726, 'val/num_eos_tokens': 0, 'lr': 2.905438510534052e-05, 'episode': 3424, 'epoch': 0.42}



                                                                         
 42%|████▏     | 857/2041 [1:14:20<1:42:10,  5.18s/it]

{'eps': 0, 'objective/kl': 82.98945617675781, 'objective/entropy': 55.891578674316406, 'objective/non_score_reward': -4.149473190307617, 'objective/rlhf_reward': -6.735237121582031, 'objective/scores': -2.585763931274414, 'policy/approxkl_avg': 0.2244509756565094, 'policy/clipfrac_avg': 0.13679245114326477, 'loss/policy_avg': -0.039710573852062225, 'loss/value_avg': 0.6892319917678833, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1497613191604614, 'val/ratio': 1.043218731880188, 'val/ratio_var': 0.00197377591393888, 'val/num_eos_tokens': 0, 'lr': 2.902988731014209e-05, 'episode': 3428, 'epoch': 0.42}



                                                                         
 42%|████▏     | 858/2041 [1:14:25<1:41:58,  5.17s/it]

{'eps': 0, 'objective/kl': 92.8291015625, 'objective/entropy': 78.83522033691406, 'objective/non_score_reward': -4.641455173492432, 'objective/rlhf_reward': -6.08736515045166, 'objective/scores': -1.4459097385406494, 'policy/approxkl_avg': 0.09779562056064606, 'policy/clipfrac_avg': 0.2075471580028534, 'loss/policy_avg': -0.05181219428777695, 'loss/value_avg': 0.6193972826004028, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3621914386749268, 'val/ratio': 1.0218169689178467, 'val/ratio_var': 0.001474178396165371, 'val/num_eos_tokens': 0, 'lr': 2.9005389514943653e-05, 'episode': 3432, 'epoch': 0.42}



                                                                         
 42%|████▏     | 859/2041 [1:14:30<1:42:04,  5.18s/it]

{'eps': 0, 'objective/kl': 68.73178100585938, 'objective/entropy': 62.1834716796875, 'objective/non_score_reward': -3.436589241027832, 'objective/rlhf_reward': -5.074897766113281, 'objective/scores': -1.6383087635040283, 'policy/approxkl_avg': 0.062437474727630615, 'policy/clipfrac_avg': 0.1591981202363968, 'loss/policy_avg': -0.0421551838517189, 'loss/value_avg': 0.4993983507156372, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.062517762184143, 'val/ratio': 1.0435125827789307, 'val/ratio_var': 0.003821565769612789, 'val/num_eos_tokens': 0, 'lr': 2.8980891719745225e-05, 'episode': 3436, 'epoch': 0.42}



                                                                         
 42%|████▏     | 860/2041 [1:14:35<1:41:26,  5.15s/it]

{'eps': 0, 'objective/kl': 88.11602020263672, 'objective/entropy': 79.40863800048828, 'objective/non_score_reward': -4.405800819396973, 'objective/rlhf_reward': -5.932154655456543, 'objective/scores': -1.5263539552688599, 'policy/approxkl_avg': 0.06660463660955429, 'policy/clipfrac_avg': 0.21226415038108826, 'loss/policy_avg': -0.04863651469349861, 'loss/value_avg': 0.7081641554832458, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3463943004608154, 'val/ratio': 1.0812318325042725, 'val/ratio_var': 0.004889648407697678, 'val/num_eos_tokens': 0, 'lr': 2.8956393924546793e-05, 'episode': 3440, 'epoch': 0.42}



                                                                         
 42%|████▏     | 861/2041 [1:14:41<1:41:29,  5.16s/it]

{'eps': 0, 'objective/kl': 85.75776672363281, 'objective/entropy': 84.95813751220703, 'objective/non_score_reward': -4.287888526916504, 'objective/rlhf_reward': -5.533318519592285, 'objective/scores': -1.2454299926757812, 'policy/approxkl_avg': 0.14109615981578827, 'policy/clipfrac_avg': 0.2087264060974121, 'loss/policy_avg': -0.04613127559423447, 'loss/value_avg': 0.6812414526939392, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5083744525909424, 'val/ratio': 1.1188302040100098, 'val/ratio_var': 0.02160792052745819, 'val/num_eos_tokens': 0, 'lr': 2.8931896129348364e-05, 'episode': 3444, 'epoch': 0.42}



                                                                         
 42%|████▏     | 862/2041 [1:14:46<1:41:51,  5.18s/it]

{'eps': 0, 'objective/kl': 55.13130187988281, 'objective/entropy': 65.44380950927734, 'objective/non_score_reward': -2.7565650939941406, 'objective/rlhf_reward': -4.963294982910156, 'objective/scores': -2.2067296504974365, 'policy/approxkl_avg': 0.04569520056247711, 'policy/clipfrac_avg': 0.13089622557163239, 'loss/policy_avg': -0.03216047212481499, 'loss/value_avg': 0.48408111929893494, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2026523351669312, 'val/ratio': 1.0461628437042236, 'val/ratio_var': 0.002667408436536789, 'val/num_eos_tokens': 0, 'lr': 2.8907398334149925e-05, 'episode': 3448, 'epoch': 0.42}



                                                                         
 42%|████▏     | 863/2041 [1:14:51<1:41:37,  5.18s/it]

{'eps': 0, 'objective/kl': 82.24398803710938, 'objective/entropy': 93.09468841552734, 'objective/non_score_reward': -4.112199783325195, 'objective/rlhf_reward': -5.390742778778076, 'objective/scores': -1.2785429954528809, 'policy/approxkl_avg': 0.24863578379154205, 'policy/clipfrac_avg': 0.20400944352149963, 'loss/policy_avg': -0.0508950799703598, 'loss/value_avg': 0.5555022954940796, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.7365059852600098, 'val/ratio': 0.9561268091201782, 'val/ratio_var': 0.0009791014017537236, 'val/num_eos_tokens': 0, 'lr': 2.8882900538951497e-05, 'episode': 3452, 'epoch': 0.42}



                                                                         
 42%|████▏     | 864/2041 [1:14:56<1:41:10,  5.16s/it]

{'eps': 0, 'objective/kl': 78.46163940429688, 'objective/entropy': 76.76927185058594, 'objective/non_score_reward': -3.9230823516845703, 'objective/rlhf_reward': -6.125236988067627, 'objective/scores': -2.2021546363830566, 'policy/approxkl_avg': 0.08541988581418991, 'policy/clipfrac_avg': 0.1804245412349701, 'loss/policy_avg': -0.04624257981777191, 'loss/value_avg': 0.735792875289917, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4749581813812256, 'val/ratio': 0.978952169418335, 'val/ratio_var': 0.00020357691391836852, 'val/num_eos_tokens': 0, 'lr': 2.8858402743753065e-05, 'episode': 3456, 'epoch': 0.42}



                                                                         
 42%|████▏     | 865/2041 [1:15:01<1:41:26,  5.18s/it]

{'eps': 0, 'objective/kl': 79.83322143554688, 'objective/entropy': 69.16770935058594, 'objective/non_score_reward': -3.991661310195923, 'objective/rlhf_reward': -5.677402496337891, 'objective/scores': -1.6857414245605469, 'policy/approxkl_avg': 0.09184809774160385, 'policy/clipfrac_avg': 0.1733490526676178, 'loss/policy_avg': -0.04688405990600586, 'loss/value_avg': 0.8437988758087158, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3686257600784302, 'val/ratio': 0.9993122220039368, 'val/ratio_var': 2.7972964744549245e-05, 'val/num_eos_tokens': 0, 'lr': 2.883390494855463e-05, 'episode': 3460, 'epoch': 0.42}



                                                                         
 42%|████▏     | 866/2041 [1:15:06<1:40:59,  5.16s/it]

{'eps': 0, 'objective/kl': 80.27718353271484, 'objective/entropy': 74.4185791015625, 'objective/non_score_reward': -4.013859272003174, 'objective/rlhf_reward': -4.899921417236328, 'objective/scores': -0.8860623836517334, 'policy/approxkl_avg': 0.13067923486232758, 'policy/clipfrac_avg': 0.18985848128795624, 'loss/policy_avg': -0.049423784017562866, 'loss/value_avg': 0.5879292488098145, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3749523162841797, 'val/ratio': 1.0270721912384033, 'val/ratio_var': 0.0012997062876820564, 'val/num_eos_tokens': 0, 'lr': 2.88094071533562e-05, 'episode': 3464, 'epoch': 0.42}



                                                                         
 42%|████▏     | 867/2041 [1:15:12<1:40:29,  5.14s/it]

{'eps': 0, 'objective/kl': 87.99891662597656, 'objective/entropy': 70.33184814453125, 'objective/non_score_reward': -4.399946212768555, 'objective/rlhf_reward': -5.836662769317627, 'objective/scores': -1.4367165565490723, 'policy/approxkl_avg': 0.438029408454895, 'policy/clipfrac_avg': 0.1521226465702057, 'loss/policy_avg': -0.04422571510076523, 'loss/value_avg': 0.6016430854797363, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2674415111541748, 'val/ratio': 1.0083776712417603, 'val/ratio_var': 3.739245221368037e-05, 'val/num_eos_tokens': 0, 'lr': 2.878490935815777e-05, 'episode': 3468, 'epoch': 0.42}



                                                                         
 43%|████▎     | 868/2041 [1:15:17<1:40:44,  5.15s/it]

{'eps': 0, 'objective/kl': 78.16787719726562, 'objective/entropy': 57.846839904785156, 'objective/non_score_reward': -3.9083938598632812, 'objective/rlhf_reward': -5.19762659072876, 'objective/scores': -1.2892327308654785, 'policy/approxkl_avg': 0.12984971702098846, 'policy/clipfrac_avg': 0.12264151871204376, 'loss/policy_avg': -0.03183354064822197, 'loss/value_avg': 0.47505784034729004, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1489332914352417, 'val/ratio': 0.9924980401992798, 'val/ratio_var': 4.121539313928224e-05, 'val/num_eos_tokens': 0, 'lr': 2.8760411562959334e-05, 'episode': 3472, 'epoch': 0.43}



                                                                         
 43%|████▎     | 869/2041 [1:15:22<1:40:46,  5.16s/it]

{'eps': 0, 'objective/kl': 66.72029113769531, 'objective/entropy': 55.15227508544922, 'objective/non_score_reward': -3.3360142707824707, 'objective/rlhf_reward': -6.045184135437012, 'objective/scores': -2.709169626235962, 'policy/approxkl_avg': 0.05287501960992813, 'policy/clipfrac_avg': 0.11320754885673523, 'loss/policy_avg': -0.0326499417424202, 'loss/value_avg': 0.584869384765625, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1249973773956299, 'val/ratio': 0.995396614074707, 'val/ratio_var': 3.6055032978765666e-05, 'val/num_eos_tokens': 0, 'lr': 2.87359137677609e-05, 'episode': 3476, 'epoch': 0.43}



                                                                         
 43%|████▎     | 870/2041 [1:15:27<1:40:41,  5.16s/it]

{'eps': 0, 'objective/kl': 74.65919494628906, 'objective/entropy': 80.96461486816406, 'objective/non_score_reward': -3.732959270477295, 'objective/rlhf_reward': -4.832512378692627, 'objective/scores': -1.099553108215332, 'policy/approxkl_avg': 0.39758598804473877, 'policy/clipfrac_avg': 0.15566037595272064, 'loss/policy_avg': -0.041622839868068695, 'loss/value_avg': 0.5565963387489319, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3774421215057373, 'val/ratio': 0.9604873657226562, 'val/ratio_var': 0.000776340311858803, 'val/num_eos_tokens': 0, 'lr': 2.8711415972562473e-05, 'episode': 3480, 'epoch': 0.43}



                                                                         
 43%|████▎     | 871/2041 [1:15:32<1:40:49,  5.17s/it]

{'eps': 0, 'objective/kl': 77.44541931152344, 'objective/entropy': 58.66443634033203, 'objective/non_score_reward': -3.8722710609436035, 'objective/rlhf_reward': -5.368721008300781, 'objective/scores': -1.4964501857757568, 'policy/approxkl_avg': 0.796863317489624, 'policy/clipfrac_avg': 0.1391509473323822, 'loss/policy_avg': -0.036583103239536285, 'loss/value_avg': 0.6079797148704529, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0182569026947021, 'val/ratio': 0.9757628440856934, 'val/ratio_var': 0.0002651878457982093, 'val/num_eos_tokens': 0, 'lr': 2.8686918177364038e-05, 'episode': 3484, 'epoch': 0.43}



                                                                         
 43%|████▎     | 872/2041 [1:15:37<1:40:51,  5.18s/it]

{'eps': 0, 'objective/kl': 58.49839401245117, 'objective/entropy': 52.54778289794922, 'objective/non_score_reward': -2.924919366836548, 'objective/rlhf_reward': -4.961625099182129, 'objective/scores': -2.036705493927002, 'policy/approxkl_avg': 0.4416216015815735, 'policy/clipfrac_avg': 0.14976416528224945, 'loss/policy_avg': -0.0379612073302269, 'loss/value_avg': 0.4698181450366974, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.046303391456604, 'val/ratio': 0.955121636390686, 'val/ratio_var': 0.0010810790117830038, 'val/num_eos_tokens': 0, 'lr': 2.8662420382165606e-05, 'episode': 3488, 'epoch': 0.43}



                                                                         
 43%|████▎     | 873/2041 [1:15:43<1:40:47,  5.18s/it]

{'eps': 0, 'objective/kl': 75.82078552246094, 'objective/entropy': 81.04450988769531, 'objective/non_score_reward': -3.791038990020752, 'objective/rlhf_reward': -5.689184188842773, 'objective/scores': -1.898145079612732, 'policy/approxkl_avg': 0.15225860476493835, 'policy/clipfrac_avg': 0.16273584961891174, 'loss/policy_avg': -0.0478607714176178, 'loss/value_avg': 0.4523806571960449, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5210826396942139, 'val/ratio': 1.02962064743042, 'val/ratio_var': 0.0007766247726976871, 'val/num_eos_tokens': 0, 'lr': 2.8637922586967174e-05, 'episode': 3492, 'epoch': 0.43}



                                                                         
 43%|████▎     | 874/2041 [1:15:48<1:41:01,  5.19s/it]

{'eps': 0, 'objective/kl': 73.78648376464844, 'objective/entropy': 81.59492492675781, 'objective/non_score_reward': -3.689324378967285, 'objective/rlhf_reward': -5.362587928771973, 'objective/scores': -1.6732637882232666, 'policy/approxkl_avg': 1.0437711477279663, 'policy/clipfrac_avg': 0.19929245114326477, 'loss/policy_avg': -0.05093357339501381, 'loss/value_avg': 0.5635619163513184, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4046635627746582, 'val/ratio': 0.9485907554626465, 'val/ratio_var': 0.0013533615274354815, 'val/num_eos_tokens': 0, 'lr': 2.8613424791768745e-05, 'episode': 3496, 'epoch': 0.43}



                                                                         
 43%|████▎     | 875/2041 [1:15:53<1:40:34,  5.18s/it]

{'eps': 0, 'objective/kl': 61.62848663330078, 'objective/entropy': 62.8626708984375, 'objective/non_score_reward': -3.0814244747161865, 'objective/rlhf_reward': -4.5839738845825195, 'objective/scores': -1.5025495290756226, 'policy/approxkl_avg': 0.2697352468967438, 'policy/clipfrac_avg': 0.13561320304870605, 'loss/policy_avg': -0.03467058762907982, 'loss/value_avg': 0.5095909237861633, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2617052793502808, 'val/ratio': 1.024669885635376, 'val/ratio_var': 0.0010620950488373637, 'val/num_eos_tokens': 0, 'lr': 2.858892699657031e-05, 'episode': 3500, 'epoch': 0.43}



                                                                         
 43%|████▎     | 876/2041 [1:15:58<1:40:00,  5.15s/it]

{'eps': 0, 'objective/kl': 78.37094116210938, 'objective/entropy': 62.872169494628906, 'objective/non_score_reward': -3.918546676635742, 'objective/rlhf_reward': -5.206195831298828, 'objective/scores': -1.2876490354537964, 'policy/approxkl_avg': 1.133074164390564, 'policy/clipfrac_avg': 0.15683962404727936, 'loss/policy_avg': -0.037489697337150574, 'loss/value_avg': 0.6665037870407104, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.194053292274475, 'val/ratio': 0.9839694499969482, 'val/ratio_var': 0.00013679053517989814, 'val/num_eos_tokens': 0, 'lr': 2.8564429201371878e-05, 'episode': 3504, 'epoch': 0.43}



                                                                         
 43%|████▎     | 877/2041 [1:16:03<1:40:07,  5.16s/it]

{'eps': 0, 'objective/kl': 74.43016815185547, 'objective/entropy': 53.85612487792969, 'objective/non_score_reward': -3.721508502960205, 'objective/rlhf_reward': -5.956684112548828, 'objective/scores': -2.235175609588623, 'policy/approxkl_avg': 0.13314956426620483, 'policy/clipfrac_avg': 0.12735849618911743, 'loss/policy_avg': -0.034706972539424896, 'loss/value_avg': 0.6548280119895935, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9935476183891296, 'val/ratio': 0.9969115853309631, 'val/ratio_var': 3.4402281016809866e-05, 'val/num_eos_tokens': 0, 'lr': 2.853993140617345e-05, 'episode': 3508, 'epoch': 0.43}



                                                                         
 43%|████▎     | 878/2041 [1:16:08<1:40:10,  5.17s/it]

{'eps': 0, 'objective/kl': 80.96985626220703, 'objective/entropy': 71.97129821777344, 'objective/non_score_reward': -4.048492908477783, 'objective/rlhf_reward': -6.174056529998779, 'objective/scores': -2.125563621520996, 'policy/approxkl_avg': 0.24112442135810852, 'policy/clipfrac_avg': 0.14033019542694092, 'loss/policy_avg': -0.0434076264500618, 'loss/value_avg': 0.6761618256568909, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5279808044433594, 'val/ratio': 0.953717827796936, 'val/ratio_var': 0.0014513982459902763, 'val/num_eos_tokens': 0, 'lr': 2.851543361097501e-05, 'episode': 3512, 'epoch': 0.43}



                                                                         
 43%|████▎     | 879/2041 [1:16:14<1:40:14,  5.18s/it]

{'eps': 0, 'objective/kl': 65.11338806152344, 'objective/entropy': 57.84392166137695, 'objective/non_score_reward': -3.255669355392456, 'objective/rlhf_reward': -5.101783752441406, 'objective/scores': -1.846114158630371, 'policy/approxkl_avg': 0.19168595969676971, 'policy/clipfrac_avg': 0.1379716843366623, 'loss/policy_avg': -0.03825072944164276, 'loss/value_avg': 0.41040241718292236, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.024322748184204, 'val/ratio': 0.9818697571754456, 'val/ratio_var': 0.0001497937337262556, 'val/num_eos_tokens': 0, 'lr': 2.8490935815776582e-05, 'episode': 3516, 'epoch': 0.43}



                                                                         
 43%|████▎     | 880/2041 [1:16:19<1:40:16,  5.18s/it]

{'eps': 0, 'objective/kl': 74.67115783691406, 'objective/entropy': 70.12721252441406, 'objective/non_score_reward': -3.733558177947998, 'objective/rlhf_reward': -6.369979381561279, 'objective/scores': -2.6364212036132812, 'policy/approxkl_avg': 0.06198021396994591, 'policy/clipfrac_avg': 0.16509434580802917, 'loss/policy_avg': -0.045307744294404984, 'loss/value_avg': 0.8087434768676758, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1806461811065674, 'val/ratio': 1.0019989013671875, 'val/ratio_var': 0.00010518001363379881, 'val/num_eos_tokens': 0, 'lr': 2.846643802057815e-05, 'episode': 3520, 'epoch': 0.43}



                                                                         
 43%|████▎     | 881/2041 [1:16:24<1:40:44,  5.21s/it]

{'eps': 0, 'objective/kl': 83.32398223876953, 'objective/entropy': 55.63902282714844, 'objective/non_score_reward': -4.166199684143066, 'objective/rlhf_reward': -5.977524757385254, 'objective/scores': -1.8113253116607666, 'policy/approxkl_avg': 0.4600169360637665, 'policy/clipfrac_avg': 0.1379716992378235, 'loss/policy_avg': -0.03783472627401352, 'loss/value_avg': 0.5966880321502686, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0973339080810547, 'val/ratio': 0.9650986194610596, 'val/ratio_var': 0.0007879685144871473, 'val/num_eos_tokens': 0, 'lr': 2.8441940225379715e-05, 'episode': 3524, 'epoch': 0.43}



                                                                         
 43%|████▎     | 882/2041 [1:16:29<1:40:26,  5.20s/it]

{'eps': 0, 'objective/kl': 84.958251953125, 'objective/entropy': 62.20298767089844, 'objective/non_score_reward': -4.247912406921387, 'objective/rlhf_reward': -6.365661144256592, 'objective/scores': -2.117748737335205, 'policy/approxkl_avg': 0.08325768262147903, 'policy/clipfrac_avg': 0.16509434580802917, 'loss/policy_avg': -0.04525567218661308, 'loss/value_avg': 0.6426373720169067, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.23106050491333, 'val/ratio': 1.0120471715927124, 'val/ratio_var': 0.00045948103070259094, 'val/num_eos_tokens': 0, 'lr': 2.8417442430181286e-05, 'episode': 3528, 'epoch': 0.43}



                                                                         
 43%|████▎     | 883/2041 [1:16:34<1:40:03,  5.18s/it]

{'eps': 0, 'objective/kl': 73.31549835205078, 'objective/entropy': 57.23094177246094, 'objective/non_score_reward': -3.6657750606536865, 'objective/rlhf_reward': -6.060575485229492, 'objective/scores': -2.3948001861572266, 'policy/approxkl_avg': 0.21297889947891235, 'policy/clipfrac_avg': 0.15448112785816193, 'loss/policy_avg': -0.040366679430007935, 'loss/value_avg': 0.7184522747993469, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.064373254776001, 'val/ratio': 1.1488633155822754, 'val/ratio_var': 0.04485543444752693, 'val/num_eos_tokens': 0, 'lr': 2.8392944634982854e-05, 'episode': 3532, 'epoch': 0.43}



                                                                         
 43%|████▎     | 884/2041 [1:16:40<1:40:14,  5.20s/it]

{'eps': 0, 'objective/kl': 80.99994659423828, 'objective/entropy': 55.33641815185547, 'objective/non_score_reward': -4.049997329711914, 'objective/rlhf_reward': -6.4055705070495605, 'objective/scores': -2.3555731773376465, 'policy/approxkl_avg': 0.059840861707925797, 'policy/clipfrac_avg': 0.14386792480945587, 'loss/policy_avg': -0.03812599182128906, 'loss/value_avg': 0.8318727016448975, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1905038356781006, 'val/ratio': 0.9776318073272705, 'val/ratio_var': 0.0004620864347089082, 'val/num_eos_tokens': 0, 'lr': 2.836844683978442e-05, 'episode': 3536, 'epoch': 0.43}



                                                                         
 43%|████▎     | 885/2041 [1:16:45<1:40:12,  5.20s/it]

{'eps': 0, 'objective/kl': 82.57302856445312, 'objective/entropy': 57.277915954589844, 'objective/non_score_reward': -4.1286516189575195, 'objective/rlhf_reward': -6.353793144226074, 'objective/scores': -2.225141763687134, 'policy/approxkl_avg': 0.40297797322273254, 'policy/clipfrac_avg': 0.1320754736661911, 'loss/policy_avg': -0.03207646310329437, 'loss/value_avg': 0.7164342403411865, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1316156387329102, 'val/ratio': 1.0153748989105225, 'val/ratio_var': 0.0005405729752965271, 'val/num_eos_tokens': 0, 'lr': 2.8343949044585987e-05, 'episode': 3540, 'epoch': 0.43}



                                                                         
 43%|████▎     | 886/2041 [1:16:50<1:39:48,  5.19s/it]

{'eps': 0, 'objective/kl': 97.52845764160156, 'objective/entropy': 72.74429321289062, 'objective/non_score_reward': -4.87642240524292, 'objective/rlhf_reward': -7.105777740478516, 'objective/scores': -2.2293550968170166, 'policy/approxkl_avg': 1.0038832426071167, 'policy/clipfrac_avg': 0.18396227061748505, 'loss/policy_avg': -0.04441850259900093, 'loss/value_avg': 0.8188114166259766, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4304430484771729, 'val/ratio': 1.0060255527496338, 'val/ratio_var': 0.00010998186917277053, 'val/num_eos_tokens': 0, 'lr': 2.831945124938756e-05, 'episode': 3544, 'epoch': 0.43}



                                                                         
 43%|████▎     | 887/2041 [1:16:55<1:39:43,  5.19s/it]

{'eps': 0, 'objective/kl': 76.33775329589844, 'objective/entropy': 68.99830627441406, 'objective/non_score_reward': -3.8168880939483643, 'objective/rlhf_reward': -6.565305709838867, 'objective/scores': -2.748417615890503, 'policy/approxkl_avg': 0.10356087982654572, 'policy/clipfrac_avg': 0.16863207519054413, 'loss/policy_avg': -0.04715526103973389, 'loss/value_avg': 0.8346585035324097, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.18669855594635, 'val/ratio': 1.1783453226089478, 'val/ratio_var': 0.028987348079681396, 'val/num_eos_tokens': 0, 'lr': 2.8294953454189126e-05, 'episode': 3548, 'epoch': 0.43}



                                                                         
 44%|████▎     | 888/2041 [1:17:00<1:39:58,  5.20s/it]

{'eps': 0, 'objective/kl': 97.9560317993164, 'objective/entropy': 63.69046401977539, 'objective/non_score_reward': -4.897801399230957, 'objective/rlhf_reward': -6.596749782562256, 'objective/scores': -1.6989483833312988, 'policy/approxkl_avg': 0.2436118870973587, 'policy/clipfrac_avg': 0.18985849618911743, 'loss/policy_avg': -0.050840869545936584, 'loss/value_avg': 0.7561942338943481, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2681455612182617, 'val/ratio': 1.4301533699035645, 'val/ratio_var': 0.25597235560417175, 'val/num_eos_tokens': 0, 'lr': 2.827045565899069e-05, 'episode': 3552, 'epoch': 0.44}



                                                                         
 44%|████▎     | 889/2041 [1:17:06<1:39:47,  5.20s/it]

{'eps': 0, 'objective/kl': 109.96443176269531, 'objective/entropy': 62.350059509277344, 'objective/non_score_reward': -5.498221397399902, 'objective/rlhf_reward': -8.634746551513672, 'objective/scores': -3.1365246772766113, 'policy/approxkl_avg': 1.244264006614685, 'policy/clipfrac_avg': 0.15094339847564697, 'loss/policy_avg': -0.045405931770801544, 'loss/value_avg': 1.5002081394195557, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1450695991516113, 'val/ratio': 1.0303722620010376, 'val/ratio_var': 0.0012629524571821094, 'val/num_eos_tokens': 0, 'lr': 2.824595786379226e-05, 'episode': 3556, 'epoch': 0.44}



                                                                         
 44%|████▎     | 890/2041 [1:17:11<1:39:27,  5.18s/it]

{'eps': 0, 'objective/kl': 107.7191162109375, 'objective/entropy': 45.60924530029297, 'objective/non_score_reward': -5.385955810546875, 'objective/rlhf_reward': -8.381393432617188, 'objective/scores': -2.9954376220703125, 'policy/approxkl_avg': 0.038485269993543625, 'policy/clipfrac_avg': 0.14268867671489716, 'loss/policy_avg': -0.03940262645483017, 'loss/value_avg': 1.0501375198364258, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.022796869277954, 'val/ratio': 1.0420424938201904, 'val/ratio_var': 0.0014934297651052475, 'val/num_eos_tokens': 0, 'lr': 2.822146006859383e-05, 'episode': 3560, 'epoch': 0.44}



                                                                         
 44%|████▎     | 891/2041 [1:17:16<1:39:06,  5.17s/it]

{'eps': 0, 'objective/kl': 99.03775024414062, 'objective/entropy': 66.48285675048828, 'objective/non_score_reward': -4.951888084411621, 'objective/rlhf_reward': -6.946561336517334, 'objective/scores': -1.994673252105713, 'policy/approxkl_avg': 0.2810734808444977, 'policy/clipfrac_avg': 0.17216981947422028, 'loss/policy_avg': -0.03892632946372032, 'loss/value_avg': 0.8017470836639404, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1632161140441895, 'val/ratio': 0.9897039532661438, 'val/ratio_var': 6.337998638628051e-05, 'val/num_eos_tokens': 0, 'lr': 2.8196962273395395e-05, 'episode': 3564, 'epoch': 0.44}



                                                                         
 44%|████▎     | 892/2041 [1:17:21<1:39:27,  5.19s/it]

{'eps': 0, 'objective/kl': 113.14959716796875, 'objective/entropy': 65.14169311523438, 'objective/non_score_reward': -5.657479763031006, 'objective/rlhf_reward': -7.594141006469727, 'objective/scores': -1.9366610050201416, 'policy/approxkl_avg': 0.14599007368087769, 'policy/clipfrac_avg': 0.15801885724067688, 'loss/policy_avg': -0.040025386959314346, 'loss/value_avg': 1.1702632904052734, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1196625232696533, 'val/ratio': 0.9742119312286377, 'val/ratio_var': 0.0003160172200296074, 'val/num_eos_tokens': 0, 'lr': 2.8172464478196963e-05, 'episode': 3568, 'epoch': 0.44}



                                                                         
 44%|████▍     | 893/2041 [1:17:26<1:39:26,  5.20s/it]

{'eps': 0, 'objective/kl': 111.2606201171875, 'objective/entropy': 68.33860778808594, 'objective/non_score_reward': -5.563031196594238, 'objective/rlhf_reward': -7.936306953430176, 'objective/scores': -2.3732757568359375, 'policy/approxkl_avg': 0.32194557785987854, 'policy/clipfrac_avg': 0.14268869161605835, 'loss/policy_avg': -0.033861592411994934, 'loss/value_avg': 1.0932949781417847, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2426121234893799, 'val/ratio': 0.9648705124855042, 'val/ratio_var': 0.0006001657457090914, 'val/num_eos_tokens': 0, 'lr': 2.8147966682998535e-05, 'episode': 3572, 'epoch': 0.44}



                                                                         
 44%|████▍     | 894/2041 [1:17:32<1:39:21,  5.20s/it]

{'eps': 0, 'objective/kl': 95.07026672363281, 'objective/entropy': 44.47172164916992, 'objective/non_score_reward': -4.753512859344482, 'objective/rlhf_reward': -7.448955535888672, 'objective/scores': -2.6954426765441895, 'policy/approxkl_avg': 0.08717291057109833, 'policy/clipfrac_avg': 0.12028302252292633, 'loss/policy_avg': -0.03354362025856972, 'loss/value_avg': 0.9448114633560181, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9675703644752502, 'val/ratio': 0.9844921231269836, 'val/ratio_var': 0.00011608690692810342, 'val/num_eos_tokens': 0, 'lr': 2.8123468887800096e-05, 'episode': 3576, 'epoch': 0.44}



                                                                         
 44%|████▍     | 895/2041 [1:17:37<1:39:28,  5.21s/it]

{'eps': 0, 'objective/kl': 93.95452117919922, 'objective/entropy': 53.71209716796875, 'objective/non_score_reward': -4.697726249694824, 'objective/rlhf_reward': -7.836322784423828, 'objective/scores': -3.138596773147583, 'policy/approxkl_avg': 0.6113544702529907, 'policy/clipfrac_avg': 0.12028302252292633, 'loss/policy_avg': -0.030449051409959793, 'loss/value_avg': 1.2075374126434326, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9596163034439087, 'val/ratio': 0.9987735152244568, 'val/ratio_var': 1.132460602093488e-05, 'val/num_eos_tokens': 0, 'lr': 2.8098971092601667e-05, 'episode': 3580, 'epoch': 0.44}



                                                                         
 44%|████▍     | 896/2041 [1:17:42<1:39:24,  5.21s/it]

{'eps': 0, 'objective/kl': 113.73615264892578, 'objective/entropy': 54.427894592285156, 'objective/non_score_reward': -5.686808109283447, 'objective/rlhf_reward': -7.727542877197266, 'objective/scores': -2.0407347679138184, 'policy/approxkl_avg': 0.5314676761627197, 'policy/clipfrac_avg': 0.18278302252292633, 'loss/policy_avg': -0.045832790434360504, 'loss/value_avg': 1.0287971496582031, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1482875347137451, 'val/ratio': 0.9972637891769409, 'val/ratio_var': 0.00039614076376892626, 'val/num_eos_tokens': 0, 'lr': 2.8074473297403235e-05, 'episode': 3584, 'epoch': 0.44}



                                                                         
 44%|████▍     | 897/2041 [1:17:47<1:39:03,  5.20s/it]

{'eps': 0, 'objective/kl': 110.47076416015625, 'objective/entropy': 70.29713439941406, 'objective/non_score_reward': -5.523538589477539, 'objective/rlhf_reward': -7.794247150421143, 'objective/scores': -2.2707085609436035, 'policy/approxkl_avg': 0.42480194568634033, 'policy/clipfrac_avg': 0.1450471729040146, 'loss/policy_avg': -0.041741374880075455, 'loss/value_avg': 1.1027202606201172, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2895917892456055, 'val/ratio': 1.038573980331421, 'val/ratio_var': 0.000783360272180289, 'val/num_eos_tokens': 0, 'lr': 2.80499755022048e-05, 'episode': 3588, 'epoch': 0.44}



                                                                         
 44%|████▍     | 898/2041 [1:17:52<1:38:37,  5.18s/it]

{'eps': 0, 'objective/kl': 95.9135971069336, 'objective/entropy': 73.45765686035156, 'objective/non_score_reward': -4.795680046081543, 'objective/rlhf_reward': -6.996156692504883, 'objective/scores': -2.2004764080047607, 'policy/approxkl_avg': 0.042548373341560364, 'policy/clipfrac_avg': 0.17806604504585266, 'loss/policy_avg': -0.04299265891313553, 'loss/value_avg': 0.6961522698402405, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4783172607421875, 'val/ratio': 0.9957873821258545, 'val/ratio_var': 1.450422132620588e-05, 'val/num_eos_tokens': 0, 'lr': 2.802547770700637e-05, 'episode': 3592, 'epoch': 0.44}



                                                                         
 44%|████▍     | 899/2041 [1:17:58<1:38:24,  5.17s/it]

{'eps': 0, 'objective/kl': 99.21609497070312, 'objective/entropy': 72.72559356689453, 'objective/non_score_reward': -4.9608049392700195, 'objective/rlhf_reward': -7.52078914642334, 'objective/scores': -2.5599842071533203, 'policy/approxkl_avg': 0.1648029237985611, 'policy/clipfrac_avg': 0.16391509771347046, 'loss/policy_avg': -0.04911945015192032, 'loss/value_avg': 0.9293528199195862, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3040343523025513, 'val/ratio': 1.009355068206787, 'val/ratio_var': 0.0005835756310261786, 'val/num_eos_tokens': 0, 'lr': 2.800097991180794e-05, 'episode': 3596, 'epoch': 0.44}



                                                                         
 44%|████▍     | 900/2041 [1:18:03<1:38:13,  5.17s/it]

{'eps': 0, 'objective/kl': 99.64131164550781, 'objective/entropy': 84.29065704345703, 'objective/non_score_reward': -4.9820661544799805, 'objective/rlhf_reward': -6.135843753814697, 'objective/scores': -1.1537774801254272, 'policy/approxkl_avg': 0.13644284009933472, 'policy/clipfrac_avg': 0.14740566909313202, 'loss/policy_avg': -0.04442790150642395, 'loss/value_avg': 0.6418603658676147, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4735647439956665, 'val/ratio': 0.9919717311859131, 'val/ratio_var': 7.696577813476324e-05, 'val/num_eos_tokens': 0, 'lr': 2.7976482116609508e-05, 'episode': 3600, 'epoch': 0.44}



                                                                         
 44%|████▍     | 901/2041 [1:18:08<1:37:56,  5.15s/it]

{'eps': 0, 'objective/kl': 92.2061767578125, 'objective/entropy': 57.75617599487305, 'objective/non_score_reward': -4.610308647155762, 'objective/rlhf_reward': -6.846632957458496, 'objective/scores': -2.2363245487213135, 'policy/approxkl_avg': 0.2119773030281067, 'policy/clipfrac_avg': 0.12735848128795624, 'loss/policy_avg': -0.036838170140981674, 'loss/value_avg': 0.8557723760604858, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.117783546447754, 'val/ratio': 0.9693104028701782, 'val/ratio_var': 0.000551715842448175, 'val/num_eos_tokens': 0, 'lr': 2.7951984321411072e-05, 'episode': 3604, 'epoch': 0.44}



                                                                         
 44%|████▍     | 902/2041 [1:18:13<1:37:35,  5.14s/it]

{'eps': 0, 'objective/kl': 100.31342315673828, 'objective/entropy': 61.03984069824219, 'objective/non_score_reward': -5.0156707763671875, 'objective/rlhf_reward': -7.273038864135742, 'objective/scores': -2.2573680877685547, 'policy/approxkl_avg': 0.05770430713891983, 'policy/clipfrac_avg': 0.1179245263338089, 'loss/policy_avg': -0.034136395901441574, 'loss/value_avg': 0.8426721692085266, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.221750020980835, 'val/ratio': 0.9736396670341492, 'val/ratio_var': 0.0004481944488361478, 'val/num_eos_tokens': 0, 'lr': 2.7927486526212644e-05, 'episode': 3608, 'epoch': 0.44}



                                                                         
 44%|████▍     | 903/2041 [1:18:18<1:37:50,  5.16s/it]

{'eps': 0, 'objective/kl': 89.35189056396484, 'objective/entropy': 78.29470825195312, 'objective/non_score_reward': -4.467594623565674, 'objective/rlhf_reward': -6.903018951416016, 'objective/scores': -2.435424327850342, 'policy/approxkl_avg': 0.09196008741855621, 'policy/clipfrac_avg': 0.11438679695129395, 'loss/policy_avg': -0.0325951874256134, 'loss/value_avg': 0.6720041036605835, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3534525632858276, 'val/ratio': 0.9929550886154175, 'val/ratio_var': 2.8474241844378412e-05, 'val/num_eos_tokens': 0, 'lr': 2.7902988731014212e-05, 'episode': 3612, 'epoch': 0.44}



                                                                         
 44%|████▍     | 904/2041 [1:18:23<1:38:27,  5.20s/it]

{'eps': 0, 'objective/kl': 108.70626831054688, 'objective/entropy': 64.95265197753906, 'objective/non_score_reward': -5.435313701629639, 'objective/rlhf_reward': -6.30995512008667, 'objective/scores': -0.8746415972709656, 'policy/approxkl_avg': 0.6456131935119629, 'policy/clipfrac_avg': 0.11910377442836761, 'loss/policy_avg': -0.03439873829483986, 'loss/value_avg': 0.6788687109947205, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2014154195785522, 'val/ratio': 0.9986944198608398, 'val/ratio_var': 6.415913230739534e-05, 'val/num_eos_tokens': 0, 'lr': 2.7878490935815776e-05, 'episode': 3616, 'epoch': 0.44}



                                                                         
 44%|████▍     | 905/2041 [1:18:29<1:38:29,  5.20s/it]

{'eps': 0, 'objective/kl': 107.83119201660156, 'objective/entropy': 71.43659973144531, 'objective/non_score_reward': -5.391560077667236, 'objective/rlhf_reward': -7.1023406982421875, 'objective/scores': -1.7107806205749512, 'policy/approxkl_avg': 0.04531749337911606, 'policy/clipfrac_avg': 0.14976415038108826, 'loss/policy_avg': -0.04934421926736832, 'loss/value_avg': 0.8880753517150879, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2941944599151611, 'val/ratio': 1.020456075668335, 'val/ratio_var': 0.00029789056861773133, 'val/num_eos_tokens': 0, 'lr': 2.7853993140617344e-05, 'episode': 3620, 'epoch': 0.44}



                                                                         
 44%|████▍     | 906/2041 [1:18:34<1:38:16,  5.19s/it]

{'eps': 0, 'objective/kl': 106.09446716308594, 'objective/entropy': 71.26399230957031, 'objective/non_score_reward': -5.304723262786865, 'objective/rlhf_reward': -6.991606712341309, 'objective/scores': -1.6868832111358643, 'policy/approxkl_avg': 0.10426435619592667, 'policy/clipfrac_avg': 0.1320754736661911, 'loss/policy_avg': -0.0339554063975811, 'loss/value_avg': 0.7386935353279114, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1690583229064941, 'val/ratio': 0.9811900854110718, 'val/ratio_var': 0.00016405688074883074, 'val/num_eos_tokens': 0, 'lr': 2.7829495345418916e-05, 'episode': 3624, 'epoch': 0.44}



                                                                         
 44%|████▍     | 907/2041 [1:18:39<1:38:09,  5.19s/it]

{'eps': 0, 'objective/kl': 131.01422119140625, 'objective/entropy': 62.2995719909668, 'objective/non_score_reward': -6.550711631774902, 'objective/rlhf_reward': -7.397911548614502, 'objective/scores': -0.8471999168395996, 'policy/approxkl_avg': 0.08748380839824677, 'policy/clipfrac_avg': 0.18160377442836761, 'loss/policy_avg': -0.04604768007993698, 'loss/value_avg': 0.889327883720398, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1002790927886963, 'val/ratio': 0.9688125848770142, 'val/ratio_var': 0.0005577881238423288, 'val/num_eos_tokens': 0, 'lr': 2.780499755022048e-05, 'episode': 3628, 'epoch': 0.44}



                                                                         
 44%|████▍     | 908/2041 [1:18:44<1:37:58,  5.19s/it]

{'eps': 0, 'objective/kl': 126.10041046142578, 'objective/entropy': 62.552364349365234, 'objective/non_score_reward': -6.305021286010742, 'objective/rlhf_reward': -7.240905284881592, 'objective/scores': -0.9358840584754944, 'policy/approxkl_avg': 1.91102933883667, 'policy/clipfrac_avg': 0.12971699237823486, 'loss/policy_avg': -0.03964472562074661, 'loss/value_avg': 0.8101019859313965, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.112668752670288, 'val/ratio': 0.9738844037055969, 'val/ratio_var': 0.0003472051175776869, 'val/num_eos_tokens': 0, 'lr': 2.778049975502205e-05, 'episode': 3632, 'epoch': 0.44}



                                                                         
 45%|████▍     | 909/2041 [1:18:49<1:37:30,  5.17s/it]

{'eps': 0, 'objective/kl': 101.28533935546875, 'objective/entropy': 76.1920166015625, 'objective/non_score_reward': -5.064267158508301, 'objective/rlhf_reward': -5.291193008422852, 'objective/scores': -0.22692596912384033, 'policy/approxkl_avg': 0.051641419529914856, 'policy/clipfrac_avg': 0.1320754736661911, 'loss/policy_avg': -0.03784084692597389, 'loss/value_avg': 0.40464597940444946, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3678920269012451, 'val/ratio': 1.0452854633331299, 'val/ratio_var': 0.002994789741933346, 'val/num_eos_tokens': 0, 'lr': 2.775600195982362e-05, 'episode': 3636, 'epoch': 0.45}



                                                                         
 45%|████▍     | 910/2041 [1:18:54<1:37:17,  5.16s/it]

{'eps': 0, 'objective/kl': 96.5292739868164, 'objective/entropy': 66.21871948242188, 'objective/non_score_reward': -4.82646369934082, 'objective/rlhf_reward': -5.7832417488098145, 'objective/scores': -0.9567779898643494, 'policy/approxkl_avg': 0.0717100277543068, 'policy/clipfrac_avg': 0.14033019542694092, 'loss/policy_avg': -0.04125621169805527, 'loss/value_avg': 0.707854688167572, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2748727798461914, 'val/ratio': 0.9712138175964355, 'val/ratio_var': 0.00043741727131418884, 'val/num_eos_tokens': 0, 'lr': 2.773150416462518e-05, 'episode': 3640, 'epoch': 0.45}



                                                                         
 45%|████▍     | 911/2041 [1:19:00<1:37:28,  5.18s/it]

{'eps': 0, 'objective/kl': 104.52810668945312, 'objective/entropy': 73.40211486816406, 'objective/non_score_reward': -5.226405620574951, 'objective/rlhf_reward': -6.384022235870361, 'objective/scores': -1.1576164960861206, 'policy/approxkl_avg': 0.033829621970653534, 'policy/clipfrac_avg': 0.1391509473323822, 'loss/policy_avg': -0.04198485240340233, 'loss/value_avg': 0.616611123085022, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.298009991645813, 'val/ratio': 0.9936947822570801, 'val/ratio_var': 3.10209252347704e-05, 'val/num_eos_tokens': 0, 'lr': 2.7707006369426753e-05, 'episode': 3644, 'epoch': 0.45}



                                                                         
 45%|████▍     | 912/2041 [1:19:05<1:37:32,  5.18s/it]

{'eps': 0, 'objective/kl': 109.83631134033203, 'objective/entropy': 73.02915954589844, 'objective/non_score_reward': -5.491815567016602, 'objective/rlhf_reward': -7.633575439453125, 'objective/scores': -2.1417598724365234, 'policy/approxkl_avg': 0.13679036498069763, 'policy/clipfrac_avg': 0.17216980457305908, 'loss/policy_avg': -0.04250309616327286, 'loss/value_avg': 0.7353742122650146, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2895489931106567, 'val/ratio': 0.988353431224823, 'val/ratio_var': 0.0001095325787900947, 'val/num_eos_tokens': 0, 'lr': 2.768250857422832e-05, 'episode': 3648, 'epoch': 0.45}



                                                                         
 45%|████▍     | 913/2041 [1:19:10<1:38:11,  5.22s/it]

{'eps': 0, 'objective/kl': 99.46034240722656, 'objective/entropy': 58.22069549560547, 'objective/non_score_reward': -4.973017692565918, 'objective/rlhf_reward': -5.794033050537109, 'objective/scores': -0.8210153579711914, 'policy/approxkl_avg': 0.06444623321294785, 'policy/clipfrac_avg': 0.13561320304870605, 'loss/policy_avg': -0.04425564780831337, 'loss/value_avg': 0.5707107186317444, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0930794477462769, 'val/ratio': 0.9966598749160767, 'val/ratio_var': 3.614498200477101e-05, 'val/num_eos_tokens': 0, 'lr': 2.7658010779029885e-05, 'episode': 3652, 'epoch': 0.45}



                                                                         
 45%|████▍     | 914/2041 [1:19:15<1:38:09,  5.23s/it]

{'eps': 0, 'objective/kl': 96.43132019042969, 'objective/entropy': 51.798622131347656, 'objective/non_score_reward': -4.821566104888916, 'objective/rlhf_reward': -6.546771049499512, 'objective/scores': -1.7252051830291748, 'policy/approxkl_avg': 0.3962556719779968, 'policy/clipfrac_avg': 0.12853772938251495, 'loss/policy_avg': -0.038159579038619995, 'loss/value_avg': 0.472227543592453, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9882792830467224, 'val/ratio': 0.9787996411323547, 'val/ratio_var': 0.0002402015234110877, 'val/num_eos_tokens': 0, 'lr': 2.7633512983831457e-05, 'episode': 3656, 'epoch': 0.45}



                                                                         
 45%|████▍     | 915/2041 [1:19:21<1:37:53,  5.22s/it]

{'eps': 0, 'objective/kl': 104.81898498535156, 'objective/entropy': 40.67713165283203, 'objective/non_score_reward': -5.240949630737305, 'objective/rlhf_reward': -5.879997253417969, 'objective/scores': -0.639047384262085, 'policy/approxkl_avg': 0.0881224274635315, 'policy/clipfrac_avg': 0.14858490228652954, 'loss/policy_avg': -0.03488421067595482, 'loss/value_avg': 0.38925135135650635, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8600407242774963, 'val/ratio': 1.0331649780273438, 'val/ratio_var': 0.0027836535591632128, 'val/num_eos_tokens': 0, 'lr': 2.7609015188633025e-05, 'episode': 3660, 'epoch': 0.45}



                                                                         
 45%|████▍     | 916/2041 [1:19:26<1:37:36,  5.21s/it]

{'eps': 0, 'objective/kl': 101.62879943847656, 'objective/entropy': 46.26386260986328, 'objective/non_score_reward': -5.081439971923828, 'objective/rlhf_reward': -6.070383071899414, 'objective/scores': -0.9889430999755859, 'policy/approxkl_avg': 0.06262058764696121, 'policy/clipfrac_avg': 0.10849056392908096, 'loss/policy_avg': -0.028952745720744133, 'loss/value_avg': 0.6671596169471741, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8398470282554626, 'val/ratio': 0.957525908946991, 'val/ratio_var': 0.0012098115403205156, 'val/num_eos_tokens': 0, 'lr': 2.7584517393434596e-05, 'episode': 3664, 'epoch': 0.45}



                                                                         
 45%|████▍     | 917/2041 [1:19:31<1:37:17,  5.19s/it]

{'eps': 0, 'objective/kl': 104.96084594726562, 'objective/entropy': 57.3291015625, 'objective/non_score_reward': -5.248042106628418, 'objective/rlhf_reward': -6.376277923583984, 'objective/scores': -1.1282358169555664, 'policy/approxkl_avg': 0.04010250046849251, 'policy/clipfrac_avg': 0.12382075190544128, 'loss/policy_avg': -0.03898978233337402, 'loss/value_avg': 0.6231828927993774, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0625686645507812, 'val/ratio': 1.014878511428833, 'val/ratio_var': 0.00022117665503174067, 'val/num_eos_tokens': 0, 'lr': 2.7560019598236158e-05, 'episode': 3668, 'epoch': 0.45}



                                                                         
 45%|████▍     | 918/2041 [1:19:36<1:36:58,  5.18s/it]

{'eps': 0, 'objective/kl': 98.12599182128906, 'objective/entropy': 43.84201431274414, 'objective/non_score_reward': -4.906299591064453, 'objective/rlhf_reward': -6.672072410583496, 'objective/scores': -1.7657725811004639, 'policy/approxkl_avg': 0.04798955097794533, 'policy/clipfrac_avg': 0.10495282709598541, 'loss/policy_avg': -0.031999800354242325, 'loss/value_avg': 0.5177568793296814, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8703405857086182, 'val/ratio': 0.9754987359046936, 'val/ratio_var': 0.0004367501533124596, 'val/num_eos_tokens': 0, 'lr': 2.753552180303773e-05, 'episode': 3672, 'epoch': 0.45}



                                                                         
 45%|████▌     | 919/2041 [1:19:41<1:36:54,  5.18s/it]

{'eps': 0, 'objective/kl': 107.13395690917969, 'objective/entropy': 70.74234771728516, 'objective/non_score_reward': -5.356698036193848, 'objective/rlhf_reward': -6.680421829223633, 'objective/scores': -1.3237239122390747, 'policy/approxkl_avg': 0.047828130424022675, 'policy/clipfrac_avg': 0.14268867671489716, 'loss/policy_avg': -0.04068705439567566, 'loss/value_avg': 0.5714811086654663, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4010471105575562, 'val/ratio': 0.9541850090026855, 'val/ratio_var': 0.0015432792715728283, 'val/num_eos_tokens': 0, 'lr': 2.7511024007839297e-05, 'episode': 3676, 'epoch': 0.45}



                                                                         
 45%|████▌     | 920/2041 [1:19:46<1:37:00,  5.19s/it]

{'eps': 0, 'objective/kl': 105.49705505371094, 'objective/entropy': 50.174644470214844, 'objective/non_score_reward': -5.274852752685547, 'objective/rlhf_reward': -5.572863578796387, 'objective/scores': -0.29801082611083984, 'policy/approxkl_avg': 0.052964597940444946, 'policy/clipfrac_avg': 0.13443395495414734, 'loss/policy_avg': -0.033665094524621964, 'loss/value_avg': 0.4263862371444702, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9055147767066956, 'val/ratio': 0.9536323547363281, 'val/ratio_var': 0.0013346992200240493, 'val/num_eos_tokens': 0, 'lr': 2.7486526212640862e-05, 'episode': 3680, 'epoch': 0.45}



                                                                         
 45%|████▌     | 921/2041 [1:19:52<1:36:41,  5.18s/it]

{'eps': 0, 'objective/kl': 129.08447265625, 'objective/entropy': 53.517723083496094, 'objective/non_score_reward': -6.454224109649658, 'objective/rlhf_reward': -6.898514747619629, 'objective/scores': -0.44429072737693787, 'policy/approxkl_avg': 0.4589214324951172, 'policy/clipfrac_avg': 0.14150944352149963, 'loss/policy_avg': -0.03721238672733307, 'loss/value_avg': 0.7396379709243774, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0276073217391968, 'val/ratio': 1.045844554901123, 'val/ratio_var': 0.0009581025806255639, 'val/num_eos_tokens': 0, 'lr': 2.746202841744243e-05, 'episode': 3684, 'epoch': 0.45}



                                                                         
 45%|████▌     | 922/2041 [1:19:57<1:36:42,  5.19s/it]

{'eps': 0, 'objective/kl': 108.5181884765625, 'objective/entropy': 53.81813049316406, 'objective/non_score_reward': -5.425909042358398, 'objective/rlhf_reward': -5.999814510345459, 'objective/scores': -0.5739054083824158, 'policy/approxkl_avg': 0.060622718185186386, 'policy/clipfrac_avg': 0.13089622557163239, 'loss/policy_avg': -0.03742991387844086, 'loss/value_avg': 0.5118785500526428, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.005548119544983, 'val/ratio': 0.9533160924911499, 'val/ratio_var': 0.0013972133165225387, 'val/num_eos_tokens': 0, 'lr': 2.7437530622244e-05, 'episode': 3688, 'epoch': 0.45}



                                                                         
 45%|████▌     | 923/2041 [1:20:02<1:36:28,  5.18s/it]

{'eps': 0, 'objective/kl': 105.48184204101562, 'objective/entropy': 52.022762298583984, 'objective/non_score_reward': -5.274092674255371, 'objective/rlhf_reward': -6.205214023590088, 'objective/scores': -0.9311214685440063, 'policy/approxkl_avg': 0.23378147184848785, 'policy/clipfrac_avg': 0.12028302252292633, 'loss/policy_avg': -0.03312746807932854, 'loss/value_avg': 0.8059960603713989, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.968216061592102, 'val/ratio': 0.9755162000656128, 'val/ratio_var': 0.0003007808409165591, 'val/num_eos_tokens': 0, 'lr': 2.7413032827045566e-05, 'episode': 3692, 'epoch': 0.45}



                                                                         
 45%|████▌     | 924/2041 [1:20:07<1:36:31,  5.19s/it]

{'eps': 0, 'objective/kl': 118.72547149658203, 'objective/entropy': 46.603271484375, 'objective/non_score_reward': -5.936273574829102, 'objective/rlhf_reward': -6.939542770385742, 'objective/scores': -1.0032689571380615, 'policy/approxkl_avg': 0.126080721616745, 'policy/clipfrac_avg': 0.10495283454656601, 'loss/policy_avg': -0.02971445955336094, 'loss/value_avg': 0.6067498326301575, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9471379518508911, 'val/ratio': 0.9787715673446655, 'val/ratio_var': 0.00023637227423023432, 'val/num_eos_tokens': 0, 'lr': 2.7388535031847134e-05, 'episode': 3696, 'epoch': 0.45}



                                                                         
 45%|████▌     | 925/2041 [1:20:12<1:35:54,  5.16s/it]

{'eps': 0, 'objective/kl': 104.04878234863281, 'objective/entropy': 35.41396713256836, 'objective/non_score_reward': -5.202439308166504, 'objective/rlhf_reward': -6.830904483795166, 'objective/scores': -1.6284652948379517, 'policy/approxkl_avg': 0.0832514539361, 'policy/clipfrac_avg': 0.11556603759527206, 'loss/policy_avg': -0.0340174064040184, 'loss/value_avg': 0.6210969686508179, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9625769853591919, 'val/ratio': 0.9592716097831726, 'val/ratio_var': 0.0011357308831065893, 'val/num_eos_tokens': 0, 'lr': 2.7364037236648705e-05, 'episode': 3700, 'epoch': 0.45}



                                                                         
 45%|████▌     | 926/2041 [1:20:17<1:35:56,  5.16s/it]

{'eps': 0, 'objective/kl': 87.12539672851562, 'objective/entropy': 66.5427017211914, 'objective/non_score_reward': -4.356269836425781, 'objective/rlhf_reward': -5.785992622375488, 'objective/scores': -1.429722785949707, 'policy/approxkl_avg': 0.048400893807411194, 'policy/clipfrac_avg': 0.18985849618911743, 'loss/policy_avg': -0.04738321155309677, 'loss/value_avg': 0.4498172104358673, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2589342594146729, 'val/ratio': 0.9276707172393799, 'val/ratio_var': 0.003616209840402007, 'val/num_eos_tokens': 0, 'lr': 2.7339539441450267e-05, 'episode': 3704, 'epoch': 0.45}



                                                                         
 45%|████▌     | 927/2041 [1:20:23<1:35:57,  5.17s/it]

{'eps': 0, 'objective/kl': 109.70797729492188, 'objective/entropy': 57.951141357421875, 'objective/non_score_reward': -5.48539924621582, 'objective/rlhf_reward': -7.2203497886657715, 'objective/scores': -1.7349506616592407, 'policy/approxkl_avg': 0.08748006075620651, 'policy/clipfrac_avg': 0.15801887214183807, 'loss/policy_avg': -0.04335867241024971, 'loss/value_avg': 0.735682487487793, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0832823514938354, 'val/ratio': 1.0652344226837158, 'val/ratio_var': 0.003848570166155696, 'val/num_eos_tokens': 0, 'lr': 2.7315041646251838e-05, 'episode': 3708, 'epoch': 0.45}



                                                                         
 45%|████▌     | 928/2041 [1:20:28<1:35:32,  5.15s/it]

{'eps': 0, 'objective/kl': 118.82810974121094, 'objective/entropy': 39.164276123046875, 'objective/non_score_reward': -5.94140625, 'objective/rlhf_reward': -6.682064056396484, 'objective/scores': -0.7406576871871948, 'policy/approxkl_avg': 0.02169021964073181, 'policy/clipfrac_avg': 0.10495282709598541, 'loss/policy_avg': -0.028635257855057716, 'loss/value_avg': 0.580528974533081, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9870323538780212, 'val/ratio': 0.9661499857902527, 'val/ratio_var': 0.0009396179229952395, 'val/num_eos_tokens': 0, 'lr': 2.7290543851053406e-05, 'episode': 3712, 'epoch': 0.45}



                                                                         
 46%|████▌     | 929/2041 [1:20:33<1:35:46,  5.17s/it]

{'eps': 0, 'objective/kl': 112.88247680664062, 'objective/entropy': 55.809471130371094, 'objective/non_score_reward': -5.6441240310668945, 'objective/rlhf_reward': -6.180559158325195, 'objective/scores': -0.5364348888397217, 'policy/approxkl_avg': 0.06009277328848839, 'policy/clipfrac_avg': 0.14268867671489716, 'loss/policy_avg': -0.038429901003837585, 'loss/value_avg': 0.42917317152023315, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1516685485839844, 'val/ratio': 0.9782218933105469, 'val/ratio_var': 0.00023931592295411974, 'val/num_eos_tokens': 0, 'lr': 2.7266046055854977e-05, 'episode': 3716, 'epoch': 0.46}



                                                                         
 46%|████▌     | 930/2041 [1:20:38<1:35:31,  5.16s/it]

{'eps': 0, 'objective/kl': 108.36936950683594, 'objective/entropy': 52.53683090209961, 'objective/non_score_reward': -5.418468475341797, 'objective/rlhf_reward': -6.445662021636963, 'objective/scores': -1.0271934270858765, 'policy/approxkl_avg': 0.0488736554980278, 'policy/clipfrac_avg': 0.1450471729040146, 'loss/policy_avg': -0.03722047805786133, 'loss/value_avg': 0.5696327090263367, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9306328296661377, 'val/ratio': 0.9521603584289551, 'val/ratio_var': 0.0015060490695759654, 'val/num_eos_tokens': 0, 'lr': 2.7241548260656542e-05, 'episode': 3720, 'epoch': 0.46}



                                                                         
 46%|████▌     | 931/2041 [1:20:43<1:35:03,  5.14s/it]

{'eps': 0, 'objective/kl': 108.71966552734375, 'objective/entropy': 46.777008056640625, 'objective/non_score_reward': -5.435983657836914, 'objective/rlhf_reward': -6.720880508422852, 'objective/scores': -1.284896969795227, 'policy/approxkl_avg': 0.4844188690185547, 'policy/clipfrac_avg': 0.11202830076217651, 'loss/policy_avg': -0.032358862459659576, 'loss/value_avg': 0.6850854754447937, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7968856692314148, 'val/ratio': 0.9605361819267273, 'val/ratio_var': 0.0008718899334780872, 'val/num_eos_tokens': 0, 'lr': 2.721705046545811e-05, 'episode': 3724, 'epoch': 0.46}



                                                                         
 46%|████▌     | 932/2041 [1:20:48<1:35:12,  5.15s/it]

{'eps': 0, 'objective/kl': 84.58876037597656, 'objective/entropy': 47.32264709472656, 'objective/non_score_reward': -4.229437828063965, 'objective/rlhf_reward': -6.1525187492370605, 'objective/scores': -1.9230809211730957, 'policy/approxkl_avg': 0.29050374031066895, 'policy/clipfrac_avg': 0.12853772938251495, 'loss/policy_avg': -0.039050739258527756, 'loss/value_avg': 0.540103554725647, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8634402751922607, 'val/ratio': 0.9801129698753357, 'val/ratio_var': 0.0004427713283803314, 'val/num_eos_tokens': 0, 'lr': 2.719255267025968e-05, 'episode': 3728, 'epoch': 0.46}



                                                                         
 46%|████▌     | 933/2041 [1:20:53<1:34:56,  5.14s/it]

{'eps': 0, 'objective/kl': 102.37274932861328, 'objective/entropy': 58.6107177734375, 'objective/non_score_reward': -5.1186370849609375, 'objective/rlhf_reward': -6.339983940124512, 'objective/scores': -1.2213470935821533, 'policy/approxkl_avg': 0.03951612487435341, 'policy/clipfrac_avg': 0.16391509771347046, 'loss/policy_avg': -0.044675372540950775, 'loss/value_avg': 0.5804402828216553, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1639091968536377, 'val/ratio': 1.001879096031189, 'val/ratio_var': 0.0001411727862432599, 'val/num_eos_tokens': 0, 'lr': 2.7168054875061243e-05, 'episode': 3732, 'epoch': 0.46}



                                                                         
 46%|████▌     | 934/2041 [1:20:59<1:35:07,  5.16s/it]

{'eps': 0, 'objective/kl': 95.53886413574219, 'objective/entropy': 62.46781539916992, 'objective/non_score_reward': -4.776942729949951, 'objective/rlhf_reward': -6.186343193054199, 'objective/scores': -1.409400224685669, 'policy/approxkl_avg': 0.15269835293293, 'policy/clipfrac_avg': 0.13443396985530853, 'loss/policy_avg': -0.0371704176068306, 'loss/value_avg': 0.49925434589385986, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9880620241165161, 'val/ratio': 0.9751706719398499, 'val/ratio_var': 0.0003004171885550022, 'val/num_eos_tokens': 0, 'lr': 2.7143557079862814e-05, 'episode': 3736, 'epoch': 0.46}



                                                                         
 46%|████▌     | 935/2041 [1:21:04<1:34:44,  5.14s/it]

{'eps': 0, 'objective/kl': 105.5157241821289, 'objective/entropy': 55.28236770629883, 'objective/non_score_reward': -5.275786399841309, 'objective/rlhf_reward': -5.668335437774658, 'objective/scores': -0.3925490379333496, 'policy/approxkl_avg': 0.055546436458826065, 'policy/clipfrac_avg': 0.1745283007621765, 'loss/policy_avg': -0.044616468250751495, 'loss/value_avg': 0.508646547794342, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1251440048217773, 'val/ratio': 0.9840090870857239, 'val/ratio_var': 0.00015827559400349855, 'val/num_eos_tokens': 0, 'lr': 2.7119059284664382e-05, 'episode': 3740, 'epoch': 0.46}



                                                                         
 46%|████▌     | 936/2041 [1:21:09<1:34:09,  5.11s/it]

{'eps': 0, 'objective/kl': 100.00164794921875, 'objective/entropy': 55.84698486328125, 'objective/non_score_reward': -5.000082015991211, 'objective/rlhf_reward': -7.190875053405762, 'objective/scores': -2.190793037414551, 'policy/approxkl_avg': 0.05483080819249153, 'policy/clipfrac_avg': 0.1533018797636032, 'loss/policy_avg': -0.03665198013186455, 'loss/value_avg': 0.6610111594200134, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0656288862228394, 'val/ratio': 1.0655171871185303, 'val/ratio_var': 0.0027240384370088577, 'val/num_eos_tokens': 0, 'lr': 2.7094561489465947e-05, 'episode': 3744, 'epoch': 0.46}



                                                                         
 46%|████▌     | 937/2041 [1:21:14<1:33:56,  5.11s/it]

{'eps': 0, 'objective/kl': 94.14688873291016, 'objective/entropy': 46.88127517700195, 'objective/non_score_reward': -4.7073445320129395, 'objective/rlhf_reward': -7.031766891479492, 'objective/scores': -2.3244221210479736, 'policy/approxkl_avg': 0.029621988534927368, 'policy/clipfrac_avg': 0.11910377442836761, 'loss/policy_avg': -0.03553250432014465, 'loss/value_avg': 0.7099400758743286, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8943766951560974, 'val/ratio': 0.9649529457092285, 'val/ratio_var': 0.0008236877038143575, 'val/num_eos_tokens': 0, 'lr': 2.707006369426752e-05, 'episode': 3748, 'epoch': 0.46}



                                                                         
 46%|████▌     | 938/2041 [1:21:19<1:34:14,  5.13s/it]

{'eps': 0, 'objective/kl': 105.42903137207031, 'objective/entropy': 79.87455749511719, 'objective/non_score_reward': -5.271451950073242, 'objective/rlhf_reward': -6.2856245040893555, 'objective/scores': -1.0141725540161133, 'policy/approxkl_avg': 0.2011876404285431, 'policy/clipfrac_avg': 0.21816039085388184, 'loss/policy_avg': -0.053376100957393646, 'loss/value_avg': 0.5838654041290283, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.547194004058838, 'val/ratio': 0.961232602596283, 'val/ratio_var': 0.0007561221136711538, 'val/num_eos_tokens': 0, 'lr': 2.7045565899069086e-05, 'episode': 3752, 'epoch': 0.46}



                                                                         
 46%|████▌     | 939/2041 [1:21:24<1:34:06,  5.12s/it]

{'eps': 0, 'objective/kl': 99.33915710449219, 'objective/entropy': 65.35407257080078, 'objective/non_score_reward': -4.966958045959473, 'objective/rlhf_reward': -6.3008317947387695, 'objective/scores': -1.3338735103607178, 'policy/approxkl_avg': 0.16875821352005005, 'policy/clipfrac_avg': 0.16863209009170532, 'loss/policy_avg': -0.0409458689391613, 'loss/value_avg': 0.4918808043003082, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2282588481903076, 'val/ratio': 0.9335377216339111, 'val/ratio_var': 0.0027689833659678698, 'val/num_eos_tokens': 0, 'lr': 2.702106810387065e-05, 'episode': 3756, 'epoch': 0.46}



                                                                         
 46%|████▌     | 940/2041 [1:21:29<1:34:15,  5.14s/it]

{'eps': 0, 'objective/kl': 97.82627868652344, 'objective/entropy': 61.41309356689453, 'objective/non_score_reward': -4.891314506530762, 'objective/rlhf_reward': -5.53055477142334, 'objective/scores': -0.6392402052879333, 'policy/approxkl_avg': 0.08753377944231033, 'policy/clipfrac_avg': 0.1603773534297943, 'loss/policy_avg': -0.04354642704129219, 'loss/value_avg': 0.2967069149017334, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1304670572280884, 'val/ratio': 0.965376615524292, 'val/ratio_var': 0.0006940548191778362, 'val/num_eos_tokens': 0, 'lr': 2.699657030867222e-05, 'episode': 3760, 'epoch': 0.46}



                                                                         
 46%|████▌     | 941/2041 [1:21:35<1:34:10,  5.14s/it]

{'eps': 0, 'objective/kl': 111.92578887939453, 'objective/entropy': 44.44685745239258, 'objective/non_score_reward': -5.59628963470459, 'objective/rlhf_reward': -7.085939407348633, 'objective/scores': -1.489649772644043, 'policy/approxkl_avg': 0.048852358013391495, 'policy/clipfrac_avg': 0.12971699237823486, 'loss/policy_avg': -0.03291185200214386, 'loss/value_avg': 0.5637094378471375, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8947882056236267, 'val/ratio': 0.9618335962295532, 'val/ratio_var': 0.0008775923051871359, 'val/num_eos_tokens': 0, 'lr': 2.697207251347379e-05, 'episode': 3764, 'epoch': 0.46}



                                                                         
 46%|████▌     | 942/2041 [1:21:40<1:33:57,  5.13s/it]

{'eps': 0, 'objective/kl': 104.7088623046875, 'objective/entropy': 69.01139068603516, 'objective/non_score_reward': -5.235443115234375, 'objective/rlhf_reward': -6.660565376281738, 'objective/scores': -1.4251221418380737, 'policy/approxkl_avg': 0.061439383774995804, 'policy/clipfrac_avg': 0.15801887214183807, 'loss/policy_avg': -0.04177038371562958, 'loss/value_avg': 0.4892970323562622, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.385792851448059, 'val/ratio': 0.9461291432380676, 'val/ratio_var': 0.0019283428555354476, 'val/num_eos_tokens': 0, 'lr': 2.694757471827536e-05, 'episode': 3768, 'epoch': 0.46}



                                                                         
 46%|████▌     | 943/2041 [1:21:45<1:33:39,  5.12s/it]

{'eps': 0, 'objective/kl': 96.470458984375, 'objective/entropy': 62.826507568359375, 'objective/non_score_reward': -4.823523044586182, 'objective/rlhf_reward': -6.264307498931885, 'objective/scores': -1.4407844543457031, 'policy/approxkl_avg': 0.12312638759613037, 'policy/clipfrac_avg': 0.14976415038108826, 'loss/policy_avg': -0.04153925180435181, 'loss/value_avg': 0.48279041051864624, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2167963981628418, 'val/ratio': 1.0075831413269043, 'val/ratio_var': 0.0002799722133204341, 'val/num_eos_tokens': 0, 'lr': 2.6923076923076923e-05, 'episode': 3772, 'epoch': 0.46}



                                                                         
 46%|████▋     | 944/2041 [1:21:50<1:33:31,  5.12s/it]

{'eps': 0, 'objective/kl': 137.30300903320312, 'objective/entropy': 76.11126708984375, 'objective/non_score_reward': -6.865150451660156, 'objective/rlhf_reward': -8.8251953125, 'objective/scores': -1.9600443840026855, 'policy/approxkl_avg': 0.23731352388858795, 'policy/clipfrac_avg': 0.1875, 'loss/policy_avg': -0.04859751835465431, 'loss/value_avg': 1.3881335258483887, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3737056255340576, 'val/ratio': 0.9412602782249451, 'val/ratio_var': 0.0017361072823405266, 'val/num_eos_tokens': 0, 'lr': 2.689857912787849e-05, 'episode': 3776, 'epoch': 0.46}



                                                                         
 46%|████▋     | 945/2041 [1:21:55<1:33:51,  5.14s/it]

{'eps': 0, 'objective/kl': 94.53909301757812, 'objective/entropy': 70.60626220703125, 'objective/non_score_reward': -4.726954460144043, 'objective/rlhf_reward': -6.696579933166504, 'objective/scores': -1.9696255922317505, 'policy/approxkl_avg': 0.09915491938591003, 'policy/clipfrac_avg': 0.16863207519054413, 'loss/policy_avg': -0.035882286727428436, 'loss/value_avg': 0.6024951338768005, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0347458124160767, 'val/ratio': 0.9600592851638794, 'val/ratio_var': 0.0008603644091635942, 'val/num_eos_tokens': 0, 'lr': 2.6874081332680063e-05, 'episode': 3780, 'epoch': 0.46}



                                                                         
 46%|████▋     | 946/2041 [1:22:00<1:34:19,  5.17s/it]

{'eps': 0, 'objective/kl': 101.21531677246094, 'objective/entropy': 50.9935188293457, 'objective/non_score_reward': -5.060765743255615, 'objective/rlhf_reward': -6.36661434173584, 'objective/scores': -1.3058483600616455, 'policy/approxkl_avg': 0.06517687439918518, 'policy/clipfrac_avg': 0.11320754140615463, 'loss/policy_avg': -0.032457880675792694, 'loss/value_avg': 0.5958566665649414, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.050260066986084, 'val/ratio': 0.9645004868507385, 'val/ratio_var': 0.0008458808879368007, 'val/num_eos_tokens': 0, 'lr': 2.6849583537481627e-05, 'episode': 3784, 'epoch': 0.46}



                                                                         
 46%|████▋     | 947/2041 [1:22:05<1:34:10,  5.17s/it]

{'eps': 0, 'objective/kl': 108.25714874267578, 'objective/entropy': 48.813941955566406, 'objective/non_score_reward': -5.4128570556640625, 'objective/rlhf_reward': -6.744213104248047, 'objective/scores': -1.3313560485839844, 'policy/approxkl_avg': 0.057530585676431656, 'policy/clipfrac_avg': 0.125, 'loss/policy_avg': -0.03241656720638275, 'loss/value_avg': 0.49166339635849, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.029160737991333, 'val/ratio': 0.9704447984695435, 'val/ratio_var': 0.0006069337832741439, 'val/num_eos_tokens': 0, 'lr': 2.6825085742283195e-05, 'episode': 3788, 'epoch': 0.46}



                                                                         
 46%|████▋     | 948/2041 [1:22:11<1:34:03,  5.16s/it]

{'eps': 0, 'objective/kl': 114.844970703125, 'objective/entropy': 48.98847198486328, 'objective/non_score_reward': -5.74224853515625, 'objective/rlhf_reward': -6.673890113830566, 'objective/scores': -0.9316415786743164, 'policy/approxkl_avg': 0.022509820759296417, 'policy/clipfrac_avg': 0.1179245263338089, 'loss/policy_avg': -0.0353497713804245, 'loss/value_avg': 0.5648679733276367, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1161344051361084, 'val/ratio': 0.9621261358261108, 'val/ratio_var': 0.0010391356190666556, 'val/num_eos_tokens': 0, 'lr': 2.6800587947084767e-05, 'episode': 3792, 'epoch': 0.46}



                                                                         
 46%|████▋     | 949/2041 [1:22:16<1:33:57,  5.16s/it]

{'eps': 0, 'objective/kl': 127.94574737548828, 'objective/entropy': 66.88939666748047, 'objective/non_score_reward': -6.397287368774414, 'objective/rlhf_reward': -8.498069763183594, 'objective/scores': -2.1007823944091797, 'policy/approxkl_avg': 0.10750465095043182, 'policy/clipfrac_avg': 0.16981132328510284, 'loss/policy_avg': -0.04680883139371872, 'loss/value_avg': 1.134649395942688, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2797119617462158, 'val/ratio': 1.1313064098358154, 'val/ratio_var': 0.013910460285842419, 'val/num_eos_tokens': 0, 'lr': 2.6776090151886328e-05, 'episode': 3796, 'epoch': 0.47}



                                                                         
 47%|████▋     | 950/2041 [1:22:21<1:33:57,  5.17s/it]

{'eps': 0, 'objective/kl': 125.8370361328125, 'objective/entropy': 73.74081420898438, 'objective/non_score_reward': -6.291851997375488, 'objective/rlhf_reward': -7.6076154708862305, 'objective/scores': -1.3157637119293213, 'policy/approxkl_avg': 0.2950047254562378, 'policy/clipfrac_avg': 0.15094339847564697, 'loss/policy_avg': -0.04437229037284851, 'loss/value_avg': 0.6855692863464355, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2613712549209595, 'val/ratio': 1.054484248161316, 'val/ratio_var': 0.0019185113487765193, 'val/num_eos_tokens': 0, 'lr': 2.67515923566879e-05, 'episode': 3800, 'epoch': 0.47}



                                                                         
 47%|████▋     | 951/2041 [1:22:26<1:33:54,  5.17s/it]

{'eps': 0, 'objective/kl': 131.5360870361328, 'objective/entropy': 49.6133918762207, 'objective/non_score_reward': -6.5768046379089355, 'objective/rlhf_reward': -7.6039958000183105, 'objective/scores': -1.0271912813186646, 'policy/approxkl_avg': 0.038806233555078506, 'policy/clipfrac_avg': 0.12028302252292633, 'loss/policy_avg': -0.03549947962164879, 'loss/value_avg': 0.856079638004303, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.101837158203125, 'val/ratio': 0.9775882959365845, 'val/ratio_var': 0.00028529195697046816, 'val/num_eos_tokens': 0, 'lr': 2.6727094561489468e-05, 'episode': 3804, 'epoch': 0.47}



                                                                         
 47%|████▋     | 952/2041 [1:22:31<1:33:34,  5.16s/it]

{'eps': 0, 'objective/kl': 117.78594970703125, 'objective/entropy': 78.39422607421875, 'objective/non_score_reward': -5.889297962188721, 'objective/rlhf_reward': -7.4027628898620605, 'objective/scores': -1.5134649276733398, 'policy/approxkl_avg': 0.036066170781850815, 'policy/clipfrac_avg': 0.1591981202363968, 'loss/policy_avg': -0.042510151863098145, 'loss/value_avg': 0.700175404548645, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2573132514953613, 'val/ratio': 0.9688913822174072, 'val/ratio_var': 0.0006551788537763059, 'val/num_eos_tokens': 0, 'lr': 2.6702596766291032e-05, 'episode': 3808, 'epoch': 0.47}



                                                                         
 47%|████▋     | 953/2041 [1:22:36<1:33:19,  5.15s/it]

{'eps': 0, 'objective/kl': 104.06520080566406, 'objective/entropy': 77.36053466796875, 'objective/non_score_reward': -5.2032599449157715, 'objective/rlhf_reward': -6.55839204788208, 'objective/scores': -1.3551321029663086, 'policy/approxkl_avg': 0.03701788932085037, 'policy/clipfrac_avg': 0.16509434580802917, 'loss/policy_avg': -0.046190325170755386, 'loss/value_avg': 0.5626626014709473, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5074669122695923, 'val/ratio': 0.9896754026412964, 'val/ratio_var': 5.68125797144603e-05, 'val/num_eos_tokens': 0, 'lr': 2.6678098971092604e-05, 'episode': 3812, 'epoch': 0.47}



                                                                         
 47%|████▋     | 954/2041 [1:22:41<1:33:04,  5.14s/it]

{'eps': 0, 'objective/kl': 136.028564453125, 'objective/entropy': 79.4274673461914, 'objective/non_score_reward': -6.801427841186523, 'objective/rlhf_reward': -7.765053749084473, 'objective/scores': -0.9636260271072388, 'policy/approxkl_avg': 0.12219039350748062, 'policy/clipfrac_avg': 0.17806604504585266, 'loss/policy_avg': -0.047769997268915176, 'loss/value_avg': 0.7656664848327637, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.43879234790802, 'val/ratio': 1.9843418598175049, 'val/ratio_var': 1.752808928489685, 'val/num_eos_tokens': 0, 'lr': 2.6653601175894172e-05, 'episode': 3816, 'epoch': 0.47}



                                                                         
 47%|████▋     | 955/2041 [1:22:47<1:33:31,  5.17s/it]

{'eps': 0, 'objective/kl': 90.53993225097656, 'objective/entropy': 62.68272018432617, 'objective/non_score_reward': -4.526996612548828, 'objective/rlhf_reward': -5.857807159423828, 'objective/scores': -1.330810308456421, 'policy/approxkl_avg': 0.024011196568608284, 'policy/clipfrac_avg': 0.12735848128795624, 'loss/policy_avg': -0.03431260958313942, 'loss/value_avg': 0.4282901883125305, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1525366306304932, 'val/ratio': 0.962759256362915, 'val/ratio_var': 0.0009684961405582726, 'val/num_eos_tokens': 0, 'lr': 2.662910338069574e-05, 'episode': 3820, 'epoch': 0.47}



                                                                         
 47%|████▋     | 956/2041 [1:22:52<1:33:04,  5.15s/it]

{'eps': 0, 'objective/kl': 98.10978698730469, 'objective/entropy': 70.04724884033203, 'objective/non_score_reward': -4.905488967895508, 'objective/rlhf_reward': -7.001365661621094, 'objective/scores': -2.095876455307007, 'policy/approxkl_avg': 0.044706493616104126, 'policy/clipfrac_avg': 0.139150932431221, 'loss/policy_avg': -0.03867674991488457, 'loss/value_avg': 0.5916118621826172, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2637134790420532, 'val/ratio': 1.0049848556518555, 'val/ratio_var': 0.000169005521456711, 'val/num_eos_tokens': 0, 'lr': 2.6604605585497305e-05, 'episode': 3824, 'epoch': 0.47}



                                                                         
 47%|████▋     | 957/2041 [1:22:57<1:32:57,  5.15s/it]

{'eps': 0, 'objective/kl': 110.62094116210938, 'objective/entropy': 75.2490234375, 'objective/non_score_reward': -5.531047344207764, 'objective/rlhf_reward': -6.147483825683594, 'objective/scores': -0.6164363622665405, 'policy/approxkl_avg': 0.03961913287639618, 'policy/clipfrac_avg': 0.14976415038108826, 'loss/policy_avg': -0.04147816449403763, 'loss/value_avg': 0.4115488529205322, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4183804988861084, 'val/ratio': 1.0000324249267578, 'val/ratio_var': 5.2086241339566186e-05, 'val/num_eos_tokens': 0, 'lr': 2.6580107790298876e-05, 'episode': 3828, 'epoch': 0.47}



                                                                         
 47%|████▋     | 958/2041 [1:23:02<1:32:57,  5.15s/it]

{'eps': 0, 'objective/kl': 110.61392974853516, 'objective/entropy': 94.36485290527344, 'objective/non_score_reward': -5.530696868896484, 'objective/rlhf_reward': -7.236081600189209, 'objective/scores': -1.705384612083435, 'policy/approxkl_avg': 0.06426476687192917, 'policy/clipfrac_avg': 0.17099055647850037, 'loss/policy_avg': -0.047454703599214554, 'loss/value_avg': 0.9236744046211243, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.6617016792297363, 'val/ratio': 0.9773129224777222, 'val/ratio_var': 0.00027754242182709277, 'val/num_eos_tokens': 0, 'lr': 2.6555609995100444e-05, 'episode': 3832, 'epoch': 0.47}



                                                                         
 47%|████▋     | 959/2041 [1:23:07<1:32:39,  5.14s/it]

{'eps': 0, 'objective/kl': 92.10655212402344, 'objective/entropy': 66.5067138671875, 'objective/non_score_reward': -4.605327606201172, 'objective/rlhf_reward': -6.6880998611450195, 'objective/scores': -2.0827722549438477, 'policy/approxkl_avg': 0.8979259133338928, 'policy/clipfrac_avg': 0.1733490526676178, 'loss/policy_avg': -0.04662406072020531, 'loss/value_avg': 0.46822619438171387, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3673795461654663, 'val/ratio': 1.0241622924804688, 'val/ratio_var': 0.0007196340593509376, 'val/num_eos_tokens': 0, 'lr': 2.653111219990201e-05, 'episode': 3836, 'epoch': 0.47}



                                                                         
 47%|████▋     | 960/2041 [1:23:12<1:32:30,  5.13s/it]

{'eps': 0, 'objective/kl': 99.61601257324219, 'objective/entropy': 74.72467041015625, 'objective/non_score_reward': -4.980801105499268, 'objective/rlhf_reward': -7.20820951461792, 'objective/scores': -2.2274084091186523, 'policy/approxkl_avg': 0.039160214364528656, 'policy/clipfrac_avg': 0.1745283007621765, 'loss/policy_avg': -0.04294813796877861, 'loss/value_avg': 0.9087470173835754, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.408337950706482, 'val/ratio': 0.9448952674865723, 'val/ratio_var': 0.002162234392017126, 'val/num_eos_tokens': 0, 'lr': 2.6506614404703577e-05, 'episode': 3840, 'epoch': 0.47}



                                                                         
 47%|████▋     | 961/2041 [1:23:18<1:32:41,  5.15s/it]

{'eps': 0, 'objective/kl': 106.82008361816406, 'objective/entropy': 105.55039978027344, 'objective/non_score_reward': -5.341004371643066, 'objective/rlhf_reward': -7.286654949188232, 'objective/scores': -1.945650577545166, 'policy/approxkl_avg': 0.03355143219232559, 'policy/clipfrac_avg': 0.1733490526676178, 'loss/policy_avg': -0.045071791857481, 'loss/value_avg': 0.5468394756317139, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.9510226249694824, 'val/ratio': 0.951533317565918, 'val/ratio_var': 0.0016650729812681675, 'val/num_eos_tokens': 0, 'lr': 2.6482116609505148e-05, 'episode': 3844, 'epoch': 0.47}



                                                                         
 47%|████▋     | 962/2041 [1:23:23<1:32:16,  5.13s/it]

{'eps': 0, 'objective/kl': 101.77206420898438, 'objective/entropy': 91.52787017822266, 'objective/non_score_reward': -5.0886030197143555, 'objective/rlhf_reward': -6.616206169128418, 'objective/scores': -1.527603268623352, 'policy/approxkl_avg': 0.128767728805542, 'policy/clipfrac_avg': 0.19339624047279358, 'loss/policy_avg': -0.05302976071834564, 'loss/value_avg': 0.6080932021141052, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.6341558694839478, 'val/ratio': 1.0209393501281738, 'val/ratio_var': 0.0004698796255979687, 'val/num_eos_tokens': 0, 'lr': 2.6457618814306713e-05, 'episode': 3848, 'epoch': 0.47}



                                                                         
 47%|████▋     | 963/2041 [1:23:28<1:32:13,  5.13s/it]

{'eps': 0, 'objective/kl': 105.85389709472656, 'objective/entropy': 101.64962768554688, 'objective/non_score_reward': -5.292695045471191, 'objective/rlhf_reward': -6.727367401123047, 'objective/scores': -1.4346723556518555, 'policy/approxkl_avg': 0.0651925727725029, 'policy/clipfrac_avg': 0.13325472176074982, 'loss/policy_avg': -0.03886425867676735, 'loss/value_avg': 0.5123186111450195, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.8413689136505127, 'val/ratio': 0.9752250909805298, 'val/ratio_var': 0.0005744284717366099, 'val/num_eos_tokens': 0, 'lr': 2.643312101910828e-05, 'episode': 3852, 'epoch': 0.47}



                                                                         
 47%|████▋     | 964/2041 [1:23:33<1:31:55,  5.12s/it]

{'eps': 0, 'objective/kl': 82.82032775878906, 'objective/entropy': 53.21772003173828, 'objective/non_score_reward': -4.141016960144043, 'objective/rlhf_reward': -6.536642074584961, 'objective/scores': -2.395625352859497, 'policy/approxkl_avg': 0.021808253601193428, 'policy/clipfrac_avg': 0.13325472176074982, 'loss/policy_avg': -0.03835825249552727, 'loss/value_avg': 0.6659618616104126, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1120522022247314, 'val/ratio': 0.9671292304992676, 'val/ratio_var': 0.000799630128312856, 'val/num_eos_tokens': 0, 'lr': 2.6408623223909852e-05, 'episode': 3856, 'epoch': 0.47}



                                                                         
 47%|████▋     | 965/2041 [1:23:38<1:32:07,  5.14s/it]

{'eps': 0, 'objective/kl': 96.2559814453125, 'objective/entropy': 67.51783752441406, 'objective/non_score_reward': -4.812799453735352, 'objective/rlhf_reward': -6.8610358238220215, 'objective/scores': -2.04823637008667, 'policy/approxkl_avg': 0.04332423955202103, 'policy/clipfrac_avg': 0.12971697747707367, 'loss/policy_avg': -0.04083084315061569, 'loss/value_avg': 0.6515418887138367, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3004419803619385, 'val/ratio': 1.0362944602966309, 'val/ratio_var': 0.0013583885738626122, 'val/num_eos_tokens': 0, 'lr': 2.6384125428711414e-05, 'episode': 3860, 'epoch': 0.47}



                                                                         
 47%|████▋     | 966/2041 [1:23:43<1:32:06,  5.14s/it]

{'eps': 0, 'objective/kl': 94.51212310791016, 'objective/entropy': 68.06422424316406, 'objective/non_score_reward': -4.7256059646606445, 'objective/rlhf_reward': -7.256023406982422, 'objective/scores': -2.5304176807403564, 'policy/approxkl_avg': 0.0692964717745781, 'policy/clipfrac_avg': 0.16273584961891174, 'loss/policy_avg': -0.049105070531368256, 'loss/value_avg': 0.9134975075721741, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2859318256378174, 'val/ratio': 0.9851601123809814, 'val/ratio_var': 0.00010325998300686479, 'val/num_eos_tokens': 0, 'lr': 2.6359627633512985e-05, 'episode': 3864, 'epoch': 0.47}



                                                                         
 47%|████▋     | 967/2041 [1:23:48<1:32:22,  5.16s/it]

{'eps': 0, 'objective/kl': 73.37956237792969, 'objective/entropy': 49.417991638183594, 'objective/non_score_reward': -3.668977737426758, 'objective/rlhf_reward': -6.031639099121094, 'objective/scores': -2.362661361694336, 'policy/approxkl_avg': 0.04050760343670845, 'policy/clipfrac_avg': 0.1320754736661911, 'loss/policy_avg': -0.039075180888175964, 'loss/value_avg': 0.3779151737689972, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0407860279083252, 'val/ratio': 0.9754331111907959, 'val/ratio_var': 0.0004105794650968164, 'val/num_eos_tokens': 0, 'lr': 2.6335129838314553e-05, 'episode': 3868, 'epoch': 0.47}



                                                                         
 47%|████▋     | 968/2041 [1:23:54<1:32:14,  5.16s/it]

{'eps': 0, 'objective/kl': 118.62913513183594, 'objective/entropy': 105.11805725097656, 'objective/non_score_reward': -5.931456565856934, 'objective/rlhf_reward': -8.305810928344727, 'objective/scores': -2.374354362487793, 'policy/approxkl_avg': 0.05627933144569397, 'policy/clipfrac_avg': 0.17806603014469147, 'loss/policy_avg': -0.045586053282022476, 'loss/value_avg': 1.0182262659072876, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 2.024304151535034, 'val/ratio': 0.9514161944389343, 'val/ratio_var': 0.0015069771325215697, 'val/num_eos_tokens': 0, 'lr': 2.6310632043116124e-05, 'episode': 3872, 'epoch': 0.47}



                                                                         
 47%|████▋     | 969/2041 [1:23:59<1:31:38,  5.13s/it]

{'eps': 0, 'objective/kl': 62.647281646728516, 'objective/entropy': 65.82672119140625, 'objective/non_score_reward': -3.132364273071289, 'objective/rlhf_reward': -4.865562915802002, 'objective/scores': -1.7331985235214233, 'policy/approxkl_avg': 0.0422099269926548, 'policy/clipfrac_avg': 0.14150942862033844, 'loss/policy_avg': -0.036972738802433014, 'loss/value_avg': 0.36168819665908813, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.165205717086792, 'val/ratio': 0.977438747882843, 'val/ratio_var': 0.0002674473507795483, 'val/num_eos_tokens': 0, 'lr': 2.628613424791769e-05, 'episode': 3876, 'epoch': 0.47}



                                                                         
 48%|████▊     | 970/2041 [1:24:04<1:31:32,  5.13s/it]

{'eps': 0, 'objective/kl': 95.77165222167969, 'objective/entropy': 50.045204162597656, 'objective/non_score_reward': -4.788582801818848, 'objective/rlhf_reward': -7.117067337036133, 'objective/scores': -2.328484296798706, 'policy/approxkl_avg': 0.056664127856492996, 'policy/clipfrac_avg': 0.11438679695129395, 'loss/policy_avg': -0.03356068953871727, 'loss/value_avg': 0.5724627375602722, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9333672523498535, 'val/ratio': 0.9832826256752014, 'val/ratio_var': 0.00014694256242364645, 'val/num_eos_tokens': 0, 'lr': 2.6261636452719257e-05, 'episode': 3880, 'epoch': 0.48}



                                                                         
 48%|████▊     | 971/2041 [1:24:09<1:31:46,  5.15s/it]

{'eps': 0, 'objective/kl': 89.96348571777344, 'objective/entropy': 72.46910858154297, 'objective/non_score_reward': -4.498174667358398, 'objective/rlhf_reward': -5.4829254150390625, 'objective/scores': -0.9847509860992432, 'policy/approxkl_avg': 0.11730677634477615, 'policy/clipfrac_avg': 0.16155660152435303, 'loss/policy_avg': -0.045303985476493835, 'loss/value_avg': 0.385084331035614, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3341193199157715, 'val/ratio': 0.9775611162185669, 'val/ratio_var': 0.00028588101849891245, 'val/num_eos_tokens': 0, 'lr': 2.6237138657520825e-05, 'episode': 3884, 'epoch': 0.48}



                                                                         
 48%|████▊     | 972/2041 [1:24:14<1:31:52,  5.16s/it]

{'eps': 0, 'objective/kl': 76.71302795410156, 'objective/entropy': 55.57438278198242, 'objective/non_score_reward': -3.835651397705078, 'objective/rlhf_reward': -6.27488899230957, 'objective/scores': -2.439237594604492, 'policy/approxkl_avg': 0.04012040048837662, 'policy/clipfrac_avg': 0.14858491718769073, 'loss/policy_avg': -0.04321138933300972, 'loss/value_avg': 0.36919790506362915, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0822279453277588, 'val/ratio': 0.974352240562439, 'val/ratio_var': 0.00042306992691010237, 'val/num_eos_tokens': 0, 'lr': 2.621264086232239e-05, 'episode': 3888, 'epoch': 0.48}



                                                                         
 48%|████▊     | 973/2041 [1:24:19<1:31:53,  5.16s/it]

{'eps': 0, 'objective/kl': 80.36392974853516, 'objective/entropy': 74.28787231445312, 'objective/non_score_reward': -4.0181965827941895, 'objective/rlhf_reward': -6.460973262786865, 'objective/scores': -2.442776679992676, 'policy/approxkl_avg': 0.07044107466936111, 'policy/clipfrac_avg': 0.20518869161605835, 'loss/policy_avg': -0.05372391268610954, 'loss/value_avg': 0.3746874928474426, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3941121101379395, 'val/ratio': 1.0502586364746094, 'val/ratio_var': 0.001727528520859778, 'val/num_eos_tokens': 0, 'lr': 2.618814306712396e-05, 'episode': 3892, 'epoch': 0.48}



                                                                         
 48%|████▊     | 974/2041 [1:24:24<1:31:37,  5.15s/it]

{'eps': 0, 'objective/kl': 83.853515625, 'objective/entropy': 58.43080139160156, 'objective/non_score_reward': -4.192675590515137, 'objective/rlhf_reward': -6.599998950958252, 'objective/scores': -2.4073233604431152, 'policy/approxkl_avg': 0.03951840475201607, 'policy/clipfrac_avg': 0.14740565419197083, 'loss/policy_avg': -0.042077597230672836, 'loss/value_avg': 0.4150930643081665, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1374181509017944, 'val/ratio': 0.9608297348022461, 'val/ratio_var': 0.0010605533607304096, 'val/num_eos_tokens': 0, 'lr': 2.616364527192553e-05, 'episode': 3896, 'epoch': 0.48}



                                                                         
 48%|████▊     | 975/2041 [1:24:30<1:31:34,  5.15s/it]

{'eps': 0, 'objective/kl': 94.64064025878906, 'objective/entropy': 74.18397521972656, 'objective/non_score_reward': -4.73203182220459, 'objective/rlhf_reward': -6.773846626281738, 'objective/scores': -2.0418150424957275, 'policy/approxkl_avg': 0.06420380622148514, 'policy/clipfrac_avg': 0.18514150381088257, 'loss/policy_avg': -0.04712919145822525, 'loss/value_avg': 0.5184791684150696, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3876347541809082, 'val/ratio': 0.9954959750175476, 'val/ratio_var': 0.00010288241901434958, 'val/num_eos_tokens': 0, 'lr': 2.6139147476727094e-05, 'episode': 3900, 'epoch': 0.48}



                                                                         
 48%|████▊     | 976/2041 [1:24:35<1:31:24,  5.15s/it]

{'eps': 0, 'objective/kl': 77.35697937011719, 'objective/entropy': 57.5026741027832, 'objective/non_score_reward': -3.867849111557007, 'objective/rlhf_reward': -5.701540946960449, 'objective/scores': -1.8336915969848633, 'policy/approxkl_avg': 0.06013929471373558, 'policy/clipfrac_avg': 0.15094339847564697, 'loss/policy_avg': -0.043880537152290344, 'loss/value_avg': 0.4287560284137726, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1356559991836548, 'val/ratio': 0.9556337594985962, 'val/ratio_var': 0.0012080952292308211, 'val/num_eos_tokens': 0, 'lr': 2.6114649681528662e-05, 'episode': 3904, 'epoch': 0.48}



                                                                         
 48%|████▊     | 977/2041 [1:24:40<1:31:13,  5.14s/it]

{'eps': 0, 'objective/kl': 69.65448760986328, 'objective/entropy': 48.286922454833984, 'objective/non_score_reward': -3.48272442817688, 'objective/rlhf_reward': -5.572031021118164, 'objective/scores': -2.0893068313598633, 'policy/approxkl_avg': 0.028163662180304527, 'policy/clipfrac_avg': 0.11438678950071335, 'loss/policy_avg': -0.03938776254653931, 'loss/value_avg': 0.512846827507019, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1376334428787231, 'val/ratio': 0.9832897186279297, 'val/ratio_var': 0.00015877177065704018, 'val/num_eos_tokens': 0, 'lr': 2.6090151886330233e-05, 'episode': 3908, 'epoch': 0.48}



                                                                         
 48%|████▊     | 978/2041 [1:24:45<1:31:04,  5.14s/it]

{'eps': 0, 'objective/kl': 91.76371765136719, 'objective/entropy': 77.41433715820312, 'objective/non_score_reward': -4.5881853103637695, 'objective/rlhf_reward': -6.942529678344727, 'objective/scores': -2.354344606399536, 'policy/approxkl_avg': 0.0658850148320198, 'policy/clipfrac_avg': 0.1662735790014267, 'loss/policy_avg': -0.046943679451942444, 'loss/value_avg': 0.6807438731193542, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.34562349319458, 'val/ratio': 0.9918452501296997, 'val/ratio_var': 3.202223524567671e-05, 'val/num_eos_tokens': 0, 'lr': 2.6065654091131798e-05, 'episode': 3912, 'epoch': 0.48}



                                                                         
 48%|████▊     | 979/2041 [1:24:50<1:31:15,  5.16s/it]

{'eps': 0, 'objective/kl': 83.45516967773438, 'objective/entropy': 75.99736022949219, 'objective/non_score_reward': -4.172758102416992, 'objective/rlhf_reward': -6.155691623687744, 'objective/scores': -1.9829336404800415, 'policy/approxkl_avg': 0.045711759477853775, 'policy/clipfrac_avg': 0.1745283007621765, 'loss/policy_avg': -0.04615113511681557, 'loss/value_avg': 0.3011791706085205, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.420631766319275, 'val/ratio': 0.9893273115158081, 'val/ratio_var': 5.501537452801131e-05, 'val/num_eos_tokens': 0, 'lr': 2.6041156295933366e-05, 'episode': 3916, 'epoch': 0.48}



                                                                         
 48%|████▊     | 980/2041 [1:24:55<1:30:57,  5.14s/it]

{'eps': 0, 'objective/kl': 83.40264892578125, 'objective/entropy': 57.3620719909668, 'objective/non_score_reward': -4.170131683349609, 'objective/rlhf_reward': -6.187828063964844, 'objective/scores': -2.0176963806152344, 'policy/approxkl_avg': 0.04375161975622177, 'policy/clipfrac_avg': 0.1462264209985733, 'loss/policy_avg': -0.044603437185287476, 'loss/value_avg': 0.2850487232208252, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.028336524963379, 'val/ratio': 0.999728798866272, 'val/ratio_var': 1.0385307405158528e-06, 'val/num_eos_tokens': 0, 'lr': 2.6016658500734938e-05, 'episode': 3920, 'epoch': 0.48}



                                                                         
 48%|████▊     | 981/2041 [1:25:00<1:30:45,  5.14s/it]

{'eps': 0, 'objective/kl': 85.4232406616211, 'objective/entropy': 61.428680419921875, 'objective/non_score_reward': -4.271162033081055, 'objective/rlhf_reward': -6.278670787811279, 'objective/scores': -2.0075087547302246, 'policy/approxkl_avg': 0.049841009080410004, 'policy/clipfrac_avg': 0.12971697747707367, 'loss/policy_avg': -0.03467483073472977, 'loss/value_avg': 0.5164300203323364, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1919301748275757, 'val/ratio': 0.965590238571167, 'val/ratio_var': 0.0006995974690653384, 'val/num_eos_tokens': 0, 'lr': 2.5992160705536506e-05, 'episode': 3924, 'epoch': 0.48}



                                                                         
 48%|████▊     | 982/2041 [1:25:05<1:30:38,  5.14s/it]

{'eps': 0, 'objective/kl': 83.37487030029297, 'objective/entropy': 46.316673278808594, 'objective/non_score_reward': -4.168744087219238, 'objective/rlhf_reward': -7.527439117431641, 'objective/scores': -3.3586950302124023, 'policy/approxkl_avg': 1.6366286277770996, 'policy/clipfrac_avg': 0.10731131583452225, 'loss/policy_avg': -0.03283577412366867, 'loss/value_avg': 0.5906398892402649, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9001895785331726, 'val/ratio': 0.9596117734909058, 'val/ratio_var': 0.0008655814453959465, 'val/num_eos_tokens': 0, 'lr': 2.596766291033807e-05, 'episode': 3928, 'epoch': 0.48}



                                                                         
 48%|████▊     | 983/2041 [1:25:11<1:30:35,  5.14s/it]

{'eps': 0, 'objective/kl': 83.70484924316406, 'objective/entropy': 91.09873962402344, 'objective/non_score_reward': -4.185242652893066, 'objective/rlhf_reward': -6.40024995803833, 'objective/scores': -2.2150073051452637, 'policy/approxkl_avg': 0.02507871948182583, 'policy/clipfrac_avg': 0.1379716992378235, 'loss/policy_avg': -0.042899295687675476, 'loss/value_avg': 0.5616881847381592, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.8026072978973389, 'val/ratio': 0.9922264218330383, 'val/ratio_var': 3.2538202503928915e-05, 'val/num_eos_tokens': 0, 'lr': 2.5943165115139638e-05, 'episode': 3932, 'epoch': 0.48}



                                                                         
 48%|████▊     | 984/2041 [1:25:16<1:30:16,  5.12s/it]

{'eps': 0, 'objective/kl': 86.00839233398438, 'objective/entropy': 62.77368927001953, 'objective/non_score_reward': -4.300419807434082, 'objective/rlhf_reward': -7.01680850982666, 'objective/scores': -2.716388463973999, 'policy/approxkl_avg': 0.044174324721097946, 'policy/clipfrac_avg': 0.1533018946647644, 'loss/policy_avg': -0.04607279226183891, 'loss/value_avg': 0.3945843279361725, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2530338764190674, 'val/ratio': 0.9668157696723938, 'val/ratio_var': 0.0006468613282777369, 'val/num_eos_tokens': 0, 'lr': 2.591866731994121e-05, 'episode': 3936, 'epoch': 0.48}



                                                                         
 48%|████▊     | 985/2041 [1:25:21<1:30:22,  5.13s/it]

{'eps': 0, 'objective/kl': 84.31651306152344, 'objective/entropy': 70.30986785888672, 'objective/non_score_reward': -4.21582555770874, 'objective/rlhf_reward': -6.538775444030762, 'objective/scores': -2.3229498863220215, 'policy/approxkl_avg': 0.05926847457885742, 'policy/clipfrac_avg': 0.14150944352149963, 'loss/policy_avg': -0.03958597034215927, 'loss/value_avg': 0.4409651756286621, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.248741865158081, 'val/ratio': 0.9580854177474976, 'val/ratio_var': 0.0009756071376614273, 'val/num_eos_tokens': 0, 'lr': 2.5894169524742774e-05, 'episode': 3940, 'epoch': 0.48}



                                                                         
 48%|████▊     | 986/2041 [1:25:26<1:30:15,  5.13s/it]

{'eps': 0, 'objective/kl': 87.79515075683594, 'objective/entropy': 54.67691421508789, 'objective/non_score_reward': -4.3897576332092285, 'objective/rlhf_reward': -6.649641990661621, 'objective/scores': -2.2598843574523926, 'policy/approxkl_avg': 0.17398029565811157, 'policy/clipfrac_avg': 0.14976415038108826, 'loss/policy_avg': -0.038910504430532455, 'loss/value_avg': 0.434081494808197, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.186872124671936, 'val/ratio': 0.9700535535812378, 'val/ratio_var': 0.0004868759715463966, 'val/num_eos_tokens': 0, 'lr': 2.5869671729544342e-05, 'episode': 3944, 'epoch': 0.48}



                                                                         
 48%|████▊     | 987/2041 [1:25:31<1:30:48,  5.17s/it]

{'eps': 0, 'objective/kl': 72.79423522949219, 'objective/entropy': 75.36272430419922, 'objective/non_score_reward': -3.639711856842041, 'objective/rlhf_reward': -6.342493057250977, 'objective/scores': -2.7027812004089355, 'policy/approxkl_avg': 0.04385881498456001, 'policy/clipfrac_avg': 0.17688679695129395, 'loss/policy_avg': -0.050614796578884125, 'loss/value_avg': 0.4271083176136017, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4278349876403809, 'val/ratio': 0.9881892800331116, 'val/ratio_var': 6.715919153066352e-05, 'val/num_eos_tokens': 0, 'lr': 2.584517393434591e-05, 'episode': 3948, 'epoch': 0.48}



                                                                         
 48%|████▊     | 988/2041 [1:25:36<1:30:15,  5.14s/it]

{'eps': 0, 'objective/kl': 74.7653579711914, 'objective/entropy': 56.37556457519531, 'objective/non_score_reward': -3.7382681369781494, 'objective/rlhf_reward': -5.601613998413086, 'objective/scores': -1.8633458614349365, 'policy/approxkl_avg': 0.060406144708395004, 'policy/clipfrac_avg': 0.14150942862033844, 'loss/policy_avg': -0.03596024960279465, 'loss/value_avg': 0.46976515650749207, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1119521856307983, 'val/ratio': 1.0055960416793823, 'val/ratio_var': 0.0001657981629250571, 'val/num_eos_tokens': 0, 'lr': 2.5820676139147475e-05, 'episode': 3952, 'epoch': 0.48}



                                                                         
 48%|████▊     | 989/2041 [1:25:41<1:29:59,  5.13s/it]

{'eps': 0, 'objective/kl': 87.47496032714844, 'objective/entropy': 97.18842315673828, 'objective/non_score_reward': -4.373747825622559, 'objective/rlhf_reward': -6.149825572967529, 'objective/scores': -1.7760778665542603, 'policy/approxkl_avg': 0.05176572874188423, 'policy/clipfrac_avg': 0.18278302252292633, 'loss/policy_avg': -0.050230853259563446, 'loss/value_avg': 0.5031222105026245, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.7639024257659912, 'val/ratio': 0.9865000247955322, 'val/ratio_var': 0.00010877634485950693, 'val/num_eos_tokens': 0, 'lr': 2.5796178343949047e-05, 'episode': 3956, 'epoch': 0.48}



                                                                         
 49%|████▊     | 990/2041 [1:25:47<1:29:47,  5.13s/it]

{'eps': 0, 'objective/kl': 72.40250396728516, 'objective/entropy': 58.28617858886719, 'objective/non_score_reward': -3.6201252937316895, 'objective/rlhf_reward': -6.191368103027344, 'objective/scores': -2.571242570877075, 'policy/approxkl_avg': 0.08090350776910782, 'policy/clipfrac_avg': 0.13679245114326477, 'loss/policy_avg': -0.04041576385498047, 'loss/value_avg': 0.42720991373062134, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2650259733200073, 'val/ratio': 0.9727116227149963, 'val/ratio_var': 0.00048698417958803475, 'val/num_eos_tokens': 0, 'lr': 2.5771680548750615e-05, 'episode': 3960, 'epoch': 0.49}



                                                                         
 49%|████▊     | 991/2041 [1:25:52<1:30:19,  5.16s/it]

{'eps': 0, 'objective/kl': 82.72296142578125, 'objective/entropy': 59.10492706298828, 'objective/non_score_reward': -4.136147975921631, 'objective/rlhf_reward': -6.653412818908691, 'objective/scores': -2.5172648429870605, 'policy/approxkl_avg': 0.0555768683552742, 'policy/clipfrac_avg': 0.13679245114326477, 'loss/policy_avg': -0.04100368916988373, 'loss/value_avg': 0.406732976436615, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2157090902328491, 'val/ratio': 0.9459178447723389, 'val/ratio_var': 0.0018522058380767703, 'val/num_eos_tokens': 0, 'lr': 2.574718275355218e-05, 'episode': 3964, 'epoch': 0.49}



                                                                         
 49%|████▊     | 992/2041 [1:25:57<1:30:28,  5.17s/it]

{'eps': 0, 'objective/kl': 68.90143585205078, 'objective/entropy': 74.30381774902344, 'objective/non_score_reward': -3.4450719356536865, 'objective/rlhf_reward': -5.5604119300842285, 'objective/scores': -2.115339994430542, 'policy/approxkl_avg': 0.02349863573908806, 'policy/clipfrac_avg': 0.13089622557163239, 'loss/policy_avg': -0.04030057042837143, 'loss/value_avg': 0.30990898609161377, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4320305585861206, 'val/ratio': 0.9771698713302612, 'val/ratio_var': 0.0003095078282058239, 'val/num_eos_tokens': 0, 'lr': 2.5722684958353747e-05, 'episode': 3968, 'epoch': 0.49}



                                                                         
 49%|████▊     | 993/2041 [1:26:02<1:30:56,  5.21s/it]

{'eps': 0, 'objective/kl': 83.32515716552734, 'objective/entropy': 76.22845458984375, 'objective/non_score_reward': -4.166257858276367, 'objective/rlhf_reward': -6.404708385467529, 'objective/scores': -2.238450527191162, 'policy/approxkl_avg': 0.030941084027290344, 'policy/clipfrac_avg': 0.1462264060974121, 'loss/policy_avg': -0.04082510992884636, 'loss/value_avg': 0.4696628451347351, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4941564798355103, 'val/ratio': 0.9632426500320435, 'val/ratio_var': 0.000936224649194628, 'val/num_eos_tokens': 0, 'lr': 2.569818716315532e-05, 'episode': 3972, 'epoch': 0.49}



                                                                         
 49%|████▊     | 994/2041 [1:26:07<1:30:20,  5.18s/it]

{'eps': 0, 'objective/kl': 84.18415832519531, 'objective/entropy': 65.00581359863281, 'objective/non_score_reward': -4.209207534790039, 'objective/rlhf_reward': -6.391218185424805, 'objective/scores': -2.1820108890533447, 'policy/approxkl_avg': 0.12308584153652191, 'policy/clipfrac_avg': 0.16391509771347046, 'loss/policy_avg': -0.04586271941661835, 'loss/value_avg': 0.46537503600120544, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3246890306472778, 'val/ratio': 0.9762824773788452, 'val/ratio_var': 0.00026879904908128083, 'val/num_eos_tokens': 0, 'lr': 2.5673689367956883e-05, 'episode': 3976, 'epoch': 0.49}



                                                                         
 49%|████▉     | 995/2041 [1:26:13<1:30:21,  5.18s/it]

{'eps': 0, 'objective/kl': 66.58104705810547, 'objective/entropy': 59.036441802978516, 'objective/non_score_reward': -3.329052448272705, 'objective/rlhf_reward': -5.668769836425781, 'objective/scores': -2.339717149734497, 'policy/approxkl_avg': 0.028145546093583107, 'policy/clipfrac_avg': 0.10613207519054413, 'loss/policy_avg': -0.03573612496256828, 'loss/value_avg': 0.26969075202941895, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1625967025756836, 'val/ratio': 0.9725608825683594, 'val/ratio_var': 0.00044505856931209564, 'val/num_eos_tokens': 0, 'lr': 2.564919157275845e-05, 'episode': 3980, 'epoch': 0.49}



                                                                         
 49%|████▉     | 996/2041 [1:26:18<1:30:27,  5.19s/it]

{'eps': 0, 'objective/kl': 92.1553955078125, 'objective/entropy': 55.88676452636719, 'objective/non_score_reward': -4.607769966125488, 'objective/rlhf_reward': -5.862612247467041, 'objective/scores': -1.2548422813415527, 'policy/approxkl_avg': 0.04887189716100693, 'policy/clipfrac_avg': 0.12853774428367615, 'loss/policy_avg': -0.03485536202788353, 'loss/value_avg': 0.294933557510376, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2258391380310059, 'val/ratio': 0.9539490342140198, 'val/ratio_var': 0.0013355360133573413, 'val/num_eos_tokens': 0, 'lr': 2.5624693777560023e-05, 'episode': 3984, 'epoch': 0.49}



                                                                         
 49%|████▉     | 997/2041 [1:26:23<1:29:50,  5.16s/it]

{'eps': 0, 'objective/kl': 86.53235626220703, 'objective/entropy': 79.17912292480469, 'objective/non_score_reward': -4.326618194580078, 'objective/rlhf_reward': -6.490917682647705, 'objective/scores': -2.164299488067627, 'policy/approxkl_avg': 0.044101495295763016, 'policy/clipfrac_avg': 0.16391508281230927, 'loss/policy_avg': -0.050807222723960876, 'loss/value_avg': 0.31766682863235474, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5573456287384033, 'val/ratio': 0.9913095831871033, 'val/ratio_var': 3.679060682770796e-05, 'val/num_eos_tokens': 0, 'lr': 2.560019598236159e-05, 'episode': 3988, 'epoch': 0.49}



                                                                         
 49%|████▉     | 998/2041 [1:26:28<1:30:03,  5.18s/it]

{'eps': 0, 'objective/kl': 100.38580322265625, 'objective/entropy': 79.09178161621094, 'objective/non_score_reward': -5.019289970397949, 'objective/rlhf_reward': -7.289745807647705, 'objective/scores': -2.270455837249756, 'policy/approxkl_avg': 0.05360229313373566, 'policy/clipfrac_avg': 0.19575472176074982, 'loss/policy_avg': -0.053526610136032104, 'loss/value_avg': 0.5317845344543457, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5179173946380615, 'val/ratio': 0.9520822763442993, 'val/ratio_var': 0.0013961497461423278, 'val/num_eos_tokens': 0, 'lr': 2.5575698187163156e-05, 'episode': 3992, 'epoch': 0.49}



                                                                         
 49%|████▉     | 999/2041 [1:26:33<1:30:12,  5.19s/it]

{'eps': 0, 'objective/kl': 73.9102783203125, 'objective/entropy': 76.02891540527344, 'objective/non_score_reward': -3.695513963699341, 'objective/rlhf_reward': -6.407824993133545, 'objective/scores': -2.712311029434204, 'policy/approxkl_avg': 0.11675121635198593, 'policy/clipfrac_avg': 0.15801885724067688, 'loss/policy_avg': -0.04514941945672035, 'loss/value_avg': 0.369179368019104, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4046859741210938, 'val/ratio': 1.5356626510620117, 'val/ratio_var': 0.3595505654811859, 'val/num_eos_tokens': 0, 'lr': 2.5551200391964724e-05, 'episode': 3996, 'epoch': 0.49}



                                                                         
 49%|████▉     | 1000/2041 [1:26:38<1:29:41,  5.17s/it]

{'eps': 0, 'objective/kl': 74.75711059570312, 'objective/entropy': 90.86053466796875, 'objective/non_score_reward': -3.7378556728363037, 'objective/rlhf_reward': -5.3833723068237305, 'objective/scores': -1.6455166339874268, 'policy/approxkl_avg': 0.1349865049123764, 'policy/clipfrac_avg': 0.13325470685958862, 'loss/policy_avg': -0.04000372067093849, 'loss/value_avg': 0.4158344864845276, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5991982221603394, 'val/ratio': 0.9575742483139038, 'val/ratio_var': 0.000969001033809036, 'val/num_eos_tokens': 0, 'lr': 2.5526702596766295e-05, 'episode': 4000, 'epoch': 0.49}



                                                                         
 49%|████▉     | 1001/2041 [1:27:02<3:03:27, 10.58s/it]

{'eps': 0, 'objective/kl': 82.36299133300781, 'objective/entropy': 63.747779846191406, 'objective/non_score_reward': -4.118149757385254, 'objective/rlhf_reward': -6.312277317047119, 'objective/scores': -2.1941275596618652, 'policy/approxkl_avg': 0.053861603140830994, 'policy/clipfrac_avg': 0.15801887214183807, 'loss/policy_avg': -0.044291552156209946, 'loss/value_avg': 0.3202934265136719, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3034558296203613, 'val/ratio': 1.0136250257492065, 'val/ratio_var': 0.00025190439191646874, 'val/num_eos_tokens': 0, 'lr': 2.550220480156786e-05, 'episode': 4004, 'epoch': 0.49}



                                                                         
 49%|████▉     | 1002/2041 [1:27:07<2:35:18,  8.97s/it]

{'eps': 0, 'objective/kl': 74.28486633300781, 'objective/entropy': 73.38228607177734, 'objective/non_score_reward': -3.7142434120178223, 'objective/rlhf_reward': -5.239443778991699, 'objective/scores': -1.525200605392456, 'policy/approxkl_avg': 0.0320875383913517, 'policy/clipfrac_avg': 0.10966980457305908, 'loss/policy_avg': -0.03743509203195572, 'loss/value_avg': 0.3074408173561096, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3864109516143799, 'val/ratio': 0.9844890236854553, 'val/ratio_var': 0.00012033335951855406, 'val/num_eos_tokens': 0, 'lr': 2.5477707006369428e-05, 'episode': 4008, 'epoch': 0.49}



                                                                         
 49%|████▉     | 1003/2041 [1:27:12<2:15:25,  7.83s/it]

{'eps': 0, 'objective/kl': 72.68327331542969, 'objective/entropy': 59.29787826538086, 'objective/non_score_reward': -3.6341638565063477, 'objective/rlhf_reward': -6.055164337158203, 'objective/scores': -2.4210004806518555, 'policy/approxkl_avg': 0.0998067855834961, 'policy/clipfrac_avg': 0.10966981947422028, 'loss/policy_avg': -0.03438802435994148, 'loss/value_avg': 0.37452226877212524, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1834828853607178, 'val/ratio': 0.9940986633300781, 'val/ratio_var': 1.83215452125296e-05, 'val/num_eos_tokens': 0, 'lr': 2.5453209211170996e-05, 'episode': 4012, 'epoch': 0.49}



                                                                         
 49%|████▉     | 1004/2041 [1:27:17<2:00:54,  7.00s/it]

{'eps': 0, 'objective/kl': 76.61058044433594, 'objective/entropy': 71.56410217285156, 'objective/non_score_reward': -3.830528497695923, 'objective/rlhf_reward': -5.994436264038086, 'objective/scores': -2.163908004760742, 'policy/approxkl_avg': 0.2710173726081848, 'policy/clipfrac_avg': 0.14386792480945587, 'loss/policy_avg': -0.03586370125412941, 'loss/value_avg': 0.31055134534835815, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2383023500442505, 'val/ratio': 0.9721874594688416, 'val/ratio_var': 0.0004018110630568117, 'val/num_eos_tokens': 0, 'lr': 2.542871141597256e-05, 'episode': 4016, 'epoch': 0.49}



                                                                         
 49%|████▉     | 1005/2041 [1:27:22<1:51:04,  6.43s/it]

{'eps': 0, 'objective/kl': 81.63851928710938, 'objective/entropy': 60.08533477783203, 'objective/non_score_reward': -4.081925868988037, 'objective/rlhf_reward': -5.459770679473877, 'objective/scores': -1.3778448104858398, 'policy/approxkl_avg': 0.06807556748390198, 'policy/clipfrac_avg': 0.13443395495414734, 'loss/policy_avg': -0.04493013769388199, 'loss/value_avg': 0.36807191371917725, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2098238468170166, 'val/ratio': 0.9685112237930298, 'val/ratio_var': 0.0005067861638963223, 'val/num_eos_tokens': 0, 'lr': 2.5404213620774132e-05, 'episode': 4020, 'epoch': 0.49}



                                                                         
 49%|████▉     | 1006/2041 [1:27:27<1:43:59,  6.03s/it]

{'eps': 0, 'objective/kl': 83.00782775878906, 'objective/entropy': 75.12895202636719, 'objective/non_score_reward': -4.150391578674316, 'objective/rlhf_reward': -7.00665283203125, 'objective/scores': -2.8562612533569336, 'policy/approxkl_avg': 0.048675812780857086, 'policy/clipfrac_avg': 0.1674528270959854, 'loss/policy_avg': -0.051393527537584305, 'loss/value_avg': 0.4223911166191101, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4609590768814087, 'val/ratio': 0.9969320297241211, 'val/ratio_var': 3.719871892826632e-05, 'val/num_eos_tokens': 0, 'lr': 2.53797158255757e-05, 'episode': 4024, 'epoch': 0.49}



                                                                         
 49%|████▉     | 1007/2041 [1:27:32<1:39:12,  5.76s/it]

{'eps': 0, 'objective/kl': 68.16087341308594, 'objective/entropy': 46.62432098388672, 'objective/non_score_reward': -3.408043622970581, 'objective/rlhf_reward': -6.619863510131836, 'objective/scores': -3.211819648742676, 'policy/approxkl_avg': 0.027151044458150864, 'policy/clipfrac_avg': 0.12146226316690445, 'loss/policy_avg': -0.03981471061706543, 'loss/value_avg': 0.37346166372299194, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0024315118789673, 'val/ratio': 0.9940301775932312, 'val/ratio_var': 1.9495173546602018e-05, 'val/num_eos_tokens': 0, 'lr': 2.5355218030377265e-05, 'episode': 4028, 'epoch': 0.49}



                                                                         
 49%|████▉     | 1008/2041 [1:27:38<1:36:12,  5.59s/it]

{'eps': 0, 'objective/kl': 84.23146057128906, 'objective/entropy': 74.61625671386719, 'objective/non_score_reward': -4.211572647094727, 'objective/rlhf_reward': -5.96047306060791, 'objective/scores': -1.7489001750946045, 'policy/approxkl_avg': 0.058315251022577286, 'policy/clipfrac_avg': 0.1533018946647644, 'loss/policy_avg': -0.0429227352142334, 'loss/value_avg': 0.28670600056648254, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3055521249771118, 'val/ratio': 0.9756152033805847, 'val/ratio_var': 0.0003207943227607757, 'val/num_eos_tokens': 0, 'lr': 2.5330720235178833e-05, 'episode': 4032, 'epoch': 0.49}



                                                                         
 49%|████▉     | 1009/2041 [1:27:43<1:33:53,  5.46s/it]

{'eps': 0, 'objective/kl': 85.76969909667969, 'objective/entropy': 55.17108917236328, 'objective/non_score_reward': -4.288484573364258, 'objective/rlhf_reward': -6.709650993347168, 'objective/scores': -2.42116641998291, 'policy/approxkl_avg': 0.028062811121344566, 'policy/clipfrac_avg': 0.10023584961891174, 'loss/policy_avg': -0.035210657864809036, 'loss/value_avg': 0.5477291941642761, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2170825004577637, 'val/ratio': 0.9785752892494202, 'val/ratio_var': 0.0003147005336359143, 'val/num_eos_tokens': 0, 'lr': 2.5306222439980404e-05, 'episode': 4036, 'epoch': 0.49}



                                                                         
 49%|████▉     | 1010/2041 [1:27:48<1:32:45,  5.40s/it]

{'eps': 0, 'objective/kl': 109.779541015625, 'objective/entropy': 63.24343490600586, 'objective/non_score_reward': -5.488976955413818, 'objective/rlhf_reward': -7.108063697814941, 'objective/scores': -1.619086742401123, 'policy/approxkl_avg': 0.0743400901556015, 'policy/clipfrac_avg': 0.14858490228652954, 'loss/policy_avg': -0.047274135053157806, 'loss/value_avg': 0.4686603546142578, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.318526268005371, 'val/ratio': 1.0182234048843384, 'val/ratio_var': 0.000745678145904094, 'val/num_eos_tokens': 0, 'lr': 2.5281724644781972e-05, 'episode': 4040, 'epoch': 0.49}



                                                                         
 50%|████▉     | 1011/2041 [1:27:53<1:31:57,  5.36s/it]

{'eps': 0, 'objective/kl': 96.68040466308594, 'objective/entropy': 74.90579223632812, 'objective/non_score_reward': -4.834020614624023, 'objective/rlhf_reward': -6.3875908851623535, 'objective/scores': -1.55357027053833, 'policy/approxkl_avg': 0.09136112034320831, 'policy/clipfrac_avg': 0.1521226316690445, 'loss/policy_avg': -0.03893420100212097, 'loss/value_avg': 0.41841477155685425, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.265318751335144, 'val/ratio': 0.9906156659126282, 'val/ratio_var': 3.947062941733748e-05, 'val/num_eos_tokens': 0, 'lr': 2.5257226849583537e-05, 'episode': 4044, 'epoch': 0.5}



                                                                         
 50%|████▉     | 1012/2041 [1:27:59<1:31:08,  5.31s/it]

{'eps': 0, 'objective/kl': 92.23399353027344, 'objective/entropy': 88.21553039550781, 'objective/non_score_reward': -4.611700057983398, 'objective/rlhf_reward': -7.415822982788086, 'objective/scores': -2.8041229248046875, 'policy/approxkl_avg': 0.08223700523376465, 'policy/clipfrac_avg': 0.09316037595272064, 'loss/policy_avg': -0.03949077054858208, 'loss/value_avg': 0.6971096992492676, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.6648342609405518, 'val/ratio': 0.9710347652435303, 'val/ratio_var': 0.0006517820875160396, 'val/num_eos_tokens': 0, 'lr': 2.5232729054385108e-05, 'episode': 4048, 'epoch': 0.5}



                                                                         
 50%|████▉     | 1013/2041 [1:28:04<1:29:51,  5.24s/it]

{'eps': 0, 'objective/kl': 87.03516387939453, 'objective/entropy': 73.47177124023438, 'objective/non_score_reward': -4.3517584800720215, 'objective/rlhf_reward': -7.094329833984375, 'objective/scores': -2.7425715923309326, 'policy/approxkl_avg': 0.020996814593672752, 'policy/clipfrac_avg': 0.06603773683309555, 'loss/policy_avg': -0.030650299042463303, 'loss/value_avg': 0.5525350570678711, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.508237361907959, 'val/ratio': 0.9683777689933777, 'val/ratio_var': 0.0007511104340665042, 'val/num_eos_tokens': 0, 'lr': 2.5208231259186676e-05, 'episode': 4052, 'epoch': 0.5}



                                                                         
 50%|████▉     | 1014/2041 [1:28:09<1:29:11,  5.21s/it]

{'eps': 0, 'objective/kl': 96.86174774169922, 'objective/entropy': 77.82789611816406, 'objective/non_score_reward': -4.843087673187256, 'objective/rlhf_reward': -6.522756576538086, 'objective/scores': -1.6796691417694092, 'policy/approxkl_avg': 0.05366133898496628, 'policy/clipfrac_avg': 0.17216980457305908, 'loss/policy_avg': -0.05037438124418259, 'loss/value_avg': 0.6692340970039368, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.449588418006897, 'val/ratio': 0.9526296257972717, 'val/ratio_var': 0.0013237171806395054, 'val/num_eos_tokens': 0, 'lr': 2.518373346398824e-05, 'episode': 4056, 'epoch': 0.5}



                                                                         
 50%|████▉     | 1015/2041 [1:28:14<1:28:47,  5.19s/it]

{'eps': 0, 'objective/kl': 71.35334014892578, 'objective/entropy': 58.68714141845703, 'objective/non_score_reward': -3.567667007446289, 'objective/rlhf_reward': -5.37811803817749, 'objective/scores': -1.8104510307312012, 'policy/approxkl_avg': 0.02901424467563629, 'policy/clipfrac_avg': 0.09787736088037491, 'loss/policy_avg': -0.03376031666994095, 'loss/value_avg': 0.30965662002563477, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0341778993606567, 'val/ratio': 0.9813129901885986, 'val/ratio_var': 0.0001875792513601482, 'val/num_eos_tokens': 0, 'lr': 2.515923566878981e-05, 'episode': 4060, 'epoch': 0.5}



                                                                         
 50%|████▉     | 1016/2041 [1:28:19<1:28:50,  5.20s/it]

{'eps': 0, 'objective/kl': 65.93265533447266, 'objective/entropy': 23.7596378326416, 'objective/non_score_reward': -3.296632766723633, 'objective/rlhf_reward': -5.051018714904785, 'objective/scores': -1.7543858289718628, 'policy/approxkl_avg': 0.013087340630590916, 'policy/clipfrac_avg': 0.04599056765437126, 'loss/policy_avg': -0.016845639795064926, 'loss/value_avg': 0.24383893609046936, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5703516006469727, 'val/ratio': 0.9850040674209595, 'val/ratio_var': 0.0001501783262938261, 'val/num_eos_tokens': 0, 'lr': 2.513473787359138e-05, 'episode': 4064, 'epoch': 0.5}



                                                                         
 50%|████▉     | 1017/2041 [1:28:24<1:28:31,  5.19s/it]

{'eps': 0, 'objective/kl': 67.57496643066406, 'objective/entropy': 37.478492736816406, 'objective/non_score_reward': -3.3787479400634766, 'objective/rlhf_reward': -6.093323707580566, 'objective/scores': -2.7145755290985107, 'policy/approxkl_avg': 0.06231454387307167, 'policy/clipfrac_avg': 0.11202830076217651, 'loss/policy_avg': -0.03768068552017212, 'loss/value_avg': 0.48058509826660156, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8163271546363831, 'val/ratio': 1.012440800666809, 'val/ratio_var': 0.0007579525117762387, 'val/num_eos_tokens': 0, 'lr': 2.5110240078392945e-05, 'episode': 4068, 'epoch': 0.5}



                                                                         
 50%|████▉     | 1018/2041 [1:28:29<1:28:01,  5.16s/it]

{'eps': 0, 'objective/kl': 62.29350280761719, 'objective/entropy': 53.198036193847656, 'objective/non_score_reward': -3.114675283432007, 'objective/rlhf_reward': -5.393210411071777, 'objective/scores': -2.2785351276397705, 'policy/approxkl_avg': 0.023388251662254333, 'policy/clipfrac_avg': 0.08490566164255142, 'loss/policy_avg': -0.0317157581448555, 'loss/value_avg': 0.174943745136261, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8350223898887634, 'val/ratio': 0.9770370721817017, 'val/ratio_var': 0.0003649850550573319, 'val/num_eos_tokens': 0, 'lr': 2.5085742283194513e-05, 'episode': 4072, 'epoch': 0.5}
 50%|████▉     | 1019/2041 [1:28:34<1:27:30,  5.14s/it]

{'eps': 0, 'objective/kl': 92.87667846679688, 'objective/entropy': 84.05302429199219, 'objective/non_score_reward': -4.643834114074707, 'objective/rlhf_reward': -6.972320079803467, 'objective/scores': -2.3284859657287598, 'policy/approxkl_avg': 0.06006905809044838, 'policy/clipfrac_avg': 0.1745283007621765, 'loss/policy_avg': -0.04861053079366684, 'loss/value_avg': 0.4283605217933655, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5219216346740723, 'val/ratio': 0.9973071813583374, 'val/ratio_var': 0.0001995424972847104, 'val/num_eos_tokens': 0, 'lr': 2.5061244487996084e-05, 'episode': 4076, 'epoch': 0.5}
 50%|████▉     | 1020/2041 [1:28:40<1:27:24,  5.14s/it]

{'eps': 0, 'objective/kl': 82.41223907470703, 'objective/entropy': 54.582557678222656, 'objective/non_score_reward': -4.120611667633057, 'objective/rlhf_reward': -6.1773786544799805, 'objective/scores': -2.056766986846924, 'policy/approxkl_avg': 0.017901722341775894, 'policy/clipfrac_avg': 0.08608490973711014, 'loss/policy_avg': -0.025468602776527405, 'loss/value_avg': 0.31845515966415405, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0592652559280396, 'val/ratio': 0.9736288785934448, 'val/ratio_var': 0.00048487578169442713, 'val/num_eos_tokens': 0, 'lr': 2.5036746692797646e-05, 'episode': 4080, 'epoch': 0.5}
 50%|█████     | 1021/2041 [1:28:45<1:27:32,  5.15s/it]

{'eps': 0, 'objective/kl': 64.58731079101562, 'objective/entropy': 44.06336975097656, 'objective/non_score_reward': -3.229365825653076, 'objective/rlhf_reward': -5.232889652252197, 'objective/scores': -2.003523826599121, 'policy/approxkl_avg': 0.04245608299970627, 'policy/clipfrac_avg': 0.10259434580802917, 'loss/policy_avg': -0.03498408943414688, 'loss/value_avg': 0.1682257056236267, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8271521925926208, 'val/ratio': 0.9629883170127869, 'val/ratio_var': 0.0008599140564911067, 'val/num_eos_tokens': 0, 'lr': 2.5012248897599217e-05, 'episode': 4084, 'epoch': 0.5}
 50%|█████     | 1022/2041 [1:28:53<1:43:26,  6.09s/it]

{'eps': 0, 'objective/kl': 94.08707427978516, 'objective/entropy': 52.656761169433594, 'objective/non_score_reward': -4.7043538093566895, 'objective/rlhf_reward': -6.649936676025391, 'objective/scores': -1.9455827474594116, 'policy/approxkl_avg': 0.025108523666858673, 'policy/clipfrac_avg': 0.12382075190544128, 'loss/policy_avg': -0.0354621522128582, 'loss/value_avg': 0.3850715756416321, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9824459552764893, 'val/ratio': 0.9703067541122437, 'val/ratio_var': 0.0006422853912226856, 'val/num_eos_tokens': 0, 'lr': 2.4987751102400785e-05, 'episode': 4088, 'epoch': 0.5}
 50%|█████     | 1023/2041 [1:28:58<1:38:31,  5.81s/it]

{'eps': 0, 'objective/kl': 62.59252166748047, 'objective/entropy': 27.35838508605957, 'objective/non_score_reward': -3.129626750946045, 'objective/rlhf_reward': -5.952119827270508, 'objective/scores': -2.822493076324463, 'policy/approxkl_avg': 0.025369349867105484, 'policy/clipfrac_avg': 0.09316037595272064, 'loss/policy_avg': -0.03274890035390854, 'loss/value_avg': 0.28217214345932007, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5249380469322205, 'val/ratio': 0.977602481842041, 'val/ratio_var': 0.0003989440156146884, 'val/num_eos_tokens': 0, 'lr': 2.4963253307202353e-05, 'episode': 4092, 'epoch': 0.5}
 50%|█████     | 1024/2041 [1:29:03<1:35:21,  5.63s/it]

{'eps': 0, 'objective/kl': 83.24161529541016, 'objective/entropy': 61.00111389160156, 'objective/non_score_reward': -4.16208028793335, 'objective/rlhf_reward': -5.719103813171387, 'objective/scores': -1.557023286819458, 'policy/approxkl_avg': 0.043364860117435455, 'policy/clipfrac_avg': 0.09551887214183807, 'loss/policy_avg': -0.030248522758483887, 'loss/value_avg': 0.29951179027557373, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0833243131637573, 'val/ratio': 0.9569823741912842, 'val/ratio_var': 0.0012161674676463008, 'val/num_eos_tokens': 0, 'lr': 2.4938755512003918e-05, 'episode': 4096, 'epoch': 0.5}
 50%|█████     | 1025/2041 [1:29:08<1:32:35,  5.47s/it]

{'eps': 0, 'objective/kl': 76.1972885131836, 'objective/entropy': 20.81591796875, 'objective/non_score_reward': -3.8098642826080322, 'objective/rlhf_reward': -6.179171562194824, 'objective/scores': -2.369307041168213, 'policy/approxkl_avg': 0.008670215494930744, 'policy/clipfrac_avg': 0.05188679322600365, 'loss/policy_avg': -0.020063616335392, 'loss/value_avg': 0.23205676674842834, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5261743664741516, 'val/ratio': 0.9882585406303406, 'val/ratio_var': 9.436709660803899e-05, 'val/num_eos_tokens': 0, 'lr': 2.491425771680549e-05, 'episode': 4100, 'epoch': 0.5}
 50%|█████     | 1026/2041 [1:29:14<1:30:39,  5.36s/it]

{'eps': 0, 'objective/kl': 70.05061340332031, 'objective/entropy': 38.4784049987793, 'objective/non_score_reward': -3.502530097961426, 'objective/rlhf_reward': -5.57174825668335, 'objective/scores': -2.069218158721924, 'policy/approxkl_avg': 0.3626943528652191, 'policy/clipfrac_avg': 0.10495282709598541, 'loss/policy_avg': -0.03356058523058891, 'loss/value_avg': 0.24129286408424377, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.77897047996521, 'val/ratio': 0.9702125787734985, 'val/ratio_var': 0.000509515986777842, 'val/num_eos_tokens': 0, 'lr': 2.4889759921607057e-05, 'episode': 4104, 'epoch': 0.5}
 50%|█████     | 1027/2041 [1:29:19<1:29:40,  5.31s/it]

{'eps': 0, 'objective/kl': 78.9033203125, 'objective/entropy': 51.04755783081055, 'objective/non_score_reward': -3.9451661109924316, 'objective/rlhf_reward': -5.487631797790527, 'objective/scores': -1.5424659252166748, 'policy/approxkl_avg': 0.10762446373701096, 'policy/clipfrac_avg': 0.11202830076217651, 'loss/policy_avg': -0.03430610150098801, 'loss/value_avg': 0.33639129996299744, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9146793484687805, 'val/ratio': 1.0041377544403076, 'val/ratio_var': 2.3542874259874225e-05, 'val/num_eos_tokens': 0, 'lr': 2.4865262126408625e-05, 'episode': 4108, 'epoch': 0.5}
 50%|█████     | 1028/2041 [1:29:24<1:28:50,  5.26s/it]

{'eps': 0, 'objective/kl': 79.91627502441406, 'objective/entropy': 68.52053833007812, 'objective/non_score_reward': -3.9958136081695557, 'objective/rlhf_reward': -5.5432209968566895, 'objective/scores': -1.5474073886871338, 'policy/approxkl_avg': 0.03198300302028656, 'policy/clipfrac_avg': 0.12853772938251495, 'loss/policy_avg': -0.03790684789419174, 'loss/value_avg': 0.2876833975315094, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2654474973678589, 'val/ratio': 0.9568964838981628, 'val/ratio_var': 0.0012156037846580148, 'val/num_eos_tokens': 0, 'lr': 2.4840764331210193e-05, 'episode': 4112, 'epoch': 0.5}
 50%|█████     | 1029/2041 [1:29:29<1:28:01,  5.22s/it]

{'eps': 0, 'objective/kl': 87.85720825195312, 'objective/entropy': 67.09970092773438, 'objective/non_score_reward': -4.392860412597656, 'objective/rlhf_reward': -6.053959846496582, 'objective/scores': -1.6610993146896362, 'policy/approxkl_avg': 0.07683546096086502, 'policy/clipfrac_avg': 0.1108490601181984, 'loss/policy_avg': -0.03228569030761719, 'loss/value_avg': 0.3302830457687378, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0930585861206055, 'val/ratio': 0.9830669164657593, 'val/ratio_var': 0.00014547811588272452, 'val/num_eos_tokens': 0, 'lr': 2.4816266536011758e-05, 'episode': 4116, 'epoch': 0.5}
 50%|█████     | 1030/2041 [1:29:34<1:27:25,  5.19s/it]

{'eps': 0, 'objective/kl': 84.76887512207031, 'objective/entropy': 48.38667297363281, 'objective/non_score_reward': -4.238443851470947, 'objective/rlhf_reward': -6.018307685852051, 'objective/scores': -1.779863715171814, 'policy/approxkl_avg': 0.0366334393620491, 'policy/clipfrac_avg': 0.10259433835744858, 'loss/policy_avg': -0.03004256635904312, 'loss/value_avg': 0.2765062749385834, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8698052167892456, 'val/ratio': 0.9818546772003174, 'val/ratio_var': 0.00021210878912825137, 'val/num_eos_tokens': 0, 'lr': 2.479176874081333e-05, 'episode': 4120, 'epoch': 0.5}
 51%|█████     | 1031/2041 [1:29:39<1:27:15,  5.18s/it]

{'eps': 0, 'objective/kl': 75.908447265625, 'objective/entropy': 46.72575759887695, 'objective/non_score_reward': -3.795422315597534, 'objective/rlhf_reward': -6.170188903808594, 'objective/scores': -2.3747663497924805, 'policy/approxkl_avg': 0.1899401992559433, 'policy/clipfrac_avg': 0.10495282709598541, 'loss/policy_avg': -0.02477320469915867, 'loss/value_avg': 0.3601512312889099, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8444573283195496, 'val/ratio': 0.9591516256332397, 'val/ratio_var': 0.0009245430119335651, 'val/num_eos_tokens': 0, 'lr': 2.4767270945614894e-05, 'episode': 4124, 'epoch': 0.51}
 51%|█████     | 1032/2041 [1:29:44<1:26:59,  5.17s/it]

{'eps': 0, 'objective/kl': 58.7193603515625, 'objective/entropy': 49.56666946411133, 'objective/non_score_reward': -2.9359683990478516, 'objective/rlhf_reward': -4.9644293785095215, 'objective/scores': -2.02846097946167, 'policy/approxkl_avg': 0.06361833959817886, 'policy/clipfrac_avg': 0.10141509771347046, 'loss/policy_avg': -0.03422657772898674, 'loss/value_avg': 0.19981765747070312, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8930566906929016, 'val/ratio': 0.969327449798584, 'val/ratio_var': 0.0005448543233796954, 'val/num_eos_tokens': 0, 'lr': 2.4742773150416466e-05, 'episode': 4128, 'epoch': 0.51}
 51%|█████     | 1033/2041 [1:29:50<1:26:54,  5.17s/it]

{'eps': 0, 'objective/kl': 71.09546661376953, 'objective/entropy': 43.1801643371582, 'objective/non_score_reward': -3.5547733306884766, 'objective/rlhf_reward': -5.007080078125, 'objective/scores': -1.4523069858551025, 'policy/approxkl_avg': 0.054211121052503586, 'policy/clipfrac_avg': 0.08608490973711014, 'loss/policy_avg': -0.026491500437259674, 'loss/value_avg': 0.27098459005355835, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8255269527435303, 'val/ratio': 0.9918493032455444, 'val/ratio_var': 3.255472984164953e-05, 'val/num_eos_tokens': 0, 'lr': 2.471827535521803e-05, 'episode': 4132, 'epoch': 0.51}
 51%|█████     | 1034/2041 [1:29:55<1:26:56,  5.18s/it]

{'eps': 0, 'objective/kl': 89.66320037841797, 'objective/entropy': 53.67489242553711, 'objective/non_score_reward': -4.483160018920898, 'objective/rlhf_reward': -6.834688186645508, 'objective/scores': -2.3515279293060303, 'policy/approxkl_avg': 0.10073460638523102, 'policy/clipfrac_avg': 0.11320754885673523, 'loss/policy_avg': -0.037436775863170624, 'loss/value_avg': 0.2956884205341339, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0417330265045166, 'val/ratio': 1.2926356792449951, 'val/ratio_var': 0.0692615732550621, 'val/num_eos_tokens': 0, 'lr': 2.46937775600196e-05, 'episode': 4136, 'epoch': 0.51}
 51%|█████     | 1035/2041 [1:30:00<1:26:28,  5.16s/it]

{'eps': 0, 'objective/kl': 64.86913299560547, 'objective/entropy': 37.438804626464844, 'objective/non_score_reward': -3.2434568405151367, 'objective/rlhf_reward': -5.442099571228027, 'objective/scores': -2.1986429691314697, 'policy/approxkl_avg': 0.007061666809022427, 'policy/clipfrac_avg': 0.061320751905441284, 'loss/policy_avg': -0.024805951863527298, 'loss/value_avg': 0.22200441360473633, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.794247567653656, 'val/ratio': 0.9974883794784546, 'val/ratio_var': 8.78793616720941e-06, 'val/num_eos_tokens': 0, 'lr': 2.466927976482117e-05, 'episode': 4140, 'epoch': 0.51}
 51%|█████     | 1036/2041 [1:30:05<1:26:48,  5.18s/it]

{'eps': 0, 'objective/kl': 88.43980407714844, 'objective/entropy': 56.701751708984375, 'objective/non_score_reward': -4.421990394592285, 'objective/rlhf_reward': -7.034395217895508, 'objective/scores': -2.6124048233032227, 'policy/approxkl_avg': 0.030745627358555794, 'policy/clipfrac_avg': 0.1108490526676178, 'loss/policy_avg': -0.03649289906024933, 'loss/value_avg': 0.3967444598674774, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.115692377090454, 'val/ratio': 0.9534915685653687, 'val/ratio_var': 0.0014495804207399487, 'val/num_eos_tokens': 0, 'lr': 2.4644781969622734e-05, 'episode': 4144, 'epoch': 0.51}
 51%|█████     | 1037/2041 [1:30:10<1:26:31,  5.17s/it]

{'eps': 0, 'objective/kl': 88.40950775146484, 'objective/entropy': 54.602413177490234, 'objective/non_score_reward': -4.420475482940674, 'objective/rlhf_reward': -6.436220645904541, 'objective/scores': -2.015745162963867, 'policy/approxkl_avg': 0.022508835420012474, 'policy/clipfrac_avg': 0.1108490601181984, 'loss/policy_avg': -0.03561584651470184, 'loss/value_avg': 0.3426421880722046, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.081902027130127, 'val/ratio': 0.9834963083267212, 'val/ratio_var': 0.0001682875445112586, 'val/num_eos_tokens': 0, 'lr': 2.4620284174424302e-05, 'episode': 4148, 'epoch': 0.51}
 51%|█████     | 1038/2041 [1:30:15<1:26:03,  5.15s/it]

{'eps': 0, 'objective/kl': 103.47740173339844, 'objective/entropy': 85.27366638183594, 'objective/non_score_reward': -5.173870086669922, 'objective/rlhf_reward': -7.557884216308594, 'objective/scores': -2.3840138912200928, 'policy/approxkl_avg': 0.05074773728847504, 'policy/clipfrac_avg': 0.14858490228652954, 'loss/policy_avg': -0.04653122276067734, 'loss/value_avg': 0.4981507360935211, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.539721131324768, 'val/ratio': 0.971753716468811, 'val/ratio_var': 0.0004360996827017516, 'val/num_eos_tokens': 0, 'lr': 2.459578637922587e-05, 'episode': 4152, 'epoch': 0.51}
 51%|█████     | 1039/2041 [1:30:21<1:26:20,  5.17s/it]

{'eps': 0, 'objective/kl': 76.50221252441406, 'objective/entropy': 31.27699851989746, 'objective/non_score_reward': -3.825110912322998, 'objective/rlhf_reward': -6.627059459686279, 'objective/scores': -2.8019485473632812, 'policy/approxkl_avg': 0.03024705871939659, 'policy/clipfrac_avg': 0.09080187976360321, 'loss/policy_avg': -0.03270670771598816, 'loss/value_avg': 0.285259485244751, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6229736804962158, 'val/ratio': 0.9884109497070312, 'val/ratio_var': 0.00018304900731891394, 'val/num_eos_tokens': 0, 'lr': 2.457128858402744e-05, 'episode': 4156, 'epoch': 0.51}
 51%|█████     | 1040/2041 [1:30:26<1:26:46,  5.20s/it]

{'eps': 0, 'objective/kl': 84.09089660644531, 'objective/entropy': 28.30459976196289, 'objective/non_score_reward': -4.204545021057129, 'objective/rlhf_reward': -6.776584625244141, 'objective/scores': -2.5720396041870117, 'policy/approxkl_avg': 0.05735030025243759, 'policy/clipfrac_avg': 0.0554245300590992, 'loss/policy_avg': -0.019773587584495544, 'loss/value_avg': 0.2569877505302429, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6768572330474854, 'val/ratio': 0.9942125082015991, 'val/ratio_var': 1.5423504009959288e-05, 'val/num_eos_tokens': 0, 'lr': 2.4546790788829007e-05, 'episode': 4160, 'epoch': 0.51}
 51%|█████     | 1041/2041 [1:30:31<1:26:23,  5.18s/it]

{'eps': 0, 'objective/kl': 73.63123321533203, 'objective/entropy': 18.47418212890625, 'objective/non_score_reward': -3.6815617084503174, 'objective/rlhf_reward': -6.318959712982178, 'objective/scores': -2.6373980045318604, 'policy/approxkl_avg': 0.006967485416680574, 'policy/clipfrac_avg': 0.05070754513144493, 'loss/policy_avg': -0.01987505331635475, 'loss/value_avg': 0.2229318767786026, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.47864431142807007, 'val/ratio': 0.990642786026001, 'val/ratio_var': 5.891238106414676e-05, 'val/num_eos_tokens': 0, 'lr': 2.4522292993630575e-05, 'episode': 4164, 'epoch': 0.51}
 51%|█████     | 1042/2041 [1:30:36<1:26:33,  5.20s/it]

{'eps': 0, 'objective/kl': 84.7270278930664, 'objective/entropy': 33.712799072265625, 'objective/non_score_reward': -4.23635196685791, 'objective/rlhf_reward': -6.292070388793945, 'objective/scores': -2.0557186603546143, 'policy/approxkl_avg': 0.020006868988275528, 'policy/clipfrac_avg': 0.0837264209985733, 'loss/policy_avg': -0.02778146229684353, 'loss/value_avg': 0.40416765213012695, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7143312692642212, 'val/ratio': 0.9937133193016052, 'val/ratio_var': 2.317249345651362e-05, 'val/num_eos_tokens': 0, 'lr': 2.4497795198432143e-05, 'episode': 4168, 'epoch': 0.51}
 51%|█████     | 1043/2041 [1:30:41<1:26:13,  5.18s/it]

{'eps': 0, 'objective/kl': 74.46080017089844, 'objective/entropy': 61.638607025146484, 'objective/non_score_reward': -3.7230401039123535, 'objective/rlhf_reward': -6.270968437194824, 'objective/scores': -2.54792857170105, 'policy/approxkl_avg': 0.035013191401958466, 'policy/clipfrac_avg': 0.11202830076217651, 'loss/policy_avg': -0.04106675088405609, 'loss/value_avg': 0.2854556441307068, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1716116666793823, 'val/ratio': 0.9653063416481018, 'val/ratio_var': 0.0009011672809720039, 'val/num_eos_tokens': 0, 'lr': 2.447329740323371e-05, 'episode': 4172, 'epoch': 0.51}
 51%|█████     | 1044/2041 [1:30:47<1:25:46,  5.16s/it]

{'eps': 0, 'objective/kl': 88.04916381835938, 'objective/entropy': 56.708106994628906, 'objective/non_score_reward': -4.402458190917969, 'objective/rlhf_reward': -6.59530782699585, 'objective/scores': -2.192849636077881, 'policy/approxkl_avg': 0.3855266869068146, 'policy/clipfrac_avg': 0.10023584961891174, 'loss/policy_avg': -0.03747166693210602, 'loss/value_avg': 0.3701133728027344, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2483211755752563, 'val/ratio': 0.9676824808120728, 'val/ratio_var': 0.0006636860780417919, 'val/num_eos_tokens': 0, 'lr': 2.444879960803528e-05, 'episode': 4176, 'epoch': 0.51}
 51%|█████     | 1045/2041 [1:30:52<1:25:37,  5.16s/it]

{'eps': 0, 'objective/kl': 76.78610229492188, 'objective/entropy': 53.50447463989258, 'objective/non_score_reward': -3.8393049240112305, 'objective/rlhf_reward': -6.889928340911865, 'objective/scores': -3.0506234169006348, 'policy/approxkl_avg': 0.015311488881707191, 'policy/clipfrac_avg': 0.08018868416547775, 'loss/policy_avg': -0.034960418939590454, 'loss/value_avg': 0.3322298526763916, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1029484272003174, 'val/ratio': 0.9693616628646851, 'val/ratio_var': 0.0006638495833612978, 'val/num_eos_tokens': 0, 'lr': 2.4424301812836843e-05, 'episode': 4180, 'epoch': 0.51}
 51%|█████     | 1046/2041 [1:30:57<1:25:44,  5.17s/it]

{'eps': 0, 'objective/kl': 92.48284912109375, 'objective/entropy': 101.07025146484375, 'objective/non_score_reward': -4.624142646789551, 'objective/rlhf_reward': -7.293333053588867, 'objective/scores': -2.6691901683807373, 'policy/approxkl_avg': 0.07392092049121857, 'policy/clipfrac_avg': 0.19339622557163239, 'loss/policy_avg': -0.051473468542099, 'loss/value_avg': 0.4384711980819702, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.7852439880371094, 'val/ratio': 0.9614738821983337, 'val/ratio_var': 0.0007603983394801617, 'val/num_eos_tokens': 0, 'lr': 2.4399804017638415e-05, 'episode': 4184, 'epoch': 0.51}
 51%|█████▏    | 1047/2041 [1:31:02<1:25:43,  5.17s/it]

{'eps': 0, 'objective/kl': 73.67036437988281, 'objective/entropy': 92.4719009399414, 'objective/non_score_reward': -3.6835179328918457, 'objective/rlhf_reward': -6.35344123840332, 'objective/scores': -2.6699233055114746, 'policy/approxkl_avg': 0.1031673401594162, 'policy/clipfrac_avg': 0.11438679695129395, 'loss/policy_avg': -0.042202889919281006, 'loss/value_avg': 0.34306854009628296, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.6768959760665894, 'val/ratio': 0.9719275236129761, 'val/ratio_var': 0.00045929846237413585, 'val/num_eos_tokens': 0, 'lr': 2.437530622243998e-05, 'episode': 4188, 'epoch': 0.51}
 51%|█████▏    | 1048/2041 [1:31:07<1:25:20,  5.16s/it]

{'eps': 0, 'objective/kl': 74.02285766601562, 'objective/entropy': 69.48603057861328, 'objective/non_score_reward': -3.7011423110961914, 'objective/rlhf_reward': -5.647276878356934, 'objective/scores': -1.946134328842163, 'policy/approxkl_avg': 0.020806007087230682, 'policy/clipfrac_avg': 0.11438679695129395, 'loss/policy_avg': -0.04260225221514702, 'loss/value_avg': 0.23487147688865662, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0313341617584229, 'val/ratio': 0.9779582619667053, 'val/ratio_var': 0.000298831844702363, 'val/num_eos_tokens': 0, 'lr': 2.435080842724155e-05, 'episode': 4192, 'epoch': 0.51}
 51%|█████▏    | 1049/2041 [1:31:12<1:25:17,  5.16s/it]

{'eps': 0, 'objective/kl': 74.23390197753906, 'objective/entropy': 37.72761917114258, 'objective/non_score_reward': -3.7116951942443848, 'objective/rlhf_reward': -5.728768348693848, 'objective/scores': -2.017072916030884, 'policy/approxkl_avg': 0.07945631444454193, 'policy/clipfrac_avg': 0.09669811278581619, 'loss/policy_avg': -0.02685142122209072, 'loss/value_avg': 0.24739530682563782, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7570717334747314, 'val/ratio': 0.9835657477378845, 'val/ratio_var': 0.00012375552614685148, 'val/num_eos_tokens': 0, 'lr': 2.4326310632043116e-05, 'episode': 4196, 'epoch': 0.51}
 51%|█████▏    | 1050/2041 [1:31:18<1:25:22,  5.17s/it]

{'eps': 0, 'objective/kl': 82.05481719970703, 'objective/entropy': 94.58464813232422, 'objective/non_score_reward': -4.102741241455078, 'objective/rlhf_reward': -6.450536727905273, 'objective/scores': -2.3477957248687744, 'policy/approxkl_avg': 0.03781978785991669, 'policy/clipfrac_avg': 0.1379716992378235, 'loss/policy_avg': -0.04363861680030823, 'loss/value_avg': 0.337988018989563, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5750408172607422, 'val/ratio': 0.97625333070755, 'val/ratio_var': 0.0002977411786559969, 'val/num_eos_tokens': 0, 'lr': 2.4301812836844684e-05, 'episode': 4200, 'epoch': 0.51}
 51%|█████▏    | 1051/2041 [1:31:23<1:25:16,  5.17s/it]

{'eps': 0, 'objective/kl': 84.36044311523438, 'objective/entropy': 91.29804992675781, 'objective/non_score_reward': -4.218021869659424, 'objective/rlhf_reward': -6.957785129547119, 'objective/scores': -2.7397632598876953, 'policy/approxkl_avg': 0.024072004482150078, 'policy/clipfrac_avg': 0.12028302997350693, 'loss/policy_avg': -0.04403363913297653, 'loss/value_avg': 0.4327457547187805, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.7236135005950928, 'val/ratio': 0.9881898164749146, 'val/ratio_var': 8.024423732422292e-05, 'val/num_eos_tokens': 0, 'lr': 2.4277315041646255e-05, 'episode': 4204, 'epoch': 0.52}
 52%|█████▏    | 1052/2041 [1:31:28<1:25:13,  5.17s/it]

{'eps': 0, 'objective/kl': 84.51471710205078, 'objective/entropy': 56.61202621459961, 'objective/non_score_reward': -4.225735664367676, 'objective/rlhf_reward': -6.35856294631958, 'objective/scores': -2.1328272819519043, 'policy/approxkl_avg': 0.019107570871710777, 'policy/clipfrac_avg': 0.10613207519054413, 'loss/policy_avg': -0.03586539626121521, 'loss/value_avg': 0.2765918970108032, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2413372993469238, 'val/ratio': 0.9844183921813965, 'val/ratio_var': 0.00015210884157568216, 'val/num_eos_tokens': 0, 'lr': 2.425281724644782e-05, 'episode': 4208, 'epoch': 0.52}
 52%|█████▏    | 1053/2041 [1:31:33<1:25:16,  5.18s/it]

{'eps': 0, 'objective/kl': 81.83921813964844, 'objective/entropy': 76.52684020996094, 'objective/non_score_reward': -4.091960906982422, 'objective/rlhf_reward': -6.421315670013428, 'objective/scores': -2.329354763031006, 'policy/approxkl_avg': 0.02135842852294445, 'policy/clipfrac_avg': 0.12735849618911743, 'loss/policy_avg': -0.044186338782310486, 'loss/value_avg': 0.35605543851852417, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3615031242370605, 'val/ratio': 0.9890485405921936, 'val/ratio_var': 6.24860476818867e-05, 'val/num_eos_tokens': 0, 'lr': 2.422831945124939e-05, 'episode': 4212, 'epoch': 0.52}
 52%|█████▏    | 1054/2041 [1:31:38<1:25:05,  5.17s/it]

{'eps': 0, 'objective/kl': 75.42481231689453, 'objective/entropy': 79.46534729003906, 'objective/non_score_reward': -3.7712411880493164, 'objective/rlhf_reward': -6.1123366355896, 'objective/scores': -2.341095447540283, 'policy/approxkl_avg': 0.0409269705414772, 'policy/clipfrac_avg': 0.13443395495414734, 'loss/policy_avg': -0.04716833308339119, 'loss/value_avg': 0.31662312150001526, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.541012167930603, 'val/ratio': 0.9545004367828369, 'val/ratio_var': 0.001705759670585394, 'val/num_eos_tokens': 0, 'lr': 2.4203821656050956e-05, 'episode': 4216, 'epoch': 0.52}
 52%|█████▏    | 1055/2041 [1:31:43<1:24:57,  5.17s/it]

{'eps': 0, 'objective/kl': 61.78143310546875, 'objective/entropy': 27.01984405517578, 'objective/non_score_reward': -3.08907151222229, 'objective/rlhf_reward': -6.3244309425354, 'objective/scores': -3.2353594303131104, 'policy/approxkl_avg': 0.02775312401354313, 'policy/clipfrac_avg': 0.07193396240472794, 'loss/policy_avg': -0.02730325236916542, 'loss/value_avg': 0.2492782175540924, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6155558228492737, 'val/ratio': 0.9686524868011475, 'val/ratio_var': 0.0006780903786420822, 'val/num_eos_tokens': 0, 'lr': 2.4179323860852524e-05, 'episode': 4220, 'epoch': 0.52}
 52%|█████▏    | 1056/2041 [1:31:49<1:24:54,  5.17s/it]

{'eps': 0, 'objective/kl': 94.77434539794922, 'objective/entropy': 96.94739532470703, 'objective/non_score_reward': -4.738717079162598, 'objective/rlhf_reward': -6.559785842895508, 'objective/scores': -1.8210690021514893, 'policy/approxkl_avg': 0.036173537373542786, 'policy/clipfrac_avg': 0.18632075190544128, 'loss/policy_avg': -0.058768950402736664, 'loss/value_avg': 0.8118797540664673, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.7748528718948364, 'val/ratio': 0.9794690608978271, 'val/ratio_var': 0.0002251379773952067, 'val/num_eos_tokens': 0, 'lr': 2.4154826065654092e-05, 'episode': 4224, 'epoch': 0.52}
 52%|█████▏    | 1057/2041 [1:31:54<1:24:41,  5.16s/it]

{'eps': 0, 'objective/kl': 90.57669830322266, 'objective/entropy': 101.23637390136719, 'objective/non_score_reward': -4.528835296630859, 'objective/rlhf_reward': -6.917272567749023, 'objective/scores': -2.388437032699585, 'policy/approxkl_avg': 0.03313351795077324, 'policy/clipfrac_avg': 0.17806604504585266, 'loss/policy_avg': -0.05039084702730179, 'loss/value_avg': 0.5692691206932068, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.8022408485412598, 'val/ratio': 1.0092464685440063, 'val/ratio_var': 0.00025103616644628346, 'val/num_eos_tokens': 0, 'lr': 2.413032827045566e-05, 'episode': 4228, 'epoch': 0.52}



                                                                         
 52%|█████▏    | 1058/2041 [1:31:59<1:24:51,  5.18s/it]

{'eps': 0, 'objective/kl': 85.25943756103516, 'objective/entropy': 85.6870346069336, 'objective/non_score_reward': -4.262971878051758, 'objective/rlhf_reward': -6.837924480438232, 'objective/scores': -2.5749526023864746, 'policy/approxkl_avg': 0.04795911908149719, 'policy/clipfrac_avg': 0.17806604504585266, 'loss/policy_avg': -0.04683301970362663, 'loss/value_avg': 0.6678640842437744, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4646083116531372, 'val/ratio': 0.9990507364273071, 'val/ratio_var': 3.803597064688802e-05, 'val/num_eos_tokens': 0, 'lr': 2.4105830475257228e-05, 'episode': 4232, 'epoch': 0.52}



                                                                         
 52%|█████▏    | 1059/2041 [1:32:04<1:24:46,  5.18s/it]

{'eps': 0, 'objective/kl': 68.39075469970703, 'objective/entropy': 50.926963806152344, 'objective/non_score_reward': -3.419537305831909, 'objective/rlhf_reward': -5.292218208312988, 'objective/scores': -1.8726811408996582, 'policy/approxkl_avg': 0.31801658868789673, 'policy/clipfrac_avg': 0.11438678950071335, 'loss/policy_avg': -0.04058656841516495, 'loss/value_avg': 0.3493242859840393, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.024308681488037, 'val/ratio': 0.969802975654602, 'val/ratio_var': 0.00046642482629977167, 'val/num_eos_tokens': 0, 'lr': 2.4081332680058796e-05, 'episode': 4236, 'epoch': 0.52}



                                                                         
 52%|█████▏    | 1060/2041 [1:32:09<1:24:29,  5.17s/it]

{'eps': 0, 'objective/kl': 86.19330596923828, 'objective/entropy': 71.53012084960938, 'objective/non_score_reward': -4.309665203094482, 'objective/rlhf_reward': -6.2864508628845215, 'objective/scores': -1.976785659790039, 'policy/approxkl_avg': 0.02880423702299595, 'policy/clipfrac_avg': 0.13561320304870605, 'loss/policy_avg': -0.041176654398441315, 'loss/value_avg': 0.43662726879119873, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3732974529266357, 'val/ratio': 0.9810952544212341, 'val/ratio_var': 0.00019997982599306852, 'val/num_eos_tokens': 0, 'lr': 2.4056834884860364e-05, 'episode': 4240, 'epoch': 0.52}



                                                                         
 52%|█████▏    | 1061/2041 [1:32:14<1:23:46,  5.13s/it]

{'eps': 0, 'objective/kl': 64.19062805175781, 'objective/entropy': 92.16968536376953, 'objective/non_score_reward': -3.209531307220459, 'objective/rlhf_reward': -5.667621612548828, 'objective/scores': -2.458090305328369, 'policy/approxkl_avg': 0.06149541959166527, 'policy/clipfrac_avg': 0.1533018797636032, 'loss/policy_avg': -0.04541284590959549, 'loss/value_avg': 0.2634925842285156, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5775508880615234, 'val/ratio': 1.1439659595489502, 'val/ratio_var': 0.030721411108970642, 'val/num_eos_tokens': 0, 'lr': 2.4032337089661932e-05, 'episode': 4244, 'epoch': 0.52}



                                                                         
 52%|█████▏    | 1062/2041 [1:32:19<1:23:45,  5.13s/it]

{'eps': 0, 'objective/kl': 62.094940185546875, 'objective/entropy': 26.44683837890625, 'objective/non_score_reward': -3.1047472953796387, 'objective/rlhf_reward': -5.575492858886719, 'objective/scores': -2.470745325088501, 'policy/approxkl_avg': 0.01172777358442545, 'policy/clipfrac_avg': 0.05896226316690445, 'loss/policy_avg': -0.02168862707912922, 'loss/value_avg': 0.17729336023330688, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6281195878982544, 'val/ratio': 0.9842109680175781, 'val/ratio_var': 0.00019252172205597162, 'val/num_eos_tokens': 0, 'lr': 2.40078392944635e-05, 'episode': 4248, 'epoch': 0.52}



                                                                         
 52%|█████▏    | 1063/2041 [1:32:25<1:23:47,  5.14s/it]

{'eps': 0, 'objective/kl': 83.55439758300781, 'objective/entropy': 82.25450897216797, 'objective/non_score_reward': -4.177720069885254, 'objective/rlhf_reward': -7.019042015075684, 'objective/scores': -2.8413219451904297, 'policy/approxkl_avg': 0.030763253569602966, 'policy/clipfrac_avg': 0.1662735790014267, 'loss/policy_avg': -0.049051787704229355, 'loss/value_avg': 0.5555107593536377, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5888962745666504, 'val/ratio': 1.0091478824615479, 'val/ratio_var': 0.00017839022621046752, 'val/num_eos_tokens': 0, 'lr': 2.3983341499265065e-05, 'episode': 4252, 'epoch': 0.52}



                                                                         
 52%|█████▏    | 1064/2041 [1:32:30<1:24:14,  5.17s/it]

{'eps': 0, 'objective/kl': 71.05072021484375, 'objective/entropy': 62.54670333862305, 'objective/non_score_reward': -3.5525360107421875, 'objective/rlhf_reward': -6.381986618041992, 'objective/scores': -2.829450845718384, 'policy/approxkl_avg': 0.026771046221256256, 'policy/clipfrac_avg': 0.10141509026288986, 'loss/policy_avg': -0.03616788610816002, 'loss/value_avg': 0.24265363812446594, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.166811466217041, 'val/ratio': 0.9699904322624207, 'val/ratio_var': 0.0005891678738407791, 'val/num_eos_tokens': 0, 'lr': 2.3958843704066636e-05, 'episode': 4256, 'epoch': 0.52}



                                                                         
 52%|█████▏    | 1065/2041 [1:32:35<1:24:10,  5.17s/it]

{'eps': 0, 'objective/kl': 76.76538848876953, 'objective/entropy': 55.93437957763672, 'objective/non_score_reward': -3.8382697105407715, 'objective/rlhf_reward': -5.833080291748047, 'objective/scores': -1.994810700416565, 'policy/approxkl_avg': 0.019777212291955948, 'policy/clipfrac_avg': 0.11674528568983078, 'loss/policy_avg': -0.041327692568302155, 'loss/value_avg': 0.29058393836021423, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0784704685211182, 'val/ratio': 0.9815101623535156, 'val/ratio_var': 0.00021247709810268134, 'val/num_eos_tokens': 0, 'lr': 2.39343459088682e-05, 'episode': 4260, 'epoch': 0.52}



                                                                         
 52%|█████▏    | 1066/2041 [1:32:40<1:24:20,  5.19s/it]

{'eps': 0, 'objective/kl': 71.3861312866211, 'objective/entropy': 45.536659240722656, 'objective/non_score_reward': -3.5693068504333496, 'objective/rlhf_reward': -5.638872146606445, 'objective/scores': -2.0695650577545166, 'policy/approxkl_avg': 0.02079177089035511, 'policy/clipfrac_avg': 0.10495283454656601, 'loss/policy_avg': -0.03751083463430405, 'loss/value_avg': 0.29917168617248535, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9984374046325684, 'val/ratio': 0.9585934281349182, 'val/ratio_var': 0.0014703203924000263, 'val/num_eos_tokens': 0, 'lr': 2.3909848113669772e-05, 'episode': 4264, 'epoch': 0.52}



                                                                         
 52%|█████▏    | 1067/2041 [1:32:45<1:24:01,  5.18s/it]

{'eps': 0, 'objective/kl': 86.2308120727539, 'objective/entropy': 96.62527465820312, 'objective/non_score_reward': -4.311540603637695, 'objective/rlhf_reward': -6.744231224060059, 'objective/scores': -2.432690382003784, 'policy/approxkl_avg': 0.03137023001909256, 'policy/clipfrac_avg': 0.1591981202363968, 'loss/policy_avg': -0.043642885982990265, 'loss/value_avg': 0.3917842209339142, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.762202501296997, 'val/ratio': 1.0029616355895996, 'val/ratio_var': 0.00016366133058909327, 'val/num_eos_tokens': 0, 'lr': 2.388535031847134e-05, 'episode': 4268, 'epoch': 0.52}



                                                                         
 52%|█████▏    | 1068/2041 [1:32:51<1:24:26,  5.21s/it]

{'eps': 0, 'objective/kl': 91.1114501953125, 'objective/entropy': 144.76220703125, 'objective/non_score_reward': -4.555572509765625, 'objective/rlhf_reward': -6.281889915466309, 'objective/scores': -1.7263171672821045, 'policy/approxkl_avg': 0.088010273873806, 'policy/clipfrac_avg': 0.17099055647850037, 'loss/policy_avg': -0.04952432960271835, 'loss/value_avg': 0.34050047397613525, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 2.4800498485565186, 'val/ratio': 1.0060279369354248, 'val/ratio_var': 0.00011632390669547021, 'val/num_eos_tokens': 0, 'lr': 2.3860852523272905e-05, 'episode': 4272, 'epoch': 0.52}



                                                                         
 52%|█████▏    | 1069/2041 [1:32:56<1:24:00,  5.19s/it]

{'eps': 0, 'objective/kl': 85.04664611816406, 'objective/entropy': 76.94783020019531, 'objective/non_score_reward': -4.2523322105407715, 'objective/rlhf_reward': -6.4510111808776855, 'objective/scores': -2.198678970336914, 'policy/approxkl_avg': 0.03202187642455101, 'policy/clipfrac_avg': 0.1391509473323822, 'loss/policy_avg': -0.042319733649492264, 'loss/value_avg': 0.420833945274353, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4816938638687134, 'val/ratio': 0.9739120006561279, 'val/ratio_var': 0.000400125136366114, 'val/num_eos_tokens': 0, 'lr': 2.3836354728074476e-05, 'episode': 4276, 'epoch': 0.52}



                                                                         
 52%|█████▏    | 1070/2041 [1:33:01<1:24:15,  5.21s/it]

{'eps': 0, 'objective/kl': 91.15025329589844, 'objective/entropy': 104.22750854492188, 'objective/non_score_reward': -4.557512283325195, 'objective/rlhf_reward': -6.6175408363342285, 'objective/scores': -2.060028553009033, 'policy/approxkl_avg': 0.025655604898929596, 'policy/clipfrac_avg': 0.1379716992378235, 'loss/policy_avg': -0.04666822403669357, 'loss/value_avg': 0.30326780676841736, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 2.024705410003662, 'val/ratio': 0.976836621761322, 'val/ratio_var': 0.00031298099202103913, 'val/num_eos_tokens': 0, 'lr': 2.381185693287604e-05, 'episode': 4280, 'epoch': 0.52}



                                                                         
 52%|█████▏    | 1071/2041 [1:33:06<1:23:52,  5.19s/it]

{'eps': 0, 'objective/kl': 55.90986251831055, 'objective/entropy': 18.896102905273438, 'objective/non_score_reward': -2.7954931259155273, 'objective/rlhf_reward': -5.454936981201172, 'objective/scores': -2.6594438552856445, 'policy/approxkl_avg': 0.010639294050633907, 'policy/clipfrac_avg': 0.03655660152435303, 'loss/policy_avg': -0.014881609007716179, 'loss/value_avg': 0.21982210874557495, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.3760577440261841, 'val/ratio': 0.9908305406570435, 'val/ratio_var': 5.093026629765518e-05, 'val/num_eos_tokens': 0, 'lr': 2.378735913767761e-05, 'episode': 4284, 'epoch': 0.52}



                                                                         
 53%|█████▎    | 1072/2041 [1:33:11<1:23:30,  5.17s/it]

{'eps': 0, 'objective/kl': 80.90352630615234, 'objective/entropy': 101.2801513671875, 'objective/non_score_reward': -4.0451765060424805, 'objective/rlhf_reward': -6.38433837890625, 'objective/scores': -2.3391621112823486, 'policy/approxkl_avg': 0.07529760897159576, 'policy/clipfrac_avg': 0.15566037595272064, 'loss/policy_avg': -0.04472939297556877, 'loss/value_avg': 0.331574410200119, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.7896168231964111, 'val/ratio': 0.958962082862854, 'val/ratio_var': 0.0009524056222289801, 'val/num_eos_tokens': 0, 'lr': 2.3762861342479177e-05, 'episode': 4288, 'epoch': 0.53}



                                                                         
 53%|█████▎    | 1073/2041 [1:33:17<1:23:09,  5.15s/it]

{'eps': 0, 'objective/kl': 96.5574951171875, 'objective/entropy': 62.013763427734375, 'objective/non_score_reward': -4.827875137329102, 'objective/rlhf_reward': -6.350481033325195, 'objective/scores': -1.5226057767868042, 'policy/approxkl_avg': 0.1339299976825714, 'policy/clipfrac_avg': 0.14033019542694092, 'loss/policy_avg': -0.04320815950632095, 'loss/value_avg': 0.35655462741851807, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.201245903968811, 'val/ratio': 0.9686603546142578, 'val/ratio_var': 0.0004862059431616217, 'val/num_eos_tokens': 0, 'lr': 2.3738363547280745e-05, 'episode': 4292, 'epoch': 0.53}



                                                                         
 53%|█████▎    | 1074/2041 [1:33:22<1:23:19,  5.17s/it]

{'eps': 0, 'objective/kl': 104.96253967285156, 'objective/entropy': 148.191650390625, 'objective/non_score_reward': -5.248126983642578, 'objective/rlhf_reward': -7.714399337768555, 'objective/scores': -2.4662725925445557, 'policy/approxkl_avg': 0.24528464674949646, 'policy/clipfrac_avg': 0.21933960914611816, 'loss/policy_avg': -0.06189227104187012, 'loss/value_avg': 0.5440875291824341, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 2.577848196029663, 'val/ratio': 1.0523982048034668, 'val/ratio_var': 0.004195273853838444, 'val/num_eos_tokens': 0, 'lr': 2.3713865752082313e-05, 'episode': 4296, 'epoch': 0.53}



                                                                         
 53%|█████▎    | 1075/2041 [1:33:27<1:23:26,  5.18s/it]

{'eps': 0, 'objective/kl': 106.55642700195312, 'objective/entropy': 90.2254638671875, 'objective/non_score_reward': -5.327821731567383, 'objective/rlhf_reward': -7.1542205810546875, 'objective/scores': -1.8263986110687256, 'policy/approxkl_avg': 0.038200490176677704, 'policy/clipfrac_avg': 0.16273584961891174, 'loss/policy_avg': -0.046231724321842194, 'loss/value_avg': 0.6256521940231323, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.7486884593963623, 'val/ratio': 0.9749575853347778, 'val/ratio_var': 0.00042652260162867606, 'val/num_eos_tokens': 0, 'lr': 2.368936795688388e-05, 'episode': 4300, 'epoch': 0.53}



                                                                         
 53%|█████▎    | 1076/2041 [1:33:32<1:23:46,  5.21s/it]

{'eps': 0, 'objective/kl': 72.88691711425781, 'objective/entropy': 62.59795379638672, 'objective/non_score_reward': -3.644345760345459, 'objective/rlhf_reward': -6.213776111602783, 'objective/scores': -2.569430351257324, 'policy/approxkl_avg': 0.018553178757429123, 'policy/clipfrac_avg': 0.09551886469125748, 'loss/policy_avg': -0.032026130706071854, 'loss/value_avg': 0.3735852539539337, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.296097755432129, 'val/ratio': 0.973263144493103, 'val/ratio_var': 0.0004767963255289942, 'val/num_eos_tokens': 0, 'lr': 2.366487016168545e-05, 'episode': 4304, 'epoch': 0.53}



                                                                         
 53%|█████▎    | 1077/2041 [1:33:37<1:23:18,  5.19s/it]

{'eps': 0, 'objective/kl': 82.59798431396484, 'objective/entropy': 82.50425720214844, 'objective/non_score_reward': -4.129899024963379, 'objective/rlhf_reward': -5.681808948516846, 'objective/scores': -1.5519099235534668, 'policy/approxkl_avg': 0.01489318534731865, 'policy/clipfrac_avg': 0.12853772938251495, 'loss/policy_avg': -0.0429898165166378, 'loss/value_avg': 0.5004119873046875, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.6184040307998657, 'val/ratio': 0.9819586277008057, 'val/ratio_var': 0.00023591110948473215, 'val/num_eos_tokens': 0, 'lr': 2.3640372366487017e-05, 'episode': 4308, 'epoch': 0.53}



                                                                         
 53%|█████▎    | 1078/2041 [1:33:43<1:23:11,  5.18s/it]

{'eps': 0, 'objective/kl': 77.7327880859375, 'objective/entropy': 90.43754577636719, 'objective/non_score_reward': -3.886639356613159, 'objective/rlhf_reward': -6.024770736694336, 'objective/scores': -2.1381313800811768, 'policy/approxkl_avg': 0.05094391852617264, 'policy/clipfrac_avg': 0.16155660152435303, 'loss/policy_avg': -0.047543592751026154, 'loss/value_avg': 0.332655131816864, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.701874017715454, 'val/ratio': 0.9797472953796387, 'val/ratio_var': 0.00021582217596005648, 'val/num_eos_tokens': 0, 'lr': 2.3615874571288585e-05, 'episode': 4312, 'epoch': 0.53}



                                                                         
 53%|█████▎    | 1079/2041 [1:33:48<1:23:29,  5.21s/it]

{'eps': 0, 'objective/kl': 83.25868225097656, 'objective/entropy': 73.65997314453125, 'objective/non_score_reward': -4.162934303283691, 'objective/rlhf_reward': -5.920654296875, 'objective/scores': -1.7577199935913086, 'policy/approxkl_avg': 0.03505740314722061, 'policy/clipfrac_avg': 0.12735849618911743, 'loss/policy_avg': -0.04058519005775452, 'loss/value_avg': 0.3092876672744751, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5447860956192017, 'val/ratio': 0.9844321012496948, 'val/ratio_var': 0.000122120589367114, 'val/num_eos_tokens': 0, 'lr': 2.3591376776090154e-05, 'episode': 4316, 'epoch': 0.53}



                                                                         
 53%|█████▎    | 1080/2041 [1:33:53<1:23:11,  5.19s/it]

{'eps': 0, 'objective/kl': 77.8570327758789, 'objective/entropy': 69.071044921875, 'objective/non_score_reward': -3.8928515911102295, 'objective/rlhf_reward': -5.838881969451904, 'objective/scores': -1.9460303783416748, 'policy/approxkl_avg': 0.037989139556884766, 'policy/clipfrac_avg': 0.11438678950071335, 'loss/policy_avg': -0.03307979926466942, 'loss/value_avg': 0.3189425468444824, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3216376304626465, 'val/ratio': 0.9722515344619751, 'val/ratio_var': 0.00045614290866069496, 'val/num_eos_tokens': 0, 'lr': 2.356687898089172e-05, 'episode': 4320, 'epoch': 0.53}



                                                                         
 53%|█████▎    | 1081/2041 [1:33:58<1:23:04,  5.19s/it]

{'eps': 0, 'objective/kl': 74.75534057617188, 'objective/entropy': 64.99253845214844, 'objective/non_score_reward': -3.737766981124878, 'objective/rlhf_reward': -5.650284767150879, 'objective/scores': -1.912517786026001, 'policy/approxkl_avg': 0.029871560633182526, 'policy/clipfrac_avg': 0.13561320304870605, 'loss/policy_avg': -0.04591258242726326, 'loss/value_avg': 0.3236848711967468, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4030687808990479, 'val/ratio': 0.9763765335083008, 'val/ratio_var': 0.0004383774648886174, 'val/num_eos_tokens': 0, 'lr': 2.354238118569329e-05, 'episode': 4324, 'epoch': 0.53}



                                                                         
 53%|█████▎    | 1082/2041 [1:34:03<1:22:48,  5.18s/it]

{'eps': 0, 'objective/kl': 86.02638244628906, 'objective/entropy': 94.21773529052734, 'objective/non_score_reward': -4.301319122314453, 'objective/rlhf_reward': -6.675174236297607, 'objective/scores': -2.3738551139831543, 'policy/approxkl_avg': 0.025170570239424706, 'policy/clipfrac_avg': 0.12264150381088257, 'loss/policy_avg': -0.04279588162899017, 'loss/value_avg': 0.5220612287521362, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.725452184677124, 'val/ratio': 0.9531625509262085, 'val/ratio_var': 0.0015931882662698627, 'val/num_eos_tokens': 0, 'lr': 2.3517883390494858e-05, 'episode': 4328, 'epoch': 0.53}



                                                                         
 53%|█████▎    | 1083/2041 [1:34:08<1:22:09,  5.15s/it]

{'eps': 0, 'objective/kl': 72.64784240722656, 'objective/entropy': 101.02767944335938, 'objective/non_score_reward': -3.632392406463623, 'objective/rlhf_reward': -6.242411136627197, 'objective/scores': -2.610018730163574, 'policy/approxkl_avg': 0.03771108761429787, 'policy/clipfrac_avg': 0.16981132328510284, 'loss/policy_avg': -0.04799521714448929, 'loss/value_avg': 0.3857073485851288, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.879770278930664, 'val/ratio': 0.993453323841095, 'val/ratio_var': 2.657579352671746e-05, 'val/num_eos_tokens': 0, 'lr': 2.3493385595296426e-05, 'episode': 4332, 'epoch': 0.53}



                                                                         
 53%|█████▎    | 1084/2041 [1:34:13<1:22:00,  5.14s/it]

{'eps': 0, 'objective/kl': 60.55789566040039, 'objective/entropy': 66.5953140258789, 'objective/non_score_reward': -3.027894973754883, 'objective/rlhf_reward': -4.798130989074707, 'objective/scores': -1.7702360153198242, 'policy/approxkl_avg': 0.04071924090385437, 'policy/clipfrac_avg': 0.1391509473323822, 'loss/policy_avg': -0.04583175107836723, 'loss/value_avg': 0.26979589462280273, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.285158395767212, 'val/ratio': 1.058876633644104, 'val/ratio_var': 0.0027769103180617094, 'val/num_eos_tokens': 0, 'lr': 2.346888780009799e-05, 'episode': 4336, 'epoch': 0.53}



                                                                         
 53%|█████▎    | 1085/2041 [1:34:19<1:22:19,  5.17s/it]

{'eps': 0, 'objective/kl': 60.896629333496094, 'objective/entropy': 42.671302795410156, 'objective/non_score_reward': -3.0448317527770996, 'objective/rlhf_reward': -5.451972007751465, 'objective/scores': -2.407140016555786, 'policy/approxkl_avg': 0.01628243364393711, 'policy/clipfrac_avg': 0.09905660152435303, 'loss/policy_avg': -0.03538908436894417, 'loss/value_avg': 0.27003738284111023, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8496895432472229, 'val/ratio': 0.9850926399230957, 'val/ratio_var': 0.00014143514272291213, 'val/num_eos_tokens': 0, 'lr': 2.3444390004899562e-05, 'episode': 4340, 'epoch': 0.53}



                                                                         
 53%|█████▎    | 1086/2041 [1:34:24<1:22:01,  5.15s/it]

{'eps': 0, 'objective/kl': 66.96441650390625, 'objective/entropy': 70.83642578125, 'objective/non_score_reward': -3.3482210636138916, 'objective/rlhf_reward': -6.718331336975098, 'objective/scores': -3.370110511779785, 'policy/approxkl_avg': 0.029960718005895615, 'policy/clipfrac_avg': 0.11556603759527206, 'loss/policy_avg': -0.04131263494491577, 'loss/value_avg': 0.41424989700317383, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.355622410774231, 'val/ratio': 0.9821598529815674, 'val/ratio_var': 0.00018611596897244453, 'val/num_eos_tokens': 0, 'lr': 2.3419892209701126e-05, 'episode': 4344, 'epoch': 0.53}



                                                                         
 53%|█████▎    | 1087/2041 [1:34:29<1:21:55,  5.15s/it]

{'eps': 0, 'objective/kl': 69.07990264892578, 'objective/entropy': 79.88213348388672, 'objective/non_score_reward': -3.4539952278137207, 'objective/rlhf_reward': -5.213202476501465, 'objective/scores': -1.7592074871063232, 'policy/approxkl_avg': 0.10663404315710068, 'policy/clipfrac_avg': 0.14033019542694092, 'loss/policy_avg': -0.045230068266391754, 'loss/value_avg': 0.3461853265762329, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5049161911010742, 'val/ratio': 0.9778552055358887, 'val/ratio_var': 0.00025145953986793756, 'val/num_eos_tokens': 0, 'lr': 2.3395394414502698e-05, 'episode': 4348, 'epoch': 0.53}



                                                                         
 53%|█████▎    | 1088/2041 [1:34:34<1:22:02,  5.17s/it]

{'eps': 0, 'objective/kl': 62.81346130371094, 'objective/entropy': 35.044925689697266, 'objective/non_score_reward': -3.1406729221343994, 'objective/rlhf_reward': -4.849259376525879, 'objective/scores': -1.7085866928100586, 'policy/approxkl_avg': 0.015893448144197464, 'policy/clipfrac_avg': 0.06839622557163239, 'loss/policy_avg': -0.03129080310463905, 'loss/value_avg': 0.3151260018348694, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8005666732788086, 'val/ratio': 0.984177827835083, 'val/ratio_var': 0.00013911713904235512, 'val/num_eos_tokens': 0, 'lr': 2.3370896619304263e-05, 'episode': 4352, 'epoch': 0.53}



                                                                         
 53%|█████▎    | 1089/2041 [1:34:39<1:21:38,  5.15s/it]

{'eps': 0, 'objective/kl': 65.83441162109375, 'objective/entropy': 63.4266242980957, 'objective/non_score_reward': -3.2917211055755615, 'objective/rlhf_reward': -4.630954265594482, 'objective/scores': -1.339233160018921, 'policy/approxkl_avg': 0.018903467804193497, 'policy/clipfrac_avg': 0.10023584961891174, 'loss/policy_avg': -0.035819411277770996, 'loss/value_avg': 0.24092644453048706, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0786330699920654, 'val/ratio': 0.9892323017120361, 'val/ratio_var': 6.259081419557333e-05, 'val/num_eos_tokens': 0, 'lr': 2.334639882410583e-05, 'episode': 4356, 'epoch': 0.53}



                                                                         
 53%|█████▎    | 1090/2041 [1:34:44<1:21:49,  5.16s/it]

{'eps': 0, 'objective/kl': 90.70060729980469, 'objective/entropy': 105.86756896972656, 'objective/non_score_reward': -4.535030364990234, 'objective/rlhf_reward': -6.479025363922119, 'objective/scores': -1.9439948797225952, 'policy/approxkl_avg': 0.07183337956666946, 'policy/clipfrac_avg': 0.19811321794986725, 'loss/policy_avg': -0.05552371218800545, 'loss/value_avg': 0.410997211933136, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.8317768573760986, 'val/ratio': 0.9862911701202393, 'val/ratio_var': 9.20772145036608e-05, 'val/num_eos_tokens': 0, 'lr': 2.33219010289074e-05, 'episode': 4360, 'epoch': 0.53}



                                                                         
 53%|█████▎    | 1091/2041 [1:34:50<1:21:45,  5.16s/it]

{'eps': 0, 'objective/kl': 74.70278930664062, 'objective/entropy': 54.98536682128906, 'objective/non_score_reward': -3.7351396083831787, 'objective/rlhf_reward': -5.656717300415039, 'objective/scores': -1.9215779304504395, 'policy/approxkl_avg': 0.04145178198814392, 'policy/clipfrac_avg': 0.10966981202363968, 'loss/policy_avg': -0.03380567580461502, 'loss/value_avg': 0.31575238704681396, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0729589462280273, 'val/ratio': 0.9777117371559143, 'val/ratio_var': 0.0003331834450364113, 'val/num_eos_tokens': 0, 'lr': 2.3297403233708967e-05, 'episode': 4364, 'epoch': 0.53}



                                                                         
 54%|█████▎    | 1092/2041 [1:34:55<1:21:53,  5.18s/it]

{'eps': 0, 'objective/kl': 52.34318542480469, 'objective/entropy': 32.85725784301758, 'objective/non_score_reward': -2.617159128189087, 'objective/rlhf_reward': -4.981255531311035, 'objective/scores': -2.3640966415405273, 'policy/approxkl_avg': 0.18457821011543274, 'policy/clipfrac_avg': 0.11202830076217651, 'loss/policy_avg': -0.038403358310461044, 'loss/value_avg': 0.2422022819519043, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8071831464767456, 'val/ratio': 0.9674760699272156, 'val/ratio_var': 0.0006195897003635764, 'val/num_eos_tokens': 0, 'lr': 2.3272905438510535e-05, 'episode': 4368, 'epoch': 0.54}



                                                                         
 54%|█████▎    | 1093/2041 [1:35:00<1:21:32,  5.16s/it]

{'eps': 0, 'objective/kl': 63.02457809448242, 'objective/entropy': 38.34362030029297, 'objective/non_score_reward': -3.151228904724121, 'objective/rlhf_reward': -5.489846706390381, 'objective/scores': -2.3386178016662598, 'policy/approxkl_avg': 0.029934657737612724, 'policy/clipfrac_avg': 0.08490566164255142, 'loss/policy_avg': -0.0352407768368721, 'loss/value_avg': 0.20980574190616608, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.71510249376297, 'val/ratio': 0.9844317436218262, 'val/ratio_var': 0.00011805738176917657, 'val/num_eos_tokens': 0, 'lr': 2.3248407643312103e-05, 'episode': 4372, 'epoch': 0.54}



                                                                         
 54%|█████▎    | 1094/2041 [1:35:05<1:22:01,  5.20s/it]

{'eps': 0, 'objective/kl': 65.84454345703125, 'objective/entropy': 61.320396423339844, 'objective/non_score_reward': -3.292227268218994, 'objective/rlhf_reward': -5.991988182067871, 'objective/scores': -2.699761152267456, 'policy/approxkl_avg': 0.06566262990236282, 'policy/clipfrac_avg': 0.13679245114326477, 'loss/policy_avg': -0.04652957245707512, 'loss/value_avg': 0.37412381172180176, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1231555938720703, 'val/ratio': 0.9903370141983032, 'val/ratio_var': 0.0001316240814048797, 'val/num_eos_tokens': 0, 'lr': 2.322390984811367e-05, 'episode': 4376, 'epoch': 0.54}



                                                                         
 54%|█████▎    | 1095/2041 [1:35:10<1:21:45,  5.19s/it]

{'eps': 0, 'objective/kl': 45.587013244628906, 'objective/entropy': 7.026614189147949, 'objective/non_score_reward': -2.279350757598877, 'objective/rlhf_reward': -4.068841934204102, 'objective/scores': -1.789491057395935, 'policy/approxkl_avg': 0.022546961903572083, 'policy/clipfrac_avg': 0.021226415410637856, 'loss/policy_avg': -0.01653249002993107, 'loss/value_avg': 0.13999854028224945, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.25423234701156616, 'val/ratio': 0.997204065322876, 'val/ratio_var': 5.290663466439582e-06, 'val/num_eos_tokens': 0, 'lr': 2.319941205291524e-05, 'episode': 4380, 'epoch': 0.54}



                                                                         
 54%|█████▎    | 1096/2041 [1:35:16<1:21:18,  5.16s/it]

{'eps': 0, 'objective/kl': 76.80506134033203, 'objective/entropy': 74.36775207519531, 'objective/non_score_reward': -3.8402533531188965, 'objective/rlhf_reward': -5.604285717010498, 'objective/scores': -1.7640324831008911, 'policy/approxkl_avg': 0.040264297276735306, 'policy/clipfrac_avg': 0.14150942862033844, 'loss/policy_avg': -0.042201194912195206, 'loss/value_avg': 0.4602265954017639, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3061327934265137, 'val/ratio': 0.9581055641174316, 'val/ratio_var': 0.0011519064428284764, 'val/num_eos_tokens': 0, 'lr': 2.3174914257716807e-05, 'episode': 4384, 'epoch': 0.54}



                                                                         
 54%|█████▎    | 1097/2041 [1:35:21<1:21:08,  5.16s/it]

{'eps': 0, 'objective/kl': 75.5290756225586, 'objective/entropy': 60.54222869873047, 'objective/non_score_reward': -3.7764534950256348, 'objective/rlhf_reward': -5.719525337219238, 'objective/scores': -1.9430720806121826, 'policy/approxkl_avg': 0.13357305526733398, 'policy/clipfrac_avg': 0.1391509473323822, 'loss/policy_avg': -0.0435868464410305, 'loss/value_avg': 0.39054811000823975, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1344135999679565, 'val/ratio': 1.0037479400634766, 'val/ratio_var': 8.1819154729601e-05, 'val/num_eos_tokens': 0, 'lr': 2.3150416462518375e-05, 'episode': 4388, 'epoch': 0.54}



                                                                         
 54%|█████▍    | 1098/2041 [1:35:26<1:21:15,  5.17s/it]

{'eps': 0, 'objective/kl': 71.46312713623047, 'objective/entropy': 68.8196792602539, 'objective/non_score_reward': -3.5731563568115234, 'objective/rlhf_reward': -5.512243747711182, 'objective/scores': -1.9390872716903687, 'policy/approxkl_avg': 0.022312529385089874, 'policy/clipfrac_avg': 0.1320754587650299, 'loss/policy_avg': -0.04516870900988579, 'loss/value_avg': 0.2739359140396118, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2125835418701172, 'val/ratio': 0.9756274223327637, 'val/ratio_var': 0.0003351108462084085, 'val/num_eos_tokens': 0, 'lr': 2.3125918667319943e-05, 'episode': 4392, 'epoch': 0.54}



                                                                         
 54%|█████▍    | 1099/2041 [1:35:31<1:21:19,  5.18s/it]

{'eps': 0, 'objective/kl': 56.36383056640625, 'objective/entropy': 26.03729248046875, 'objective/non_score_reward': -2.8181915283203125, 'objective/rlhf_reward': -5.230959415435791, 'objective/scores': -2.4127678871154785, 'policy/approxkl_avg': 0.01946183852851391, 'policy/clipfrac_avg': 0.05896226689219475, 'loss/policy_avg': -0.022400636225938797, 'loss/value_avg': 0.18788400292396545, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5211687088012695, 'val/ratio': 0.9829515218734741, 'val/ratio_var': 0.00018667825497686863, 'val/num_eos_tokens': 0, 'lr': 2.310142087212151e-05, 'episode': 4396, 'epoch': 0.54}



                                                                         
 54%|█████▍    | 1100/2041 [1:35:36<1:21:31,  5.20s/it]

{'eps': 0, 'objective/kl': 75.87126159667969, 'objective/entropy': 59.13303756713867, 'objective/non_score_reward': -3.7935633659362793, 'objective/rlhf_reward': -6.044830322265625, 'objective/scores': -2.2512667179107666, 'policy/approxkl_avg': 0.012183993123471737, 'policy/clipfrac_avg': 0.08726415038108826, 'loss/policy_avg': -0.035025276243686676, 'loss/value_avg': 0.49500954151153564, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1199815273284912, 'val/ratio': 0.9834101796150208, 'val/ratio_var': 0.00018958245345856994, 'val/num_eos_tokens': 0, 'lr': 2.307692307692308e-05, 'episode': 4400, 'epoch': 0.54}



                                                                         
 54%|█████▍    | 1101/2041 [1:35:41<1:21:15,  5.19s/it]

{'eps': 0, 'objective/kl': 75.25379180908203, 'objective/entropy': 51.837127685546875, 'objective/non_score_reward': -3.7626895904541016, 'objective/rlhf_reward': -6.477081775665283, 'objective/scores': -2.7143921852111816, 'policy/approxkl_avg': 0.05762114375829697, 'policy/clipfrac_avg': 0.15094339847564697, 'loss/policy_avg': -0.04215308651328087, 'loss/value_avg': 0.3807990252971649, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0155166387557983, 'val/ratio': 0.9765961170196533, 'val/ratio_var': 0.0002769275160972029, 'val/num_eos_tokens': 0, 'lr': 2.3052425281724647e-05, 'episode': 4404, 'epoch': 0.54}



                                                                         
 54%|█████▍    | 1102/2041 [1:35:47<1:21:26,  5.20s/it]

{'eps': 0, 'objective/kl': 78.7530517578125, 'objective/entropy': 67.15225982666016, 'objective/non_score_reward': -3.937653064727783, 'objective/rlhf_reward': -6.877985954284668, 'objective/scores': -2.9403328895568848, 'policy/approxkl_avg': 0.02534281648695469, 'policy/clipfrac_avg': 0.13443395495414734, 'loss/policy_avg': -0.04040016233921051, 'loss/value_avg': 0.4985482096672058, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2016806602478027, 'val/ratio': 0.9955330491065979, 'val/ratio_var': 2.4907240003813058e-05, 'val/num_eos_tokens': 0, 'lr': 2.3027927486526212e-05, 'episode': 4408, 'epoch': 0.54}



                                                                         
 54%|█████▍    | 1103/2041 [1:35:52<1:20:42,  5.16s/it]

{'eps': 0, 'objective/kl': 67.87065887451172, 'objective/entropy': 32.82426452636719, 'objective/non_score_reward': -3.393533229827881, 'objective/rlhf_reward': -5.150915622711182, 'objective/scores': -1.7573823928833008, 'policy/approxkl_avg': 0.025215324014425278, 'policy/clipfrac_avg': 0.07075472176074982, 'loss/policy_avg': -0.02516445517539978, 'loss/value_avg': 0.3376352787017822, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.738529622554779, 'val/ratio': 0.9839299917221069, 'val/ratio_var': 0.00015489461657125503, 'val/num_eos_tokens': 0, 'lr': 2.3003429691327783e-05, 'episode': 4412, 'epoch': 0.54}



                                                                         
 54%|█████▍    | 1104/2041 [1:35:57<1:20:19,  5.14s/it]

{'eps': 0, 'objective/kl': 82.4150390625, 'objective/entropy': 54.49560546875, 'objective/non_score_reward': -4.120752334594727, 'objective/rlhf_reward': -6.407401084899902, 'objective/scores': -2.286648750305176, 'policy/approxkl_avg': 0.016655687242746353, 'policy/clipfrac_avg': 0.10613208264112473, 'loss/policy_avg': -0.03695398196578026, 'loss/value_avg': 0.41561567783355713, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1546862125396729, 'val/ratio': 0.9758739471435547, 'val/ratio_var': 0.0003840189892798662, 'val/num_eos_tokens': 0, 'lr': 2.2978931896129348e-05, 'episode': 4416, 'epoch': 0.54}



                                                                         
 54%|█████▍    | 1105/2041 [1:36:02<1:20:26,  5.16s/it]

{'eps': 0, 'objective/kl': 65.01272583007812, 'objective/entropy': 36.0026969909668, 'objective/non_score_reward': -3.250636339187622, 'objective/rlhf_reward': -6.037096977233887, 'objective/scores': -2.7864608764648438, 'policy/approxkl_avg': 0.016198577359318733, 'policy/clipfrac_avg': 0.07429245859384537, 'loss/policy_avg': -0.030158789828419685, 'loss/value_avg': 0.19342733919620514, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6599485874176025, 'val/ratio': 0.9828436374664307, 'val/ratio_var': 0.00018712440214585513, 'val/num_eos_tokens': 0, 'lr': 2.2954434100930916e-05, 'episode': 4420, 'epoch': 0.54}



                                                                         
 54%|█████▍    | 1106/2041 [1:36:07<1:20:06,  5.14s/it]

{'eps': 0, 'objective/kl': 69.59903717041016, 'objective/entropy': 59.130855560302734, 'objective/non_score_reward': -3.479951858520508, 'objective/rlhf_reward': -5.364899158477783, 'objective/scores': -1.8849471807479858, 'policy/approxkl_avg': 0.0341135710477829, 'policy/clipfrac_avg': 0.10613207519054413, 'loss/policy_avg': -0.039666544646024704, 'loss/value_avg': 0.3954571485519409, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0060855150222778, 'val/ratio': 0.9692022204399109, 'val/ratio_var': 0.0006417679251171649, 'val/num_eos_tokens': 0, 'lr': 2.2929936305732484e-05, 'episode': 4424, 'epoch': 0.54}



                                                                         
 54%|█████▍    | 1107/2041 [1:36:12<1:19:58,  5.14s/it]

{'eps': 0, 'objective/kl': 42.970428466796875, 'objective/entropy': 12.79261589050293, 'objective/non_score_reward': -2.1485214233398438, 'objective/rlhf_reward': -4.252131938934326, 'objective/scores': -2.1036105155944824, 'policy/approxkl_avg': 0.005596729926764965, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.012007341720163822, 'loss/value_avg': 0.08603028953075409, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2299756407737732, 'val/ratio': 1.005774736404419, 'val/ratio_var': 5.335258902050555e-05, 'val/num_eos_tokens': 0, 'lr': 2.2905438510534052e-05, 'episode': 4428, 'epoch': 0.54}



                                                                         
 54%|█████▍    | 1108/2041 [1:36:17<1:20:12,  5.16s/it]

{'eps': 0, 'objective/kl': 64.67813110351562, 'objective/entropy': 39.069847106933594, 'objective/non_score_reward': -3.2339067459106445, 'objective/rlhf_reward': -5.605569839477539, 'objective/scores': -2.3716630935668945, 'policy/approxkl_avg': 0.023919228464365005, 'policy/clipfrac_avg': 0.08136792480945587, 'loss/policy_avg': -0.03255656361579895, 'loss/value_avg': 0.19276124238967896, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7794808149337769, 'val/ratio': 0.9751260280609131, 'val/ratio_var': 0.00036330250441096723, 'val/num_eos_tokens': 0, 'lr': 2.2880940715335623e-05, 'episode': 4432, 'epoch': 0.54}



                                                                         
 54%|█████▍    | 1109/2041 [1:36:23<1:20:36,  5.19s/it]

{'eps': 0, 'objective/kl': 67.80445098876953, 'objective/entropy': 38.76662826538086, 'objective/non_score_reward': -3.3902225494384766, 'objective/rlhf_reward': -5.303065776824951, 'objective/scores': -1.9128432273864746, 'policy/approxkl_avg': 0.01725444383919239, 'policy/clipfrac_avg': 0.09433962404727936, 'loss/policy_avg': -0.037474535405635834, 'loss/value_avg': 0.31269824504852295, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.857384443283081, 'val/ratio': 0.965200662612915, 'val/ratio_var': 0.0008721500053070486, 'val/num_eos_tokens': 0, 'lr': 2.2856442920137188e-05, 'episode': 4436, 'epoch': 0.54}



                                                                         
 54%|█████▍    | 1110/2041 [1:36:28<1:20:35,  5.19s/it]

{'eps': 0, 'objective/kl': 58.5364990234375, 'objective/entropy': 18.147008895874023, 'objective/non_score_reward': -2.9268250465393066, 'objective/rlhf_reward': -5.030638217926025, 'objective/scores': -2.1038131713867188, 'policy/approxkl_avg': 0.006162671372294426, 'policy/clipfrac_avg': 0.04363207519054413, 'loss/policy_avg': -0.018227335065603256, 'loss/value_avg': 0.13474220037460327, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.45010918378829956, 'val/ratio': 0.9971139430999756, 'val/ratio_var': 4.100862497580238e-06, 'val/num_eos_tokens': 0, 'lr': 2.2831945124938756e-05, 'episode': 4440, 'epoch': 0.54}



                                                                         
 54%|█████▍    | 1111/2041 [1:36:33<1:20:08,  5.17s/it]

{'eps': 0, 'objective/kl': 57.782203674316406, 'objective/entropy': 27.891639709472656, 'objective/non_score_reward': -2.8891100883483887, 'objective/rlhf_reward': -5.0368733406066895, 'objective/scores': -2.147763252258301, 'policy/approxkl_avg': 0.01909741386771202, 'policy/clipfrac_avg': 0.05778301879763603, 'loss/policy_avg': -0.03278949856758118, 'loss/value_avg': 0.18322743475437164, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5463035106658936, 'val/ratio': 0.9788360595703125, 'val/ratio_var': 0.0002685456129256636, 'val/num_eos_tokens': 0, 'lr': 2.2807447329740324e-05, 'episode': 4444, 'epoch': 0.54}



                                                                         
 54%|█████▍    | 1112/2041 [1:36:38<1:20:03,  5.17s/it]

{'eps': 0, 'objective/kl': 97.88362884521484, 'objective/entropy': 118.52481079101562, 'objective/non_score_reward': -4.894181728363037, 'objective/rlhf_reward': -7.567107200622559, 'objective/scores': -2.6729257106781006, 'policy/approxkl_avg': 0.18013381958007812, 'policy/clipfrac_avg': 0.18160377442836761, 'loss/policy_avg': -0.05222351849079132, 'loss/value_avg': 0.8678761720657349, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 2.073765277862549, 'val/ratio': 0.9653445482254028, 'val/ratio_var': 0.0006360707920975983, 'val/num_eos_tokens': 0, 'lr': 2.2782949534541892e-05, 'episode': 4448, 'epoch': 0.54}



                                                                         
 55%|█████▍    | 1113/2041 [1:36:43<1:19:54,  5.17s/it]

{'eps': 0, 'objective/kl': 58.36067581176758, 'objective/entropy': 67.6708984375, 'objective/non_score_reward': -2.9180335998535156, 'objective/rlhf_reward': -4.941636085510254, 'objective/scores': -2.023602247238159, 'policy/approxkl_avg': 0.038788091391325, 'policy/clipfrac_avg': 0.08254716545343399, 'loss/policy_avg': -0.03437333554029465, 'loss/value_avg': 0.16674304008483887, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3005049228668213, 'val/ratio': 0.9677826166152954, 'val/ratio_var': 0.0008030328317545354, 'val/num_eos_tokens': 0, 'lr': 2.275845173934346e-05, 'episode': 4452, 'epoch': 0.55}



                                                                         
 55%|█████▍    | 1114/2041 [1:36:49<1:20:00,  5.18s/it]

{'eps': 0, 'objective/kl': 64.75425720214844, 'objective/entropy': 44.4586181640625, 'objective/non_score_reward': -3.237712860107422, 'objective/rlhf_reward': -5.81812858581543, 'objective/scores': -2.580415725708008, 'policy/approxkl_avg': 0.03006870485842228, 'policy/clipfrac_avg': 0.10495283454656601, 'loss/policy_avg': -0.037865594029426575, 'loss/value_avg': 0.2689454257488251, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8791276812553406, 'val/ratio': 0.98502516746521, 'val/ratio_var': 0.0001451642601750791, 'val/num_eos_tokens': 0, 'lr': 2.2733953944145028e-05, 'episode': 4456, 'epoch': 0.55}



                                                                         
 55%|█████▍    | 1115/2041 [1:36:54<1:19:23,  5.14s/it]

{'eps': 0, 'objective/kl': 61.96381759643555, 'objective/entropy': 89.64524841308594, 'objective/non_score_reward': -3.098191261291504, 'objective/rlhf_reward': -5.383604526519775, 'objective/scores': -2.2854132652282715, 'policy/approxkl_avg': 0.20999889075756073, 'policy/clipfrac_avg': 0.1320754736661911, 'loss/policy_avg': -0.04431650787591934, 'loss/value_avg': 0.5113729238510132, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.6104457378387451, 'val/ratio': 0.9524057507514954, 'val/ratio_var': 0.001371328136883676, 'val/num_eos_tokens': 0, 'lr': 2.2709456148946596e-05, 'episode': 4460, 'epoch': 0.55}



                                                                         
 55%|█████▍    | 1116/2041 [1:36:59<1:19:38,  5.17s/it]

{'eps': 0, 'objective/kl': 68.41670227050781, 'objective/entropy': 41.656551361083984, 'objective/non_score_reward': -3.420835256576538, 'objective/rlhf_reward': -5.0547404289245605, 'objective/scores': -1.633905053138733, 'policy/approxkl_avg': 0.023097213357686996, 'policy/clipfrac_avg': 0.08726415038108826, 'loss/policy_avg': -0.036861710250377655, 'loss/value_avg': 0.18193241953849792, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.828933596611023, 'val/ratio': 0.9796961545944214, 'val/ratio_var': 0.0002557017432991415, 'val/num_eos_tokens': 0, 'lr': 2.2684958353748164e-05, 'episode': 4464, 'epoch': 0.55}



                                                                         
 55%|█████▍    | 1117/2041 [1:37:04<1:19:54,  5.19s/it]

{'eps': 0, 'objective/kl': 56.27116394042969, 'objective/entropy': 23.18195343017578, 'objective/non_score_reward': -2.813558340072632, 'objective/rlhf_reward': -4.739972114562988, 'objective/scores': -1.9264137744903564, 'policy/approxkl_avg': 0.020444415509700775, 'policy/clipfrac_avg': 0.04716981202363968, 'loss/policy_avg': -0.030063726007938385, 'loss/value_avg': 0.24095214903354645, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.46423014998435974, 'val/ratio': 0.9785472750663757, 'val/ratio_var': 0.00027862426941283047, 'val/num_eos_tokens': 0, 'lr': 2.2660460558549732e-05, 'episode': 4468, 'epoch': 0.55}



                                                                         
 55%|█████▍    | 1118/2041 [1:37:09<1:19:22,  5.16s/it]

{'eps': 0, 'objective/kl': 64.48136138916016, 'objective/entropy': 59.09741973876953, 'objective/non_score_reward': -3.2240681648254395, 'objective/rlhf_reward': -4.424403190612793, 'objective/scores': -1.2003347873687744, 'policy/approxkl_avg': 0.020333213731646538, 'policy/clipfrac_avg': 0.11202830076217651, 'loss/policy_avg': -0.04397377744317055, 'loss/value_avg': 0.2459755688905716, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1754395961761475, 'val/ratio': 0.9866591691970825, 'val/ratio_var': 8.53023084346205e-05, 'val/num_eos_tokens': 0, 'lr': 2.2635962763351297e-05, 'episode': 4472, 'epoch': 0.55}



                                                                         
 55%|█████▍    | 1119/2041 [1:37:14<1:18:51,  5.13s/it]

{'eps': 0, 'objective/kl': 67.76054382324219, 'objective/entropy': 45.80141067504883, 'objective/non_score_reward': -3.3880274295806885, 'objective/rlhf_reward': -5.223574638366699, 'objective/scores': -1.8355472087860107, 'policy/approxkl_avg': 0.049815624952316284, 'policy/clipfrac_avg': 0.11438679695129395, 'loss/policy_avg': -0.04455142095685005, 'loss/value_avg': 0.29353266954421997, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9281585216522217, 'val/ratio': 0.9578638672828674, 'val/ratio_var': 0.0010795517591759562, 'val/num_eos_tokens': 0, 'lr': 2.261146496815287e-05, 'episode': 4476, 'epoch': 0.55}



                                                                         
 55%|█████▍    | 1120/2041 [1:37:19<1:18:49,  5.13s/it]

{'eps': 0, 'objective/kl': 61.00571823120117, 'objective/entropy': 31.79715919494629, 'objective/non_score_reward': -3.050286054611206, 'objective/rlhf_reward': -4.739037036895752, 'objective/scores': -1.688750982284546, 'policy/approxkl_avg': 0.006991637405008078, 'policy/clipfrac_avg': 0.053066037595272064, 'loss/policy_avg': -0.027773894369602203, 'loss/value_avg': 0.16075557470321655, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.492839515209198, 'val/ratio': 0.9852341413497925, 'val/ratio_var': 0.00017342495266348124, 'val/num_eos_tokens': 0, 'lr': 2.2586967172954433e-05, 'episode': 4480, 'epoch': 0.55}



                                                                         
 55%|█████▍    | 1121/2041 [1:37:25<1:18:41,  5.13s/it]

{'eps': 0, 'objective/kl': 55.818939208984375, 'objective/entropy': 9.93803882598877, 'objective/non_score_reward': -2.790947198867798, 'objective/rlhf_reward': -4.93656063079834, 'objective/scores': -2.145613670349121, 'policy/approxkl_avg': 0.003846833249554038, 'policy/clipfrac_avg': 0.025943396613001823, 'loss/policy_avg': -0.010610030964016914, 'loss/value_avg': 0.11634068936109543, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.21339048445224762, 'val/ratio': 0.9992998838424683, 'val/ratio_var': 2.1914227090746863e-06, 'val/num_eos_tokens': 0, 'lr': 2.2562469377756005e-05, 'episode': 4484, 'epoch': 0.55}



                                                                         
 55%|█████▍    | 1122/2041 [1:37:30<1:18:41,  5.14s/it]

{'eps': 0, 'objective/kl': 72.21670532226562, 'objective/entropy': 50.307979583740234, 'objective/non_score_reward': -3.610835313796997, 'objective/rlhf_reward': -5.480964660644531, 'objective/scores': -1.8701293468475342, 'policy/approxkl_avg': 0.04464787244796753, 'policy/clipfrac_avg': 0.09787735342979431, 'loss/policy_avg': -0.040337927639484406, 'loss/value_avg': 0.3132033944129944, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6854786276817322, 'val/ratio': 0.9651967287063599, 'val/ratio_var': 0.0007272620568983257, 'val/num_eos_tokens': 0, 'lr': 2.2537971582557573e-05, 'episode': 4488, 'epoch': 0.55}



                                                                         
 55%|█████▌    | 1123/2041 [1:37:35<1:18:23,  5.12s/it]

{'eps': 0, 'objective/kl': 59.45088577270508, 'objective/entropy': 21.094646453857422, 'objective/non_score_reward': -2.9725441932678223, 'objective/rlhf_reward': -5.151635646820068, 'objective/scores': -2.179091453552246, 'policy/approxkl_avg': 0.019771529361605644, 'policy/clipfrac_avg': 0.04716981202363968, 'loss/policy_avg': -0.025344394147396088, 'loss/value_avg': 0.26751694083213806, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.3721160590648651, 'val/ratio': 0.9861181974411011, 'val/ratio_var': 0.0001165322246379219, 'val/num_eos_tokens': 0, 'lr': 2.2513473787359137e-05, 'episode': 4492, 'epoch': 0.55}



                                                                         
 55%|█████▌    | 1124/2041 [1:37:40<1:18:38,  5.15s/it]

{'eps': 0, 'objective/kl': 59.85006332397461, 'objective/entropy': 20.782514572143555, 'objective/non_score_reward': -2.9925034046173096, 'objective/rlhf_reward': -4.963459014892578, 'objective/scores': -1.9709558486938477, 'policy/approxkl_avg': 0.050714388489723206, 'policy/clipfrac_avg': 0.06839622557163239, 'loss/policy_avg': -0.030835533514618874, 'loss/value_avg': 0.16303637623786926, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.466521680355072, 'val/ratio': 0.9679306745529175, 'val/ratio_var': 0.0006160769262351096, 'val/num_eos_tokens': 0, 'lr': 2.248897599216071e-05, 'episode': 4496, 'epoch': 0.55}



                                                                         
 55%|█████▌    | 1125/2041 [1:37:45<1:18:33,  5.15s/it]

{'eps': 0, 'objective/kl': 54.213565826416016, 'objective/entropy': 0.4541933536529541, 'objective/non_score_reward': -2.7106785774230957, 'objective/rlhf_reward': -4.352847099304199, 'objective/scores': -1.6421682834625244, 'policy/approxkl_avg': 1.1273282325419132e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0007270929054357111, 'loss/value_avg': 0.08322605490684509, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.05675182491540909, 'val/ratio': 0.9997724294662476, 'val/ratio_var': 1.0445597808939056e-07, 'val/num_eos_tokens': 0, 'lr': 2.2464478196962273e-05, 'episode': 4500, 'epoch': 0.55}



                                                                         
 55%|█████▌    | 1126/2041 [1:37:50<1:18:48,  5.17s/it]

{'eps': 0, 'objective/kl': 58.495201110839844, 'objective/entropy': 25.616077423095703, 'objective/non_score_reward': -2.924760341644287, 'objective/rlhf_reward': -4.137729644775391, 'objective/scores': -1.2129690647125244, 'policy/approxkl_avg': 0.008365316316485405, 'policy/clipfrac_avg': 0.049528300762176514, 'loss/policy_avg': -0.023223139345645905, 'loss/value_avg': 0.126789391040802, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5243926048278809, 'val/ratio': 1.0017714500427246, 'val/ratio_var': 2.6389780032332055e-06, 'val/num_eos_tokens': 0, 'lr': 2.243998040176384e-05, 'episode': 4504, 'epoch': 0.55}



                                                                         
 55%|█████▌    | 1127/2041 [1:37:55<1:18:16,  5.14s/it]

{'eps': 0, 'objective/kl': 52.273780822753906, 'objective/entropy': 2.879572868347168, 'objective/non_score_reward': -2.6136889457702637, 'objective/rlhf_reward': -4.432436943054199, 'objective/scores': -1.8187477588653564, 'policy/approxkl_avg': 0.004853488877415657, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.007352905347943306, 'loss/value_avg': 0.08421307802200317, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.08850830793380737, 'val/ratio': 1.008231282234192, 'val/ratio_var': 7.789617666276172e-05, 'val/num_eos_tokens': 0, 'lr': 2.241548260656541e-05, 'episode': 4508, 'epoch': 0.55}



                                                                         
 55%|█████▌    | 1128/2041 [1:38:01<1:18:08,  5.14s/it]

{'eps': 0, 'objective/kl': 57.83977127075195, 'objective/entropy': 54.474369049072266, 'objective/non_score_reward': -2.891988515853882, 'objective/rlhf_reward': -4.6679840087890625, 'objective/scores': -1.7759954929351807, 'policy/approxkl_avg': 0.009170063771307468, 'policy/clipfrac_avg': 0.07075471431016922, 'loss/policy_avg': -0.03059772402048111, 'loss/value_avg': 0.1958892047405243, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.022355556488037, 'val/ratio': 0.9865577220916748, 'val/ratio_var': 0.00014992822252679616, 'val/num_eos_tokens': 0, 'lr': 2.2390984811366978e-05, 'episode': 4512, 'epoch': 0.55}



                                                                         
 55%|█████▌    | 1129/2041 [1:38:06<1:17:47,  5.12s/it]

{'eps': 0, 'objective/kl': 50.59064865112305, 'objective/entropy': 7.035040378570557, 'objective/non_score_reward': -2.5295324325561523, 'objective/rlhf_reward': -4.760457515716553, 'objective/scores': -2.2309250831604004, 'policy/approxkl_avg': 0.010704156942665577, 'policy/clipfrac_avg': 0.028301887214183807, 'loss/policy_avg': -0.01159963570535183, 'loss/value_avg': 0.085954450070858, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.21166229248046875, 'val/ratio': 0.9870058298110962, 'val/ratio_var': 0.00011982873547822237, 'val/num_eos_tokens': 0, 'lr': 2.2366487016168546e-05, 'episode': 4516, 'epoch': 0.55}



                                                                         
 55%|█████▌    | 1130/2041 [1:38:11<1:18:03,  5.14s/it]

{'eps': 0, 'objective/kl': 64.91938018798828, 'objective/entropy': 41.19995880126953, 'objective/non_score_reward': -3.245968818664551, 'objective/rlhf_reward': -4.832169055938721, 'objective/scores': -1.5862003564834595, 'policy/approxkl_avg': 0.009123630821704865, 'policy/clipfrac_avg': 0.07075472176074982, 'loss/policy_avg': -0.024558521807193756, 'loss/value_avg': 0.2261699140071869, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.741021990776062, 'val/ratio': 0.9941965341567993, 'val/ratio_var': 2.4578761440352537e-05, 'val/num_eos_tokens': 0, 'lr': 2.2341989220970114e-05, 'episode': 4520, 'epoch': 0.55}



                                                                         
 55%|█████▌    | 1131/2041 [1:38:16<1:17:50,  5.13s/it]

{'eps': 0, 'objective/kl': 77.14717102050781, 'objective/entropy': 52.25716781616211, 'objective/non_score_reward': -3.857358694076538, 'objective/rlhf_reward': -4.94431209564209, 'objective/scores': -1.0869534015655518, 'policy/approxkl_avg': 0.01626311056315899, 'policy/clipfrac_avg': 0.1037735864520073, 'loss/policy_avg': -0.034370262175798416, 'loss/value_avg': 0.35156071186065674, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.893591582775116, 'val/ratio': 0.9814327955245972, 'val/ratio_var': 0.00023415825853589922, 'val/num_eos_tokens': 0, 'lr': 2.231749142577168e-05, 'episode': 4524, 'epoch': 0.55}



                                                                         
 55%|█████▌    | 1132/2041 [1:38:21<1:18:03,  5.15s/it]

{'eps': 0, 'objective/kl': 59.6140251159668, 'objective/entropy': 15.082330703735352, 'objective/non_score_reward': -2.980701446533203, 'objective/rlhf_reward': -5.009102821350098, 'objective/scores': -2.0284013748168945, 'policy/approxkl_avg': 0.0410948321223259, 'policy/clipfrac_avg': 0.05424528196454048, 'loss/policy_avg': -0.02041192725300789, 'loss/value_avg': 0.1540805697441101, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.38442277908325195, 'val/ratio': 0.98203045129776, 'val/ratio_var': 0.00019056563905905932, 'val/num_eos_tokens': 0, 'lr': 2.229299363057325e-05, 'episode': 4528, 'epoch': 0.55}



                                                                         
 56%|█████▌    | 1133/2041 [1:38:26<1:17:59,  5.15s/it]

{'eps': 0, 'objective/kl': 68.1065673828125, 'objective/entropy': 28.108531951904297, 'objective/non_score_reward': -3.4053282737731934, 'objective/rlhf_reward': -5.0483293533325195, 'objective/scores': -1.6430009603500366, 'policy/approxkl_avg': 0.017027007415890694, 'policy/clipfrac_avg': 0.06132075563073158, 'loss/policy_avg': -0.028280962258577347, 'loss/value_avg': 0.2318364977836609, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5293517112731934, 'val/ratio': 0.9750162959098816, 'val/ratio_var': 0.0004173172346781939, 'val/num_eos_tokens': 0, 'lr': 2.2268495835374818e-05, 'episode': 4532, 'epoch': 0.56}



                                                                         
 56%|█████▌    | 1134/2041 [1:38:31<1:17:57,  5.16s/it]

{'eps': 0, 'objective/kl': 61.043174743652344, 'objective/entropy': 22.48143196105957, 'objective/non_score_reward': -3.052158832550049, 'objective/rlhf_reward': -5.354108810424805, 'objective/scores': -2.301950216293335, 'policy/approxkl_avg': 0.009597750380635262, 'policy/clipfrac_avg': 0.041273586452007294, 'loss/policy_avg': -0.01901593618094921, 'loss/value_avg': 0.18770051002502441, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.39883050322532654, 'val/ratio': 0.9866260886192322, 'val/ratio_var': 0.0001412164856446907, 'val/num_eos_tokens': 0, 'lr': 2.2243998040176386e-05, 'episode': 4536, 'epoch': 0.56}



                                                                         
 56%|█████▌    | 1135/2041 [1:38:37<1:17:43,  5.15s/it]

{'eps': 0, 'objective/kl': 59.59461212158203, 'objective/entropy': 24.671995162963867, 'objective/non_score_reward': -2.9797306060791016, 'objective/rlhf_reward': -5.150010585784912, 'objective/scores': -2.1702799797058105, 'policy/approxkl_avg': 0.013180457055568695, 'policy/clipfrac_avg': 0.04363207519054413, 'loss/policy_avg': -0.02182936668395996, 'loss/value_avg': 0.14887741208076477, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.3436795771121979, 'val/ratio': 0.9859291315078735, 'val/ratio_var': 0.00012554455315694213, 'val/num_eos_tokens': 0, 'lr': 2.2219500244977954e-05, 'episode': 4540, 'epoch': 0.56}



                                                                         
 56%|█████▌    | 1136/2041 [1:38:42<1:17:30,  5.14s/it]

{'eps': 0, 'objective/kl': 76.37921142578125, 'objective/entropy': 25.230714797973633, 'objective/non_score_reward': -3.8189609050750732, 'objective/rlhf_reward': -5.08830451965332, 'objective/scores': -1.269343376159668, 'policy/approxkl_avg': 0.01632154919207096, 'policy/clipfrac_avg': 0.06367924809455872, 'loss/policy_avg': -0.025522850453853607, 'loss/value_avg': 0.2639700770378113, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5034797191619873, 'val/ratio': 0.9889593124389648, 'val/ratio_var': 7.629935862496495e-05, 'val/num_eos_tokens': 0, 'lr': 2.219500244977952e-05, 'episode': 4544, 'epoch': 0.56}



                                                                         
 56%|█████▌    | 1137/2041 [1:38:47<1:17:46,  5.16s/it]

{'eps': 0, 'objective/kl': 71.753662109375, 'objective/entropy': 26.284639358520508, 'objective/non_score_reward': -3.5876832008361816, 'objective/rlhf_reward': -4.698770523071289, 'objective/scores': -1.1110875606536865, 'policy/approxkl_avg': 0.012502573430538177, 'policy/clipfrac_avg': 0.06014150753617287, 'loss/policy_avg': -0.01886764168739319, 'loss/value_avg': 0.12959977984428406, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.4906753897666931, 'val/ratio': 0.9795456528663635, 'val/ratio_var': 0.0002641109167598188, 'val/num_eos_tokens': 0, 'lr': 2.217050465458109e-05, 'episode': 4548, 'epoch': 0.56}



                                                                         
 56%|█████▌    | 1138/2041 [1:38:52<1:17:17,  5.14s/it]

{'eps': 0, 'objective/kl': 71.95314025878906, 'objective/entropy': 25.56500244140625, 'objective/non_score_reward': -3.5976572036743164, 'objective/rlhf_reward': -5.757752895355225, 'objective/scores': -2.160095691680908, 'policy/approxkl_avg': 0.009716849774122238, 'policy/clipfrac_avg': 0.05896226689219475, 'loss/policy_avg': -0.019313141703605652, 'loss/value_avg': 0.17880159616470337, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.4803285300731659, 'val/ratio': 0.9936659932136536, 'val/ratio_var': 2.2552450900548138e-05, 'val/num_eos_tokens': 0, 'lr': 2.2146006859382658e-05, 'episode': 4552, 'epoch': 0.56}



                                                                         
 56%|█████▌    | 1139/2041 [1:38:57<1:17:32,  5.16s/it]

{'eps': 0, 'objective/kl': 79.39674377441406, 'objective/entropy': 41.494258880615234, 'objective/non_score_reward': -3.9698376655578613, 'objective/rlhf_reward': -6.187068939208984, 'objective/scores': -2.217231035232544, 'policy/approxkl_avg': 0.02173413336277008, 'policy/clipfrac_avg': 0.09787736088037491, 'loss/policy_avg': -0.03052576817572117, 'loss/value_avg': 0.46306514739990234, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7627524137496948, 'val/ratio': 0.9593082070350647, 'val/ratio_var': 0.0011067179730162024, 'val/num_eos_tokens': 0, 'lr': 2.2121509064184223e-05, 'episode': 4556, 'epoch': 0.56}



                                                                         
 56%|█████▌    | 1140/2041 [1:39:02<1:17:50,  5.18s/it]

{'eps': 0, 'objective/kl': 59.46570587158203, 'objective/entropy': 7.848850250244141, 'objective/non_score_reward': -2.97328519821167, 'objective/rlhf_reward': -5.066498756408691, 'objective/scores': -2.0932133197784424, 'policy/approxkl_avg': 0.010266651399433613, 'policy/clipfrac_avg': 0.025943398475646973, 'loss/policy_avg': -0.009632411412894726, 'loss/value_avg': 0.10059671103954315, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.19026882946491241, 'val/ratio': 0.9906213283538818, 'val/ratio_var': 5.270852489047684e-05, 'val/num_eos_tokens': 0, 'lr': 2.2097011268985794e-05, 'episode': 4560, 'epoch': 0.56}



                                                                         
 56%|█████▌    | 1141/2041 [1:39:08<1:17:41,  5.18s/it]

{'eps': 0, 'objective/kl': 59.13003921508789, 'objective/entropy': 6.576756954193115, 'objective/non_score_reward': -2.9565019607543945, 'objective/rlhf_reward': -4.516102313995361, 'objective/scores': -1.5596004724502563, 'policy/approxkl_avg': 0.0021358204539865255, 'policy/clipfrac_avg': 0.015330188907682896, 'loss/policy_avg': -0.006312653422355652, 'loss/value_avg': 0.08102399855852127, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16041474044322968, 'val/ratio': 0.9949045181274414, 'val/ratio_var': 1.886445352283772e-05, 'val/num_eos_tokens': 0, 'lr': 2.207251347378736e-05, 'episode': 4564, 'epoch': 0.56}



                                                                         
 56%|█████▌    | 1142/2041 [1:39:13<1:17:29,  5.17s/it]

{'eps': 0, 'objective/kl': 64.489501953125, 'objective/entropy': 29.277366638183594, 'objective/non_score_reward': -3.224475145339966, 'objective/rlhf_reward': -4.877220153808594, 'objective/scores': -1.652745008468628, 'policy/approxkl_avg': 0.010932830162346363, 'policy/clipfrac_avg': 0.0554245263338089, 'loss/policy_avg': -0.02101065218448639, 'loss/value_avg': 0.12306518107652664, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.4656817317008972, 'val/ratio': 0.9737920761108398, 'val/ratio_var': 0.00050061458023265, 'val/num_eos_tokens': 0, 'lr': 2.204801567858893e-05, 'episode': 4568, 'epoch': 0.56}



                                                                         
 56%|█████▌    | 1143/2041 [1:39:18<1:17:31,  5.18s/it]

{'eps': 0, 'objective/kl': 55.50004577636719, 'objective/entropy': 1.4119093418121338, 'objective/non_score_reward': -2.7750020027160645, 'objective/rlhf_reward': -4.505325794219971, 'objective/scores': -1.7303237915039062, 'policy/approxkl_avg': 0.003097428474575281, 'policy/clipfrac_avg': 0.016509434208273888, 'loss/policy_avg': -0.009750447236001492, 'loss/value_avg': 0.048284780234098434, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.06679850071668625, 'val/ratio': 0.992492139339447, 'val/ratio_var': 5.143196904100478e-05, 'val/num_eos_tokens': 0, 'lr': 2.2023517883390495e-05, 'episode': 4572, 'epoch': 0.56}



                                                                         
 56%|█████▌    | 1144/2041 [1:39:23<1:17:32,  5.19s/it]

{'eps': 0, 'objective/kl': 60.36110305786133, 'objective/entropy': 5.492273330688477, 'objective/non_score_reward': -3.0180552005767822, 'objective/rlhf_reward': -4.725367546081543, 'objective/scores': -1.7073122262954712, 'policy/approxkl_avg': 0.0060027362778782845, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.008664443157613277, 'loss/value_avg': 0.03536148741841316, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.136807382106781, 'val/ratio': 0.9884946346282959, 'val/ratio_var': 0.00011220847954973578, 'val/num_eos_tokens': 0, 'lr': 2.1999020088192063e-05, 'episode': 4576, 'epoch': 0.56}



                                                                         
 56%|█████▌    | 1145/2041 [1:39:28<1:17:07,  5.16s/it]

{'eps': 0, 'objective/kl': 55.26152801513672, 'objective/entropy': 5.609065532684326, 'objective/non_score_reward': -2.7630765438079834, 'objective/rlhf_reward': -4.614536762237549, 'objective/scores': -1.851460337638855, 'policy/approxkl_avg': 0.009285527281463146, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.009994312189519405, 'loss/value_avg': 0.11305586993694305, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13589006662368774, 'val/ratio': 1.0017647743225098, 'val/ratio_var': 4.543034719972638e-06, 'val/num_eos_tokens': 0, 'lr': 2.197452229299363e-05, 'episode': 4580, 'epoch': 0.56}



                                                                         
 56%|█████▌    | 1146/2041 [1:39:33<1:16:47,  5.15s/it]

{'eps': 0, 'objective/kl': 62.75345230102539, 'objective/entropy': 13.67894458770752, 'objective/non_score_reward': -3.1376726627349854, 'objective/rlhf_reward': -4.993368625640869, 'objective/scores': -1.8556959629058838, 'policy/approxkl_avg': 0.007515447214245796, 'policy/clipfrac_avg': 0.03183962404727936, 'loss/policy_avg': -0.009449576027691364, 'loss/value_avg': 0.1362709105014801, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.22734491527080536, 'val/ratio': 1.0057947635650635, 'val/ratio_var': 2.9311186153790914e-05, 'val/num_eos_tokens': 0, 'lr': 2.19500244977952e-05, 'episode': 4584, 'epoch': 0.56}



                                                                         
 56%|█████▌    | 1147/2041 [1:39:38<1:16:27,  5.13s/it]

{'eps': 0, 'objective/kl': 61.45869445800781, 'objective/entropy': 12.883647918701172, 'objective/non_score_reward': -3.072934627532959, 'objective/rlhf_reward': -4.946172714233398, 'objective/scores': -1.8732380867004395, 'policy/approxkl_avg': 0.018673986196517944, 'policy/clipfrac_avg': 0.05188679322600365, 'loss/policy_avg': -0.013705609366297722, 'loss/value_avg': 0.09158846735954285, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.27833038568496704, 'val/ratio': 1.0081021785736084, 'val/ratio_var': 0.0001967137650353834, 'val/num_eos_tokens': 0, 'lr': 2.192552670259677e-05, 'episode': 4588, 'epoch': 0.56}



                                                                         
 56%|█████▌    | 1148/2041 [1:39:44<1:16:23,  5.13s/it]

{'eps': 0, 'objective/kl': 55.79853820800781, 'objective/entropy': 2.6813011169433594, 'objective/non_score_reward': -2.7899270057678223, 'objective/rlhf_reward': -4.712438583374023, 'objective/scores': -1.9225118160247803, 'policy/approxkl_avg': 0.001407567411661148, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.0007995502091944218, 'loss/value_avg': 0.049933284521102905, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.08603212237358093, 'val/ratio': 0.9967004656791687, 'val/ratio_var': 7.169815489760367e-06, 'val/num_eos_tokens': 0, 'lr': 2.1901028907398335e-05, 'episode': 4592, 'epoch': 0.56}



                                                                         
 56%|█████▋    | 1149/2041 [1:39:49<1:16:28,  5.14s/it]

{'eps': 0, 'objective/kl': 55.68109893798828, 'objective/entropy': 6.38167667388916, 'objective/non_score_reward': -2.78405499458313, 'objective/rlhf_reward': -4.361177444458008, 'objective/scores': -1.5771223306655884, 'policy/approxkl_avg': 0.0037752219941467047, 'policy/clipfrac_avg': 0.024764150381088257, 'loss/policy_avg': -0.014181448146700859, 'loss/value_avg': 0.07479926943778992, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13928452134132385, 'val/ratio': 0.9972096681594849, 'val/ratio_var': 7.085138804541202e-06, 'val/num_eos_tokens': 0, 'lr': 2.1876531112199903e-05, 'episode': 4596, 'epoch': 0.56}



                                                                         
 56%|█████▋    | 1150/2041 [1:39:54<1:16:09,  5.13s/it]

{'eps': 0, 'objective/kl': 51.125022888183594, 'objective/entropy': 16.766630172729492, 'objective/non_score_reward': -2.556251287460327, 'objective/rlhf_reward': -4.969230651855469, 'objective/scores': -2.4129796028137207, 'policy/approxkl_avg': 0.026021141558885574, 'policy/clipfrac_avg': 0.06367924809455872, 'loss/policy_avg': -0.022468343377113342, 'loss/value_avg': 0.16766905784606934, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.29006415605545044, 'val/ratio': 0.977092981338501, 'val/ratio_var': 0.00033826552680693567, 'val/num_eos_tokens': 0, 'lr': 2.185203331700147e-05, 'episode': 4600, 'epoch': 0.56}



                                                                         
 56%|█████▋    | 1151/2041 [1:39:59<1:16:30,  5.16s/it]

{'eps': 0, 'objective/kl': 60.639564514160156, 'objective/entropy': 4.357626914978027, 'objective/non_score_reward': -3.031978130340576, 'objective/rlhf_reward': -5.279726982116699, 'objective/scores': -2.247748851776123, 'policy/approxkl_avg': 0.0010002459166571498, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.0013587289722636342, 'loss/value_avg': 0.05798064544796944, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13882219791412354, 'val/ratio': 0.9990186095237732, 'val/ratio_var': 1.2641918374356464e-06, 'val/num_eos_tokens': 0, 'lr': 2.182753552180304e-05, 'episode': 4604, 'epoch': 0.56}



                                                                         
 56%|█████▋    | 1152/2041 [1:40:04<1:16:31,  5.16s/it]

{'eps': 0, 'objective/kl': 64.01295471191406, 'objective/entropy': 9.449106216430664, 'objective/non_score_reward': -3.2006478309631348, 'objective/rlhf_reward': -5.342601299285889, 'objective/scores': -2.141953468322754, 'policy/approxkl_avg': 0.004785760305821896, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.008219791576266289, 'loss/value_avg': 0.10954327881336212, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1998094916343689, 'val/ratio': 0.9899724125862122, 'val/ratio_var': 7.889088738011196e-05, 'val/num_eos_tokens': 0, 'lr': 2.1803037726604604e-05, 'episode': 4608, 'epoch': 0.56}



                                                                         
 56%|█████▋    | 1153/2041 [1:40:09<1:15:46,  5.12s/it]

{'eps': 0, 'objective/kl': 57.60772705078125, 'objective/entropy': 8.345726013183594, 'objective/non_score_reward': -2.8803863525390625, 'objective/rlhf_reward': -5.068120956420898, 'objective/scores': -2.187734365463257, 'policy/approxkl_avg': 0.002192920306697488, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.014561343938112259, 'loss/value_avg': 0.06327098608016968, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2296057492494583, 'val/ratio': 0.9849850535392761, 'val/ratio_var': 0.00024371228937525302, 'val/num_eos_tokens': 0, 'lr': 2.1778539931406175e-05, 'episode': 4612, 'epoch': 0.56}



                                                                         
 57%|█████▋    | 1154/2041 [1:40:15<1:16:21,  5.17s/it]

{'eps': 0, 'objective/kl': 70.99979400634766, 'objective/entropy': 31.199901580810547, 'objective/non_score_reward': -3.549989700317383, 'objective/rlhf_reward': -5.470366954803467, 'objective/scores': -1.920377254486084, 'policy/approxkl_avg': 0.018755897879600525, 'policy/clipfrac_avg': 0.08844339847564697, 'loss/policy_avg': -0.022744957357645035, 'loss/value_avg': 0.17262116074562073, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5124534368515015, 'val/ratio': 0.9719420075416565, 'val/ratio_var': 0.0005715371808037162, 'val/num_eos_tokens': 0, 'lr': 2.1754042136207743e-05, 'episode': 4616, 'epoch': 0.57}



                                                                         
 57%|█████▋    | 1155/2041 [1:40:20<1:16:26,  5.18s/it]

{'eps': 0, 'objective/kl': 86.581787109375, 'objective/entropy': 33.78017807006836, 'objective/non_score_reward': -4.329089164733887, 'objective/rlhf_reward': -6.155072212219238, 'objective/scores': -1.8259832859039307, 'policy/approxkl_avg': 0.026614755392074585, 'policy/clipfrac_avg': 0.09080188721418381, 'loss/policy_avg': -0.027710694819688797, 'loss/value_avg': 0.36897599697113037, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6456189751625061, 'val/ratio': 0.9827104806900024, 'val/ratio_var': 0.00021455147361848503, 'val/num_eos_tokens': 0, 'lr': 2.172954434100931e-05, 'episode': 4620, 'epoch': 0.57}



                                                                         
 57%|█████▋    | 1156/2041 [1:40:25<1:16:31,  5.19s/it]

{'eps': 0, 'objective/kl': 57.64788818359375, 'objective/entropy': 13.978721618652344, 'objective/non_score_reward': -2.882394313812256, 'objective/rlhf_reward': -4.918207168579102, 'objective/scores': -2.0358126163482666, 'policy/approxkl_avg': 0.012598685920238495, 'policy/clipfrac_avg': 0.03183962404727936, 'loss/policy_avg': -0.012618133798241615, 'loss/value_avg': 0.108701191842556, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.18158377707004547, 'val/ratio': 0.9907991886138916, 'val/ratio_var': 4.5929697080282494e-05, 'val/num_eos_tokens': 0, 'lr': 2.170504654581088e-05, 'episode': 4624, 'epoch': 0.57}



                                                                         
 57%|█████▋    | 1157/2041 [1:40:30<1:16:32,  5.19s/it]

{'eps': 0, 'objective/kl': 58.29787063598633, 'objective/entropy': 6.458399772644043, 'objective/non_score_reward': -2.914893627166748, 'objective/rlhf_reward': -4.5377068519592285, 'objective/scores': -1.62281334400177, 'policy/approxkl_avg': 0.0040511623956263065, 'policy/clipfrac_avg': 0.024764150381088257, 'loss/policy_avg': -0.008269899524748325, 'loss/value_avg': 0.10678596794605255, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14284195005893707, 'val/ratio': 0.9910060167312622, 'val/ratio_var': 6.576123269042e-05, 'val/num_eos_tokens': 0, 'lr': 2.1680548750612444e-05, 'episode': 4628, 'epoch': 0.57}



                                                                         
 57%|█████▋    | 1158/2041 [1:40:35<1:16:27,  5.20s/it]

{'eps': 0, 'objective/kl': 55.771766662597656, 'objective/entropy': 6.758991241455078, 'objective/non_score_reward': -2.788588523864746, 'objective/rlhf_reward': -4.592424392700195, 'objective/scores': -1.8038356304168701, 'policy/approxkl_avg': 0.003934110514819622, 'policy/clipfrac_avg': 0.02712264284491539, 'loss/policy_avg': -0.009123641066253185, 'loss/value_avg': 0.09085863828659058, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11738845705986023, 'val/ratio': 0.9951196312904358, 'val/ratio_var': 1.6000924006220885e-05, 'val/num_eos_tokens': 0, 'lr': 2.1656050955414015e-05, 'episode': 4632, 'epoch': 0.57}



                                                                         
 57%|█████▋    | 1159/2041 [1:40:40<1:15:42,  5.15s/it]

{'eps': 0, 'objective/kl': 68.09268188476562, 'objective/entropy': 13.75832462310791, 'objective/non_score_reward': -3.4046339988708496, 'objective/rlhf_reward': -4.267789363861084, 'objective/scores': -0.8631552457809448, 'policy/approxkl_avg': 0.003260695608332753, 'policy/clipfrac_avg': 0.02712264284491539, 'loss/policy_avg': -0.011978210881352425, 'loss/value_avg': 0.27297112345695496, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.31332099437713623, 'val/ratio': 0.9966578483581543, 'val/ratio_var': 1.072212489816593e-05, 'val/num_eos_tokens': 0, 'lr': 2.163155316021558e-05, 'episode': 4636, 'epoch': 0.57}



                                                                         
 57%|█████▋    | 1160/2041 [1:40:46<1:15:44,  5.16s/it]

{'eps': 0, 'objective/kl': 57.730384826660156, 'objective/entropy': 5.918543815612793, 'objective/non_score_reward': -2.886519432067871, 'objective/rlhf_reward': -4.517328262329102, 'objective/scores': -1.630808711051941, 'policy/approxkl_avg': 0.0018341471441090107, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.010017799213528633, 'loss/value_avg': 0.09703052043914795, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12960252165794373, 'val/ratio': 0.9986127018928528, 'val/ratio_var': 8.621400411357172e-07, 'val/num_eos_tokens': 0, 'lr': 2.1607055365017148e-05, 'episode': 4640, 'epoch': 0.57}



                                                                         
 57%|█████▋    | 1161/2041 [1:40:51<1:15:43,  5.16s/it]

{'eps': 0, 'objective/kl': 49.79123306274414, 'objective/entropy': 1.2515950202941895, 'objective/non_score_reward': -2.4895615577697754, 'objective/rlhf_reward': -4.233664035797119, 'objective/scores': -1.7441024780273438, 'policy/approxkl_avg': 0.0009642936056479812, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.0012950320960953832, 'loss/value_avg': 0.0503978431224823, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.048288773745298386, 'val/ratio': 0.9972422122955322, 'val/ratio_var': 6.221191142685711e-06, 'val/num_eos_tokens': 0, 'lr': 2.1582557569818716e-05, 'episode': 4644, 'epoch': 0.57}



                                                                         
 57%|█████▋    | 1162/2041 [1:40:56<1:16:10,  5.20s/it]

{'eps': 0, 'objective/kl': 56.55387878417969, 'objective/entropy': 3.6361751556396484, 'objective/non_score_reward': -2.8276939392089844, 'objective/rlhf_reward': -4.536919593811035, 'objective/scores': -1.7092254161834717, 'policy/approxkl_avg': 0.002621858846396208, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.008451549336314201, 'loss/value_avg': 0.07022431492805481, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12888552248477936, 'val/ratio': 0.9998437166213989, 'val/ratio_var': 1.7586311074069272e-08, 'val/num_eos_tokens': 0, 'lr': 2.1558059774620284e-05, 'episode': 4648, 'epoch': 0.57}



                                                                         
 57%|█████▋    | 1163/2041 [1:41:01<1:16:07,  5.20s/it]

{'eps': 0, 'objective/kl': 59.47914123535156, 'objective/entropy': 22.579936981201172, 'objective/non_score_reward': -2.973957061767578, 'objective/rlhf_reward': -4.817573547363281, 'objective/scores': -1.8436163663864136, 'policy/approxkl_avg': 0.006134084425866604, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.020900536328554153, 'loss/value_avg': 0.19069398939609528, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.37956440448760986, 'val/ratio': 0.993678092956543, 'val/ratio_var': 2.4746404960751534e-05, 'val/num_eos_tokens': 0, 'lr': 2.1533561979421856e-05, 'episode': 4652, 'epoch': 0.57}



                                                                         
 57%|█████▋    | 1164/2041 [1:41:07<1:16:13,  5.21s/it]

{'eps': 0, 'objective/kl': 55.01765441894531, 'objective/entropy': 3.159355640411377, 'objective/non_score_reward': -2.750882863998413, 'objective/rlhf_reward': -4.593503475189209, 'objective/scores': -1.8426207304000854, 'policy/approxkl_avg': 0.000624971289653331, 'policy/clipfrac_avg': 0.008254717104136944, 'loss/policy_avg': -0.006061428692191839, 'loss/value_avg': 0.06839436292648315, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.10057555139064789, 'val/ratio': 0.9975510239601135, 'val/ratio_var': 7.354958142968826e-06, 'val/num_eos_tokens': 0, 'lr': 2.150906418422342e-05, 'episode': 4656, 'epoch': 0.57}



                                                                         
 57%|█████▋    | 1165/2041 [1:41:12<1:15:56,  5.20s/it]

{'eps': 0, 'objective/kl': 75.42610168457031, 'objective/entropy': 31.600839614868164, 'objective/non_score_reward': -3.7713050842285156, 'objective/rlhf_reward': -5.580431938171387, 'objective/scores': -1.809126853942871, 'policy/approxkl_avg': 0.014764350838959217, 'policy/clipfrac_avg': 0.0625, 'loss/policy_avg': -0.029366251081228256, 'loss/value_avg': 0.364998459815979, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5604743957519531, 'val/ratio': 0.9752503633499146, 'val/ratio_var': 0.0004315424885135144, 'val/num_eos_tokens': 0, 'lr': 2.148456638902499e-05, 'episode': 4660, 'epoch': 0.57}



                                                                         
 57%|█████▋    | 1166/2041 [1:41:17<1:15:38,  5.19s/it]

{'eps': 0, 'objective/kl': 56.241024017333984, 'objective/entropy': 2.656346321105957, 'objective/non_score_reward': -2.812051296234131, 'objective/rlhf_reward': -4.590373516082764, 'objective/scores': -1.7783222198486328, 'policy/approxkl_avg': 0.0005117006949149072, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.0015777472872287035, 'loss/value_avg': 0.06373114138841629, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.08982325345277786, 'val/ratio': 0.9989045858383179, 'val/ratio_var': 1.3621192920254543e-06, 'val/num_eos_tokens': 0, 'lr': 2.1460068593826556e-05, 'episode': 4664, 'epoch': 0.57}



                                                                         
 57%|█████▋    | 1167/2041 [1:41:22<1:15:17,  5.17s/it]

{'eps': 0, 'objective/kl': 55.181697845458984, 'objective/entropy': 7.975537300109863, 'objective/non_score_reward': -2.759084939956665, 'objective/rlhf_reward': -4.508401870727539, 'objective/scores': -1.749316692352295, 'policy/approxkl_avg': 0.002295376267284155, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.006045280955731869, 'loss/value_avg': 0.052140720188617706, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.10018112510442734, 'val/ratio': 1.004183053970337, 'val/ratio_var': 1.408094340149546e-05, 'val/num_eos_tokens': 0, 'lr': 2.1435570798628124e-05, 'episode': 4668, 'epoch': 0.57}



                                                                         
 57%|█████▋    | 1168/2041 [1:41:27<1:15:24,  5.18s/it]

{'eps': 0, 'objective/kl': 56.26512908935547, 'objective/entropy': 4.756784439086914, 'objective/non_score_reward': -2.8132567405700684, 'objective/rlhf_reward': -4.528268814086914, 'objective/scores': -1.7150118350982666, 'policy/approxkl_avg': 0.002721303142607212, 'policy/clipfrac_avg': 0.022405661642551422, 'loss/policy_avg': -0.0047464920207858086, 'loss/value_avg': 0.06251843273639679, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1340271681547165, 'val/ratio': 0.9884387254714966, 'val/ratio_var': 0.00010581897367956117, 'val/num_eos_tokens': 0, 'lr': 2.1411073003429692e-05, 'episode': 4672, 'epoch': 0.57}



                                                                         
 57%|█████▋    | 1169/2041 [1:41:32<1:15:10,  5.17s/it]

{'eps': 0, 'objective/kl': 52.46649169921875, 'objective/entropy': 7.411537170410156, 'objective/non_score_reward': -2.6233246326446533, 'objective/rlhf_reward': -4.287023544311523, 'objective/scores': -1.663698673248291, 'policy/approxkl_avg': 0.001793854171410203, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.0064810700714588165, 'loss/value_avg': 0.05825076997280121, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1599947214126587, 'val/ratio': 0.998023271560669, 'val/ratio_var': 1.893349349302298e-06, 'val/num_eos_tokens': 0, 'lr': 2.138657520823126e-05, 'episode': 4676, 'epoch': 0.57}



                                                                         
 57%|█████▋    | 1170/2041 [1:41:38<1:15:20,  5.19s/it]

{'eps': 0, 'objective/kl': 53.991825103759766, 'objective/entropy': 6.619078159332275, 'objective/non_score_reward': -2.6995911598205566, 'objective/rlhf_reward': -4.610998153686523, 'objective/scores': -1.9114067554473877, 'policy/approxkl_avg': 0.001298912218771875, 'policy/clipfrac_avg': 0.015330188907682896, 'loss/policy_avg': -0.002996412105858326, 'loss/value_avg': 0.07051705569028854, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14199298620224, 'val/ratio': 1.0002994537353516, 'val/ratio_var': 4.4314372615872344e-08, 'val/num_eos_tokens': 0, 'lr': 2.136207741303283e-05, 'episode': 4680, 'epoch': 0.57}



                                                                         
 57%|█████▋    | 1171/2041 [1:41:43<1:15:21,  5.20s/it]

{'eps': 0, 'objective/kl': 63.33278274536133, 'objective/entropy': 14.160154342651367, 'objective/non_score_reward': -3.1666388511657715, 'objective/rlhf_reward': -5.328681945800781, 'objective/scores': -2.1620430946350098, 'policy/approxkl_avg': 0.15181423723697662, 'policy/clipfrac_avg': 0.03655660152435303, 'loss/policy_avg': -0.016379093751311302, 'loss/value_avg': 0.10822409391403198, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.31014496088027954, 'val/ratio': 0.9909448623657227, 'val/ratio_var': 3.749785173567943e-05, 'val/num_eos_tokens': 0, 'lr': 2.1337579617834397e-05, 'episode': 4684, 'epoch': 0.57}



                                                                         
 57%|█████▋    | 1172/2041 [1:41:48<1:15:10,  5.19s/it]

{'eps': 0, 'objective/kl': 61.02880096435547, 'objective/entropy': 8.332479476928711, 'objective/non_score_reward': -3.0514400005340576, 'objective/rlhf_reward': -4.895233154296875, 'objective/scores': -1.8437931537628174, 'policy/approxkl_avg': 0.0020937861409038305, 'policy/clipfrac_avg': 0.024764150381088257, 'loss/policy_avg': -0.0030903825536370277, 'loss/value_avg': 0.05862326920032501, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1682465374469757, 'val/ratio': 0.9956483840942383, 'val/ratio_var': 1.6876880181371234e-05, 'val/num_eos_tokens': 0, 'lr': 2.1313081822635965e-05, 'episode': 4688, 'epoch': 0.57}



                                                                         
 57%|█████▋    | 1173/2041 [1:41:53<1:15:20,  5.21s/it]

{'eps': 0, 'objective/kl': 68.34434509277344, 'objective/entropy': 8.93563461303711, 'objective/non_score_reward': -3.41721773147583, 'objective/rlhf_reward': -5.358372211456299, 'objective/scores': -1.9411544799804688, 'policy/approxkl_avg': 0.0011011653114110231, 'policy/clipfrac_avg': 0.008254717104136944, 'loss/policy_avg': -0.006587577052414417, 'loss/value_avg': 0.1753908097743988, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.18812456727027893, 'val/ratio': 0.9959752559661865, 'val/ratio_var': 1.2878640518465545e-05, 'val/num_eos_tokens': 0, 'lr': 2.128858402743753e-05, 'episode': 4692, 'epoch': 0.57}



                                                                         
 58%|█████▊    | 1174/2041 [1:41:58<1:15:00,  5.19s/it]

{'eps': 0, 'objective/kl': 57.005706787109375, 'objective/entropy': 3.065993309020996, 'objective/non_score_reward': -2.850285530090332, 'objective/rlhf_reward': -4.548042297363281, 'objective/scores': -1.6977565288543701, 'policy/approxkl_avg': 0.000412260415032506, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.003353582229465246, 'loss/value_avg': 0.04385543614625931, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.10799004882574081, 'val/ratio': 0.9965202808380127, 'val/ratio_var': 1.0906393981713336e-05, 'val/num_eos_tokens': 0, 'lr': 2.12640862322391e-05, 'episode': 4696, 'epoch': 0.58}



                                                                         
 58%|█████▊    | 1175/2041 [1:42:04<1:14:51,  5.19s/it]

{'eps': 0, 'objective/kl': 57.44916534423828, 'objective/entropy': 5.322890281677246, 'objective/non_score_reward': -2.8724582195281982, 'objective/rlhf_reward': -5.052330017089844, 'objective/scores': -2.1798720359802246, 'policy/approxkl_avg': 0.0017341750208288431, 'policy/clipfrac_avg': 0.017688680440187454, 'loss/policy_avg': -0.005600602366030216, 'loss/value_avg': 0.06565891951322556, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1422535479068756, 'val/ratio': 0.9972551465034485, 'val/ratio_var': 9.138276254816446e-06, 'val/num_eos_tokens': 0, 'lr': 2.1239588437040665e-05, 'episode': 4700, 'epoch': 0.58}



                                                                         
 58%|█████▊    | 1176/2041 [1:42:09<1:14:25,  5.16s/it]

{'eps': 0, 'objective/kl': 54.87513732910156, 'objective/entropy': 1.1614956855773926, 'objective/non_score_reward': -2.7437567710876465, 'objective/rlhf_reward': -4.59877872467041, 'objective/scores': -1.8550221920013428, 'policy/approxkl_avg': 3.275756535003893e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 0.0006374502554535866, 'loss/value_avg': 0.051261670887470245, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.06930464506149292, 'val/ratio': 1.0011913776397705, 'val/ratio_var': 9.088112165045459e-07, 'val/num_eos_tokens': 0, 'lr': 2.1215090641842237e-05, 'episode': 4704, 'epoch': 0.58}



                                                                         
 58%|█████▊    | 1177/2041 [1:42:14<1:14:05,  5.15s/it]

{'eps': 0, 'objective/kl': 56.24640655517578, 'objective/entropy': 2.872811794281006, 'objective/non_score_reward': -2.8123204708099365, 'objective/rlhf_reward': -4.602904319763184, 'objective/scores': -1.790583610534668, 'policy/approxkl_avg': 0.00034180760849267244, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.0010398547165095806, 'loss/value_avg': 0.04425642639398575, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09848075360059738, 'val/ratio': 0.9999614953994751, 'val/ratio_var': 2.4445287749585987e-07, 'val/num_eos_tokens': 0, 'lr': 2.11905928466438e-05, 'episode': 4708, 'epoch': 0.58}



                                                                         
 58%|█████▊    | 1178/2041 [1:42:19<1:14:24,  5.17s/it]

{'eps': 0, 'objective/kl': 56.947715759277344, 'objective/entropy': 4.57100248336792, 'objective/non_score_reward': -2.847385883331299, 'objective/rlhf_reward': -4.58336067199707, 'objective/scores': -1.7359747886657715, 'policy/approxkl_avg': 0.00023664541367907077, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.001434113597497344, 'loss/value_avg': 0.058310091495513916, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13314616680145264, 'val/ratio': 0.9995195865631104, 'val/ratio_var': 4.311479813168262e-07, 'val/num_eos_tokens': 0, 'lr': 2.116609505144537e-05, 'episode': 4712, 'epoch': 0.58}



                                                                         
 58%|█████▊    | 1179/2041 [1:42:24<1:14:06,  5.16s/it]

{'eps': 0, 'objective/kl': 60.957969665527344, 'objective/entropy': 7.336262226104736, 'objective/non_score_reward': -3.047898769378662, 'objective/rlhf_reward': -4.859412670135498, 'objective/scores': -1.8115137815475464, 'policy/approxkl_avg': 0.003386708674952388, 'policy/clipfrac_avg': 0.025943396613001823, 'loss/policy_avg': -0.004107555374503136, 'loss/value_avg': 0.05832795053720474, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1265750527381897, 'val/ratio': 0.9955242276191711, 'val/ratio_var': 1.4111795280769002e-05, 'val/num_eos_tokens': 0, 'lr': 2.114159725624694e-05, 'episode': 4716, 'epoch': 0.58}



                                                                         
 58%|█████▊    | 1180/2041 [1:42:29<1:14:26,  5.19s/it]

{'eps': 0, 'objective/kl': 69.02788543701172, 'objective/entropy': 34.86149597167969, 'objective/non_score_reward': -3.451394557952881, 'objective/rlhf_reward': -4.784707546234131, 'objective/scores': -1.33331298828125, 'policy/approxkl_avg': 0.032069750130176544, 'policy/clipfrac_avg': 0.0766509473323822, 'loss/policy_avg': -0.031361550092697144, 'loss/value_avg': 0.14926542341709137, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6383031606674194, 'val/ratio': 0.9769771695137024, 'val/ratio_var': 0.0003375106316525489, 'val/num_eos_tokens': 0, 'lr': 2.1117099461048506e-05, 'episode': 4720, 'epoch': 0.58}



                                                                         
 58%|█████▊    | 1181/2041 [1:42:35<1:14:45,  5.22s/it]

{'eps': 0, 'objective/kl': 58.87769317626953, 'objective/entropy': 9.486490249633789, 'objective/non_score_reward': -2.9438846111297607, 'objective/rlhf_reward': -4.910006999969482, 'objective/scores': -1.9661223888397217, 'policy/approxkl_avg': 0.04405050352215767, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.00571393221616745, 'loss/value_avg': 0.05065536126494408, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09472943842411041, 'val/ratio': 0.9970738887786865, 'val/ratio_var': 4.236469521856634e-06, 'val/num_eos_tokens': 0, 'lr': 2.1092601665850077e-05, 'episode': 4724, 'epoch': 0.58}



                                                                         
 58%|█████▊    | 1182/2041 [1:42:40<1:14:30,  5.20s/it]

{'eps': 0, 'objective/kl': 67.54981231689453, 'objective/entropy': 29.36172103881836, 'objective/non_score_reward': -3.377490758895874, 'objective/rlhf_reward': -4.975250244140625, 'objective/scores': -1.597759485244751, 'policy/approxkl_avg': 0.11848318576812744, 'policy/clipfrac_avg': 0.060141511261463165, 'loss/policy_avg': -0.02619807980954647, 'loss/value_avg': 0.169438898563385, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5381520986557007, 'val/ratio': 0.9700327515602112, 'val/ratio_var': 0.0005007149302400649, 'val/num_eos_tokens': 0, 'lr': 2.1068103870651642e-05, 'episode': 4728, 'epoch': 0.58}



                                                                         
 58%|█████▊    | 1183/2041 [1:42:45<1:13:59,  5.17s/it]

{'eps': 0, 'objective/kl': 56.97342300415039, 'objective/entropy': 4.044988632202148, 'objective/non_score_reward': -2.8486709594726562, 'objective/rlhf_reward': -4.614494800567627, 'objective/scores': -1.7658239603042603, 'policy/approxkl_avg': 0.00014034633932169527, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0011598496930673718, 'loss/value_avg': 0.04644530266523361, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1169596016407013, 'val/ratio': 0.9967352747917175, 'val/ratio_var': 8.293139217130374e-06, 'val/num_eos_tokens': 0, 'lr': 2.104360607545321e-05, 'episode': 4732, 'epoch': 0.58}



                                                                         
 58%|█████▊    | 1184/2041 [1:42:50<1:14:14,  5.20s/it]

{'eps': 0, 'objective/kl': 55.31780242919922, 'objective/entropy': 7.783125877380371, 'objective/non_score_reward': -2.765890121459961, 'objective/rlhf_reward': -5.296475410461426, 'objective/scores': -2.5305850505828857, 'policy/approxkl_avg': 0.0015119323506951332, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.004110739566385746, 'loss/value_avg': 0.08157476782798767, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.14742380380630493, 'val/ratio': 0.9961738586425781, 'val/ratio_var': 1.5204689589154441e-05, 'val/num_eos_tokens': 0, 'lr': 2.1019108280254778e-05, 'episode': 4736, 'epoch': 0.58}



                                                                         
 58%|█████▊    | 1185/2041 [1:42:55<1:13:55,  5.18s/it]

{'eps': 0, 'objective/kl': 51.99215316772461, 'objective/entropy': 3.6419854164123535, 'objective/non_score_reward': -2.5996077060699463, 'objective/rlhf_reward': -4.619977951049805, 'objective/scores': -2.0203704833984375, 'policy/approxkl_avg': 0.0016710801282897592, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.002662312239408493, 'loss/value_avg': 0.05720289424061775, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09710200875997543, 'val/ratio': 0.9995602369308472, 'val/ratio_var': 3.439505462665693e-07, 'val/num_eos_tokens': 0, 'lr': 2.0994610485056346e-05, 'episode': 4740, 'epoch': 0.58}



                                                                         
 58%|█████▊    | 1186/2041 [1:43:01<1:14:02,  5.20s/it]

{'eps': 0, 'objective/kl': 59.03620529174805, 'objective/entropy': 6.451117515563965, 'objective/non_score_reward': -2.951810359954834, 'objective/rlhf_reward': -4.878695487976074, 'objective/scores': -1.9268848896026611, 'policy/approxkl_avg': 0.001972273923456669, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.007975328713655472, 'loss/value_avg': 0.07550057023763657, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.10736491531133652, 'val/ratio': 0.9945042133331299, 'val/ratio_var': 1.985739254450891e-05, 'val/num_eos_tokens': 0, 'lr': 2.0970112689857914e-05, 'episode': 4744, 'epoch': 0.58}



                                                                         
 58%|█████▊    | 1187/2041 [1:43:06<1:13:54,  5.19s/it]

{'eps': 0, 'objective/kl': 58.970123291015625, 'objective/entropy': 6.043631553649902, 'objective/non_score_reward': -2.9485063552856445, 'objective/rlhf_reward': -4.88599157333374, 'objective/scores': -1.9374852180480957, 'policy/approxkl_avg': 0.0010816813446581364, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.0018815526273101568, 'loss/value_avg': 0.05819923058152199, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11803814023733139, 'val/ratio': 0.9955230951309204, 'val/ratio_var': 1.5473595340154134e-05, 'val/num_eos_tokens': 0, 'lr': 2.0945614894659482e-05, 'episode': 4748, 'epoch': 0.58}



                                                                         
 58%|█████▊    | 1188/2041 [1:43:11<1:13:30,  5.17s/it]

{'eps': 0, 'objective/kl': 53.53807830810547, 'objective/entropy': 9.150177001953125, 'objective/non_score_reward': -2.6769039630889893, 'objective/rlhf_reward': -4.757078170776367, 'objective/scores': -2.080173969268799, 'policy/approxkl_avg': 0.0016801158199086785, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.004018651321530342, 'loss/value_avg': 0.06944293528795242, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.08179567754268646, 'val/ratio': 0.9948323369026184, 'val/ratio_var': 2.083649997075554e-05, 'val/num_eos_tokens': 0, 'lr': 2.092111709946105e-05, 'episode': 4752, 'epoch': 0.58}



                                                                         
 58%|█████▊    | 1189/2041 [1:43:16<1:13:00,  5.14s/it]

{'eps': 0, 'objective/kl': 63.13288497924805, 'objective/entropy': 3.8683183193206787, 'objective/non_score_reward': -3.156644344329834, 'objective/rlhf_reward': -5.416746139526367, 'objective/scores': -2.260101795196533, 'policy/approxkl_avg': 0.009192650206387043, 'policy/clipfrac_avg': 0.029481131583452225, 'loss/policy_avg': -0.008083722554147243, 'loss/value_avg': 0.06550389528274536, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.10144390165805817, 'val/ratio': 0.9863530397415161, 'val/ratio_var': 0.00012379435065668076, 'val/num_eos_tokens': 0, 'lr': 2.0896619304262618e-05, 'episode': 4756, 'epoch': 0.58}



                                                                         
 58%|█████▊    | 1190/2041 [1:43:21<1:13:00,  5.15s/it]

{'eps': 0, 'objective/kl': 52.90343475341797, 'objective/entropy': 0.9640002250671387, 'objective/non_score_reward': -2.645171880722046, 'objective/rlhf_reward': -4.530129909515381, 'objective/scores': -1.884958028793335, 'policy/approxkl_avg': 9.967904770746827e-05, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.001003779238089919, 'loss/value_avg': 0.03743362799286842, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.036358192563056946, 'val/ratio': 0.9995688199996948, 'val/ratio_var': 5.440143695523147e-07, 'val/num_eos_tokens': 0, 'lr': 2.0872121509064186e-05, 'episode': 4760, 'epoch': 0.58}



                                                                         
 58%|█████▊    | 1191/2041 [1:43:26<1:12:52,  5.14s/it]

{'eps': 0, 'objective/kl': 51.38977813720703, 'objective/entropy': 1.1366448402404785, 'objective/non_score_reward': -2.569489002227783, 'objective/rlhf_reward': -4.2468180656433105, 'objective/scores': -1.6773289442062378, 'policy/approxkl_avg': 0.0007003500941209495, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.0024946718476712704, 'loss/value_avg': 0.04565669223666191, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.052561596035957336, 'val/ratio': 0.999329686164856, 'val/ratio_var': 6.479842795670265e-07, 'val/num_eos_tokens': 0, 'lr': 2.084762371386575e-05, 'episode': 4764, 'epoch': 0.58}



                                                                         
 58%|█████▊    | 1192/2041 [1:43:31<1:12:53,  5.15s/it]

{'eps': 0, 'objective/kl': 64.04957580566406, 'objective/entropy': 8.564788818359375, 'objective/non_score_reward': -3.2024786472320557, 'objective/rlhf_reward': -5.151250839233398, 'objective/scores': -1.9487724304199219, 'policy/approxkl_avg': 0.009829668328166008, 'policy/clipfrac_avg': 0.02004716917872429, 'loss/policy_avg': -0.012299779802560806, 'loss/value_avg': 0.056118715554475784, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13166050612926483, 'val/ratio': 0.9878690242767334, 'val/ratio_var': 0.00010172533802688122, 'val/num_eos_tokens': 0, 'lr': 2.0823125918667322e-05, 'episode': 4768, 'epoch': 0.58}



                                                                         
 58%|█████▊    | 1193/2041 [1:43:36<1:12:30,  5.13s/it]

{'eps': 0, 'objective/kl': 56.34156799316406, 'objective/entropy': 2.250223398208618, 'objective/non_score_reward': -2.8170783519744873, 'objective/rlhf_reward': -4.609199523925781, 'objective/scores': -1.7921210527420044, 'policy/approxkl_avg': 0.008342485874891281, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.0015984318451955914, 'loss/value_avg': 0.04535803198814392, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.07502982020378113, 'val/ratio': 0.9910263419151306, 'val/ratio_var': 4.821140828425996e-05, 'val/num_eos_tokens': 0, 'lr': 2.0798628123468887e-05, 'episode': 4772, 'epoch': 0.58}



                                                                         
 59%|█████▊    | 1194/2041 [1:43:42<1:12:35,  5.14s/it]

{'eps': 0, 'objective/kl': 58.10320281982422, 'objective/entropy': 2.029139995574951, 'objective/non_score_reward': -2.9051599502563477, 'objective/rlhf_reward': -4.784704208374023, 'objective/scores': -1.8795444965362549, 'policy/approxkl_avg': 0.007859556935727596, 'policy/clipfrac_avg': 0.008254717104136944, 'loss/policy_avg': -0.004781208001077175, 'loss/value_avg': 0.032913558185100555, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.05649333819746971, 'val/ratio': 0.9951345324516296, 'val/ratio_var': 1.9900066035916097e-05, 'val/num_eos_tokens': 0, 'lr': 2.0774130328270458e-05, 'episode': 4776, 'epoch': 0.59}



                                                                         
 59%|█████▊    | 1195/2041 [1:43:47<1:12:23,  5.13s/it]

{'eps': 0, 'objective/kl': 72.51277160644531, 'objective/entropy': 17.00902557373047, 'objective/non_score_reward': -3.625638484954834, 'objective/rlhf_reward': -5.296465873718262, 'objective/scores': -1.6708275079727173, 'policy/approxkl_avg': 0.012156527489423752, 'policy/clipfrac_avg': 0.05188679322600365, 'loss/policy_avg': -0.022609230130910873, 'loss/value_avg': 0.30163103342056274, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.4001155495643616, 'val/ratio': 0.9852114915847778, 'val/ratio_var': 0.00013497100735548884, 'val/num_eos_tokens': 0, 'lr': 2.0749632533072026e-05, 'episode': 4780, 'epoch': 0.59}



                                                                         
 59%|█████▊    | 1196/2041 [1:43:52<1:12:20,  5.14s/it]

{'eps': 0, 'objective/kl': 84.44380187988281, 'objective/entropy': 34.095062255859375, 'objective/non_score_reward': -4.2221903800964355, 'objective/rlhf_reward': -6.150584697723389, 'objective/scores': -1.9283943176269531, 'policy/approxkl_avg': 0.011578093282878399, 'policy/clipfrac_avg': 0.07193396240472794, 'loss/policy_avg': -0.027761297300457954, 'loss/value_avg': 0.27694886922836304, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5734107494354248, 'val/ratio': 0.9994780421257019, 'val/ratio_var': 1.411639345860749e-06, 'val/num_eos_tokens': 0, 'lr': 2.072513473787359e-05, 'episode': 4784, 'epoch': 0.59}



                                                                         
 59%|█████▊    | 1197/2041 [1:43:57<1:12:31,  5.16s/it]

{'eps': 0, 'objective/kl': 69.96739196777344, 'objective/entropy': 8.969300270080566, 'objective/non_score_reward': -3.4983694553375244, 'objective/rlhf_reward': -4.836729526519775, 'objective/scores': -1.338360071182251, 'policy/approxkl_avg': 0.0064485990442335606, 'policy/clipfrac_avg': 0.040094342082738876, 'loss/policy_avg': -0.0168094951659441, 'loss/value_avg': 0.1293615996837616, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2721101641654968, 'val/ratio': 0.9942808151245117, 'val/ratio_var': 1.8472777810529806e-05, 'val/num_eos_tokens': 0, 'lr': 2.0700636942675162e-05, 'episode': 4788, 'epoch': 0.59}



                                                                         
 59%|█████▊    | 1198/2041 [1:44:02<1:12:22,  5.15s/it]

{'eps': 0, 'objective/kl': 62.16349411010742, 'objective/entropy': 8.165460586547852, 'objective/non_score_reward': -3.1081748008728027, 'objective/rlhf_reward': -4.772561073303223, 'objective/scores': -1.6643861532211304, 'policy/approxkl_avg': 0.017590591683983803, 'policy/clipfrac_avg': 0.02712264098227024, 'loss/policy_avg': -0.013464119285345078, 'loss/value_avg': 0.1279529631137848, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.19878605008125305, 'val/ratio': 0.982939600944519, 'val/ratio_var': 0.00017275575373787433, 'val/num_eos_tokens': 0, 'lr': 2.0676139147476727e-05, 'episode': 4792, 'epoch': 0.59}



                                                                         
 59%|█████▊    | 1199/2041 [1:44:07<1:12:16,  5.15s/it]

{'eps': 0, 'objective/kl': 70.47047424316406, 'objective/entropy': 43.573482513427734, 'objective/non_score_reward': -3.5235238075256348, 'objective/rlhf_reward': -5.510873317718506, 'objective/scores': -1.987349510192871, 'policy/approxkl_avg': 0.07500516623258591, 'policy/clipfrac_avg': 0.10495283454656601, 'loss/policy_avg': -0.040188267827034, 'loss/value_avg': 0.28097158670425415, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8208814859390259, 'val/ratio': 0.9608734250068665, 'val/ratio_var': 0.0009690274018794298, 'val/num_eos_tokens': 0, 'lr': 2.0651641352278295e-05, 'episode': 4796, 'epoch': 0.59}



                                                                         
 59%|█████▉    | 1200/2041 [1:44:13<1:12:05,  5.14s/it]

{'eps': 0, 'objective/kl': 55.6983528137207, 'objective/entropy': 2.7828989028930664, 'objective/non_score_reward': -2.7849178314208984, 'objective/rlhf_reward': -4.683579444885254, 'objective/scores': -1.8986616134643555, 'policy/approxkl_avg': 0.0015312755713239312, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.005358767695724964, 'loss/value_avg': 0.03673926368355751, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.06691018491983414, 'val/ratio': 1.000708818435669, 'val/ratio_var': 3.301831554836099e-07, 'val/num_eos_tokens': 0, 'lr': 2.0627143557079863e-05, 'episode': 4800, 'epoch': 0.59}



                                                                         
 59%|█████▉    | 1201/2041 [1:44:18<1:12:14,  5.16s/it]

{'eps': 0, 'objective/kl': 55.10817337036133, 'objective/entropy': 7.633549213409424, 'objective/non_score_reward': -2.755408763885498, 'objective/rlhf_reward': -4.506969928741455, 'objective/scores': -1.751561164855957, 'policy/approxkl_avg': 0.005242799408733845, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.010027457028627396, 'loss/value_avg': 0.04274692386388779, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0881325975060463, 'val/ratio': 0.9954560399055481, 'val/ratio_var': 1.4980026207922492e-05, 'val/num_eos_tokens': 0, 'lr': 2.060264576188143e-05, 'episode': 4804, 'epoch': 0.59}



                                                                         
 59%|█████▉    | 1202/2041 [1:44:23<1:12:22,  5.18s/it]

{'eps': 0, 'objective/kl': 69.81838989257812, 'objective/entropy': 23.139785766601562, 'objective/non_score_reward': -3.490919589996338, 'objective/rlhf_reward': -5.294957160949707, 'objective/scores': -1.8040378093719482, 'policy/approxkl_avg': 0.04350591078400612, 'policy/clipfrac_avg': 0.07429245114326477, 'loss/policy_avg': -0.028888773173093796, 'loss/value_avg': 0.14416484534740448, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.46795982122421265, 'val/ratio': 0.9744766354560852, 'val/ratio_var': 0.0003391755744814873, 'val/num_eos_tokens': 0, 'lr': 2.0578147966683e-05, 'episode': 4808, 'epoch': 0.59}



                                                                         
 59%|█████▉    | 1203/2041 [1:44:28<1:11:53,  5.15s/it]

{'eps': 0, 'objective/kl': 69.54452514648438, 'objective/entropy': 5.114681720733643, 'objective/non_score_reward': -3.4772262573242188, 'objective/rlhf_reward': -5.930908203125, 'objective/scores': -2.4536819458007812, 'policy/approxkl_avg': 0.006667100824415684, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.00790755357593298, 'loss/value_avg': 0.24138236045837402, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.07669997960329056, 'val/ratio': 0.9925358295440674, 'val/ratio_var': 3.2119289244292304e-05, 'val/num_eos_tokens': 0, 'lr': 2.0553650171484567e-05, 'episode': 4812, 'epoch': 0.59}



                                                                         
 59%|█████▉    | 1204/2041 [1:44:33<1:12:07,  5.17s/it]

{'eps': 0, 'objective/kl': 60.26059341430664, 'objective/entropy': 7.1986565589904785, 'objective/non_score_reward': -3.0130295753479004, 'objective/rlhf_reward': -5.32949161529541, 'objective/scores': -2.3164620399475098, 'policy/approxkl_avg': 0.006928283255547285, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.014314856380224228, 'loss/value_avg': 0.10488513112068176, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11989608407020569, 'val/ratio': 0.9850947856903076, 'val/ratio_var': 0.00020403361122589558, 'val/num_eos_tokens': 0, 'lr': 2.0529152376286135e-05, 'episode': 4816, 'epoch': 0.59}



                                                                         
 59%|█████▉    | 1205/2041 [1:44:38<1:11:55,  5.16s/it]

{'eps': 0, 'objective/kl': 57.618255615234375, 'objective/entropy': 2.9096877574920654, 'objective/non_score_reward': -2.8809127807617188, 'objective/rlhf_reward': -4.910002708435059, 'objective/scores': -2.0290896892547607, 'policy/approxkl_avg': 0.0006569710676558316, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': 0.000755062501411885, 'loss/value_avg': 0.03032800555229187, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.07641439139842987, 'val/ratio': 1.0000606775283813, 'val/ratio_var': 1.628446888446433e-08, 'val/num_eos_tokens': 0, 'lr': 2.0504654581087703e-05, 'episode': 4820, 'epoch': 0.59}



                                                                         
 59%|█████▉    | 1206/2041 [1:44:44<1:11:43,  5.15s/it]

{'eps': 0, 'objective/kl': 60.23030471801758, 'objective/entropy': 8.712395668029785, 'objective/non_score_reward': -3.0115153789520264, 'objective/rlhf_reward': -5.0509490966796875, 'objective/scores': -2.039433479309082, 'policy/approxkl_avg': 0.0012198644690215588, 'policy/clipfrac_avg': 0.008254717104136944, 'loss/policy_avg': -0.006642347201704979, 'loss/value_avg': 0.1537601351737976, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15256008505821228, 'val/ratio': 0.9958963394165039, 'val/ratio_var': 1.9425984646659344e-05, 'val/num_eos_tokens': 0, 'lr': 2.048015678588927e-05, 'episode': 4824, 'epoch': 0.59}



                                                                         
 59%|█████▉    | 1207/2041 [1:44:49<1:11:36,  5.15s/it]

{'eps': 0, 'objective/kl': 56.13035202026367, 'objective/entropy': 7.755815029144287, 'objective/non_score_reward': -2.806518077850342, 'objective/rlhf_reward': -4.79111385345459, 'objective/scores': -1.984595775604248, 'policy/approxkl_avg': 0.0007199348183348775, 'policy/clipfrac_avg': 0.008254717104136944, 'loss/policy_avg': -0.004552599508315325, 'loss/value_avg': 0.07618095725774765, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.133287250995636, 'val/ratio': 0.9987430572509766, 'val/ratio_var': 1.8391557432551053e-06, 'val/num_eos_tokens': 0, 'lr': 2.0455658990690836e-05, 'episode': 4828, 'epoch': 0.59}



                                                                         
 59%|█████▉    | 1208/2041 [1:44:54<1:11:40,  5.16s/it]

{'eps': 0, 'objective/kl': 58.76249694824219, 'objective/entropy': 9.488739013671875, 'objective/non_score_reward': -2.938124895095825, 'objective/rlhf_reward': -5.126770496368408, 'objective/scores': -2.188645601272583, 'policy/approxkl_avg': 0.0031717021483927965, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.010285861790180206, 'loss/value_avg': 0.07437282800674438, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.17470718920230865, 'val/ratio': 0.9950889348983765, 'val/ratio_var': 2.4765884518274106e-05, 'val/num_eos_tokens': 0, 'lr': 2.0431161195492407e-05, 'episode': 4832, 'epoch': 0.59}



                                                                         
 59%|█████▉    | 1209/2041 [1:44:59<1:11:29,  5.16s/it]

{'eps': 0, 'objective/kl': 64.16339874267578, 'objective/entropy': 6.644688129425049, 'objective/non_score_reward': -3.208169937133789, 'objective/rlhf_reward': -5.122195243835449, 'objective/scores': -1.9140254259109497, 'policy/approxkl_avg': 0.0059821936301887035, 'policy/clipfrac_avg': 0.041273586452007294, 'loss/policy_avg': -0.006113711278885603, 'loss/value_avg': 0.0589563213288784, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13293075561523438, 'val/ratio': 0.9932811260223389, 'val/ratio_var': 3.199618004146032e-05, 'val/num_eos_tokens': 0, 'lr': 2.0406663400293975e-05, 'episode': 4836, 'epoch': 0.59}



                                                                         
 59%|█████▉    | 1210/2041 [1:45:04<1:11:43,  5.18s/it]

{'eps': 0, 'objective/kl': 66.31444549560547, 'objective/entropy': 7.422784805297852, 'objective/non_score_reward': -3.3157224655151367, 'objective/rlhf_reward': -5.563088417053223, 'objective/scores': -2.247365951538086, 'policy/approxkl_avg': 0.0009565292857587337, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.004936509765684605, 'loss/value_avg': 0.12362195551395416, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11611476540565491, 'val/ratio': 0.9957407712936401, 'val/ratio_var': 1.4087609997659456e-05, 'val/num_eos_tokens': 0, 'lr': 2.0382165605095544e-05, 'episode': 4840, 'epoch': 0.59}



                                                                         
 59%|█████▉    | 1211/2041 [1:45:09<1:11:05,  5.14s/it]

{'eps': 0, 'objective/kl': 70.25640869140625, 'objective/entropy': 5.501821517944336, 'objective/non_score_reward': -3.51282000541687, 'objective/rlhf_reward': -6.199546813964844, 'objective/scores': -2.6867270469665527, 'policy/approxkl_avg': 0.007109662983566523, 'policy/clipfrac_avg': 0.017688680440187454, 'loss/policy_avg': -0.005664927884936333, 'loss/value_avg': 0.19568219780921936, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.08081506937742233, 'val/ratio': 0.9938734769821167, 'val/ratio_var': 2.2225853172130883e-05, 'val/num_eos_tokens': 0, 'lr': 2.035766780989711e-05, 'episode': 4844, 'epoch': 0.59}



                                                                         
 59%|█████▉    | 1212/2041 [1:45:14<1:11:00,  5.14s/it]

{'eps': 0, 'objective/kl': 59.946441650390625, 'objective/entropy': 2.0644564628601074, 'objective/non_score_reward': -2.9973220825195312, 'objective/rlhf_reward': -4.776088237762451, 'objective/scores': -1.77876615524292, 'policy/approxkl_avg': 0.0005809659487567842, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.004951130133122206, 'loss/value_avg': 0.03659990429878235, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.0716436430811882, 'val/ratio': 0.9963915348052979, 'val/ratio_var': 1.7200012734974734e-05, 'val/num_eos_tokens': 0, 'lr': 2.0333170014698676e-05, 'episode': 4848, 'epoch': 0.59}



                                                                         
 59%|█████▉    | 1213/2041 [1:45:20<1:11:04,  5.15s/it]

{'eps': 0, 'objective/kl': 59.40645217895508, 'objective/entropy': 1.9902763366699219, 'objective/non_score_reward': -2.970322847366333, 'objective/rlhf_reward': -4.827651500701904, 'objective/scores': -1.8573285341262817, 'policy/approxkl_avg': 0.0010205230209976435, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.0041428483091294765, 'loss/value_avg': 0.0326477512717247, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.07224031537771225, 'val/ratio': 0.9950443506240845, 'val/ratio_var': 2.55925642704824e-05, 'val/num_eos_tokens': 0, 'lr': 2.0308672219500248e-05, 'episode': 4852, 'epoch': 0.59}



                                                                         
 59%|█████▉    | 1214/2041 [1:45:25<1:10:55,  5.15s/it]

{'eps': 0, 'objective/kl': 62.39215850830078, 'objective/entropy': 10.31033706665039, 'objective/non_score_reward': -3.119608163833618, 'objective/rlhf_reward': -5.122878074645996, 'objective/scores': -2.003270149230957, 'policy/approxkl_avg': 0.009394032880663872, 'policy/clipfrac_avg': 0.03537736088037491, 'loss/policy_avg': -0.011291722767055035, 'loss/value_avg': 0.0433785542845726, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.16720448434352875, 'val/ratio': 0.9977987408638, 'val/ratio_var': 7.3409414653724525e-06, 'val/num_eos_tokens': 0, 'lr': 2.0284174424301812e-05, 'episode': 4856, 'epoch': 0.59}



                                                                         
 60%|█████▉    | 1215/2041 [1:45:30<1:10:45,  5.14s/it]

{'eps': 0, 'objective/kl': 62.326786041259766, 'objective/entropy': 26.458255767822266, 'objective/non_score_reward': -3.1163392066955566, 'objective/rlhf_reward': -4.885252952575684, 'objective/scores': -1.768913745880127, 'policy/approxkl_avg': 0.00870533473789692, 'policy/clipfrac_avg': 0.05424528568983078, 'loss/policy_avg': -0.017780084162950516, 'loss/value_avg': 0.1260058879852295, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5505036115646362, 'val/ratio': 0.9935784935951233, 'val/ratio_var': 2.3225045879371464e-05, 'val/num_eos_tokens': 0, 'lr': 2.0259676629103384e-05, 'episode': 4860, 'epoch': 0.6}



                                                                         
 60%|█████▉    | 1216/2041 [1:45:35<1:10:51,  5.15s/it]

{'eps': 0, 'objective/kl': 68.12030029296875, 'objective/entropy': 17.077072143554688, 'objective/non_score_reward': -3.406015157699585, 'objective/rlhf_reward': -5.808622360229492, 'objective/scores': -2.4026074409484863, 'policy/approxkl_avg': 0.01007079053670168, 'policy/clipfrac_avg': 0.04245283454656601, 'loss/policy_avg': -0.014474686235189438, 'loss/value_avg': 0.09078246355056763, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2251628339290619, 'val/ratio': 0.9932774305343628, 'val/ratio_var': 2.0553679860313423e-05, 'val/num_eos_tokens': 0, 'lr': 2.023517883390495e-05, 'episode': 4864, 'epoch': 0.6}



                                                                         
 60%|█████▉    | 1217/2041 [1:45:40<1:10:40,  5.15s/it]

{'eps': 0, 'objective/kl': 56.75779724121094, 'objective/entropy': 6.910442352294922, 'objective/non_score_reward': -2.8378899097442627, 'objective/rlhf_reward': -5.039605617523193, 'objective/scores': -2.2017157077789307, 'policy/approxkl_avg': 0.004379958380013704, 'policy/clipfrac_avg': 0.025943396613001823, 'loss/policy_avg': -0.00573978666216135, 'loss/value_avg': 0.06922221183776855, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12617316842079163, 'val/ratio': 0.9954715967178345, 'val/ratio_var': 1.7080690668080933e-05, 'val/num_eos_tokens': 0, 'lr': 2.0210681038706516e-05, 'episode': 4868, 'epoch': 0.6}



                                                                         
 60%|█████▉    | 1218/2041 [1:45:45<1:10:30,  5.14s/it]

{'eps': 0, 'objective/kl': 54.57485580444336, 'objective/entropy': 1.5169260501861572, 'objective/non_score_reward': -2.728743076324463, 'objective/rlhf_reward': -4.713150978088379, 'objective/scores': -1.984407901763916, 'policy/approxkl_avg': 0.00019853247795253992, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.00041011764551512897, 'loss/value_avg': 0.03499501198530197, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.06487619876861572, 'val/ratio': 0.9994263648986816, 'val/ratio_var': 4.401072146720253e-07, 'val/num_eos_tokens': 0, 'lr': 2.0186183243508085e-05, 'episode': 4872, 'epoch': 0.6}



                                                                         
 60%|█████▉    | 1219/2041 [1:45:50<1:10:34,  5.15s/it]

{'eps': 0, 'objective/kl': 53.80003356933594, 'objective/entropy': 12.071816444396973, 'objective/non_score_reward': -2.6900014877319336, 'objective/rlhf_reward': -4.389886379241943, 'objective/scores': -1.6998848915100098, 'policy/approxkl_avg': 0.002458930481225252, 'policy/clipfrac_avg': 0.020047171041369438, 'loss/policy_avg': -0.012226960621774197, 'loss/value_avg': 0.058659009635448456, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.13210192322731018, 'val/ratio': 0.9964193105697632, 'val/ratio_var': 1.056829478329746e-05, 'val/num_eos_tokens': 0, 'lr': 2.0161685448309653e-05, 'episode': 4876, 'epoch': 0.6}



                                                                         
 60%|█████▉    | 1220/2041 [1:45:56<1:10:34,  5.16s/it]

{'eps': 0, 'objective/kl': 55.45669937133789, 'objective/entropy': 4.892033100128174, 'objective/non_score_reward': -2.7728350162506104, 'objective/rlhf_reward': -5.590760231018066, 'objective/scores': -2.817925453186035, 'policy/approxkl_avg': 0.0006903498433530331, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.0010596569627523422, 'loss/value_avg': 0.09426462650299072, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.11355727910995483, 'val/ratio': 1.0019252300262451, 'val/ratio_var': 1.7496171267339378e-06, 'val/num_eos_tokens': 0, 'lr': 2.013718765311122e-05, 'episode': 4880, 'epoch': 0.6}



                                                                         
 60%|█████▉    | 1221/2041 [1:46:01<1:10:12,  5.14s/it]

{'eps': 0, 'objective/kl': 57.258934020996094, 'objective/entropy': 6.647016525268555, 'objective/non_score_reward': -2.8629467487335205, 'objective/rlhf_reward': -5.171069145202637, 'objective/scores': -2.3081226348876953, 'policy/approxkl_avg': 0.0003820850688498467, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.002980950754135847, 'loss/value_avg': 0.06347808241844177, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12558653950691223, 'val/ratio': 0.9983609318733215, 'val/ratio_var': 3.881747488776455e-06, 'val/num_eos_tokens': 0, 'lr': 2.011268985791279e-05, 'episode': 4884, 'epoch': 0.6}



                                                                         
 60%|█████▉    | 1222/2041 [1:46:06<1:10:09,  5.14s/it]

{'eps': 0, 'objective/kl': 60.38261032104492, 'objective/entropy': 17.040637969970703, 'objective/non_score_reward': -3.0191304683685303, 'objective/rlhf_reward': -5.328072547912598, 'objective/scores': -2.3089420795440674, 'policy/approxkl_avg': 0.01206815056502819, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.012464940547943115, 'loss/value_avg': 0.06654959172010422, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.21048924326896667, 'val/ratio': 0.9902969002723694, 'val/ratio_var': 6.303624104475603e-05, 'val/num_eos_tokens': 0, 'lr': 2.0088192062714357e-05, 'episode': 4888, 'epoch': 0.6}



                                                                         
 60%|█████▉    | 1223/2041 [1:46:11<1:10:32,  5.17s/it]

{'eps': 0, 'objective/kl': 55.98591995239258, 'objective/entropy': 3.7165536880493164, 'objective/non_score_reward': -2.7992961406707764, 'objective/rlhf_reward': -5.17848014831543, 'objective/scores': -2.3791840076446533, 'policy/approxkl_avg': 0.0006686518318019807, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.002719961106777191, 'loss/value_avg': 0.07869676500558853, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.12053287029266357, 'val/ratio': 0.997616708278656, 'val/ratio_var': 5.7671236390888225e-06, 'val/num_eos_tokens': 0, 'lr': 2.0063694267515925e-05, 'episode': 4892, 'epoch': 0.6}



                                                                         
 60%|█████▉    | 1224/2041 [1:46:16<1:09:52,  5.13s/it]

{'eps': 0, 'objective/kl': 57.402809143066406, 'objective/entropy': 6.626273155212402, 'objective/non_score_reward': -2.870140552520752, 'objective/rlhf_reward': -5.5151448249816895, 'objective/scores': -2.6450042724609375, 'policy/approxkl_avg': 0.0023844176903367043, 'policy/clipfrac_avg': 0.017688680440187454, 'loss/policy_avg': -0.006755932699888945, 'loss/value_avg': 0.07449361681938171, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.10049684345722198, 'val/ratio': 0.9984772801399231, 'val/ratio_var': 1.957637095983955e-06, 'val/num_eos_tokens': 0, 'lr': 2.0039196472317493e-05, 'episode': 4896, 'epoch': 0.6}



                                                                         
 60%|██████    | 1225/2041 [1:46:21<1:09:59,  5.15s/it]

{'eps': 0, 'objective/kl': 55.920040130615234, 'objective/entropy': 3.3765792846679688, 'objective/non_score_reward': -2.79600191116333, 'objective/rlhf_reward': -5.132469177246094, 'objective/scores': -2.3364672660827637, 'policy/approxkl_avg': 0.0014777362812310457, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.005018607713282108, 'loss/value_avg': 0.0381014347076416, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.08141911029815674, 'val/ratio': 0.998045802116394, 'val/ratio_var': 3.6647932120104088e-06, 'val/num_eos_tokens': 0, 'lr': 2.001469867711906e-05, 'episode': 4900, 'epoch': 0.6}



                                                                         
 60%|██████    | 1226/2041 [1:46:30<1:22:59,  6.11s/it]

{'eps': 0, 'objective/kl': 53.38145446777344, 'objective/entropy': 3.4998388290405273, 'objective/non_score_reward': -2.6690728664398193, 'objective/rlhf_reward': -4.817748546600342, 'objective/scores': -2.1486756801605225, 'policy/approxkl_avg': 0.0011577897239476442, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.0003142980858683586, 'loss/value_avg': 0.03881363570690155, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09431031346321106, 'val/ratio': 0.9976550936698914, 'val/ratio_var': 3.7432394037750782e-06, 'val/num_eos_tokens': 0, 'lr': 1.999020088192063e-05, 'episode': 4904, 'epoch': 0.6}



                                                                         
 60%|██████    | 1227/2041 [1:46:35<1:18:36,  5.79s/it]

{'eps': 0, 'objective/kl': 53.46083450317383, 'objective/entropy': 0.8627197742462158, 'objective/non_score_reward': -2.673041820526123, 'objective/rlhf_reward': -4.754012107849121, 'objective/scores': -2.080970525741577, 'policy/approxkl_avg': 7.110290607670322e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0010706714820116758, 'loss/value_avg': 0.03250604867935181, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.05186223238706589, 'val/ratio': 0.9993524551391602, 'val/ratio_var': 4.232103947288124e-07, 'val/num_eos_tokens': 0, 'lr': 1.9965703086722197e-05, 'episode': 4908, 'epoch': 0.6}



                                                                         
 60%|██████    | 1228/2041 [1:46:40<1:16:01,  5.61s/it]

{'eps': 0, 'objective/kl': 72.10435485839844, 'objective/entropy': 48.15373611450195, 'objective/non_score_reward': -3.6052184104919434, 'objective/rlhf_reward': -5.561921119689941, 'objective/scores': -1.9567025899887085, 'policy/approxkl_avg': 0.05613762140274048, 'policy/clipfrac_avg': 0.07783018797636032, 'loss/policy_avg': -0.030399538576602936, 'loss/value_avg': 0.19391897320747375, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7901328802108765, 'val/ratio': 0.9929534196853638, 'val/ratio_var': 2.459727147652302e-05, 'val/num_eos_tokens': 0, 'lr': 1.9941205291523765e-05, 'episode': 4912, 'epoch': 0.6}



                                                                         
 60%|██████    | 1229/2041 [1:46:45<1:13:57,  5.46s/it]

{'eps': 0, 'objective/kl': 65.1776123046875, 'objective/entropy': 28.153120040893555, 'objective/non_score_reward': -3.258880853652954, 'objective/rlhf_reward': -5.445863723754883, 'objective/scores': -2.1869828701019287, 'policy/approxkl_avg': 0.008044760674238205, 'policy/clipfrac_avg': 0.04363207519054413, 'loss/policy_avg': -0.026112377643585205, 'loss/value_avg': 0.1333903670310974, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.49504467844963074, 'val/ratio': 0.9846374988555908, 'val/ratio_var': 0.00015172369603533298, 'val/num_eos_tokens': 0, 'lr': 1.9916707496325333e-05, 'episode': 4916, 'epoch': 0.6}



                                                                         
 60%|██████    | 1230/2041 [1:46:50<1:12:28,  5.36s/it]

{'eps': 0, 'objective/kl': 71.45484924316406, 'objective/entropy': 15.881604194641113, 'objective/non_score_reward': -3.5727427005767822, 'objective/rlhf_reward': -6.108366966247559, 'objective/scores': -2.5356245040893555, 'policy/approxkl_avg': 0.00988966878503561, 'policy/clipfrac_avg': 0.03891509398818016, 'loss/policy_avg': -0.02006208896636963, 'loss/value_avg': 0.1939961016178131, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.29523783922195435, 'val/ratio': 0.9827439188957214, 'val/ratio_var': 0.0002053712960332632, 'val/num_eos_tokens': 0, 'lr': 1.9892209701126898e-05, 'episode': 4920, 'epoch': 0.6}



                                                                         
 60%|██████    | 1231/2041 [1:46:55<1:11:35,  5.30s/it]

{'eps': 0, 'objective/kl': 50.48822784423828, 'objective/entropy': 19.868282318115234, 'objective/non_score_reward': -2.524411678314209, 'objective/rlhf_reward': -5.128175735473633, 'objective/scores': -2.603764295578003, 'policy/approxkl_avg': 0.012255653738975525, 'policy/clipfrac_avg': 0.04009433835744858, 'loss/policy_avg': -0.017017195001244545, 'loss/value_avg': 0.0502166822552681, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.35102009773254395, 'val/ratio': 1.0119506120681763, 'val/ratio_var': 0.00023422185040544719, 'val/num_eos_tokens': 0, 'lr': 1.986771190592847e-05, 'episode': 4924, 'epoch': 0.6}



                                                                         
 60%|██████    | 1232/2041 [1:47:01<1:11:09,  5.28s/it]

{'eps': 0, 'objective/kl': 66.21630859375, 'objective/entropy': 30.778236389160156, 'objective/non_score_reward': -3.3108158111572266, 'objective/rlhf_reward': -4.943516731262207, 'objective/scores': -1.6327009201049805, 'policy/approxkl_avg': 0.006311620585620403, 'policy/clipfrac_avg': 0.03891509398818016, 'loss/policy_avg': -0.017711535096168518, 'loss/value_avg': 0.10085602104663849, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5000244379043579, 'val/ratio': 0.9836671352386475, 'val/ratio_var': 0.00020243394828867167, 'val/num_eos_tokens': 0, 'lr': 1.9843214110730034e-05, 'episode': 4928, 'epoch': 0.6}



                                                                         
 60%|██████    | 1233/2041 [1:47:06<1:10:22,  5.23s/it]

{'eps': 0, 'objective/kl': 54.40397644042969, 'objective/entropy': 0.7039799690246582, 'objective/non_score_reward': -2.720198631286621, 'objective/rlhf_reward': -4.925790309906006, 'objective/scores': -2.2055916786193848, 'policy/approxkl_avg': 3.9374612242681906e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0016314578242599964, 'loss/value_avg': 0.03184403106570244, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.06165068969130516, 'val/ratio': 0.9988148212432861, 'val/ratio_var': 2.06378240363847e-06, 'val/num_eos_tokens': 0, 'lr': 1.9818716315531602e-05, 'episode': 4932, 'epoch': 0.6}



                                                                         
 60%|██████    | 1234/2041 [1:47:11<1:09:41,  5.18s/it]

{'eps': 0, 'objective/kl': 54.21230697631836, 'objective/entropy': 2.1052935123443604, 'objective/non_score_reward': -2.710615634918213, 'objective/rlhf_reward': -5.016859531402588, 'objective/scores': -2.306243896484375, 'policy/approxkl_avg': 0.00022309640189632773, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.0011168931378051639, 'loss/value_avg': 0.03018878772854805, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09504541754722595, 'val/ratio': 0.9977759718894958, 'val/ratio_var': 3.3339185847580666e-06, 'val/num_eos_tokens': 0, 'lr': 1.979421852033317e-05, 'episode': 4936, 'epoch': 0.6}



                                                                         
 61%|██████    | 1235/2041 [1:47:16<1:09:20,  5.16s/it]

{'eps': 0, 'objective/kl': 70.44879913330078, 'objective/entropy': 15.253260612487793, 'objective/non_score_reward': -3.522439956665039, 'objective/rlhf_reward': -6.452581405639648, 'objective/scores': -2.9301412105560303, 'policy/approxkl_avg': 0.0026936205103993416, 'policy/clipfrac_avg': 0.017688680440187454, 'loss/policy_avg': -0.014132342301309109, 'loss/value_avg': 0.14336125552654266, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.35099005699157715, 'val/ratio': 0.9984089136123657, 'val/ratio_var': 1.1581712442421122e-06, 'val/num_eos_tokens': 0, 'lr': 1.9769720725134738e-05, 'episode': 4940, 'epoch': 0.61}



                                                                         
 61%|██████    | 1236/2041 [1:47:21<1:08:54,  5.14s/it]

{'eps': 0, 'objective/kl': 65.80902862548828, 'objective/entropy': 13.676202774047852, 'objective/non_score_reward': -3.2904510498046875, 'objective/rlhf_reward': -5.872368335723877, 'objective/scores': -2.5819172859191895, 'policy/approxkl_avg': 0.005917937960475683, 'policy/clipfrac_avg': 0.025943396613001823, 'loss/policy_avg': -0.01791195757687092, 'loss/value_avg': 0.20659306645393372, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.2519025206565857, 'val/ratio': 0.9897814393043518, 'val/ratio_var': 9.04501139302738e-05, 'val/num_eos_tokens': 0, 'lr': 1.974522292993631e-05, 'episode': 4944, 'epoch': 0.61}



                                                                         
 61%|██████    | 1237/2041 [1:47:26<1:08:43,  5.13s/it]

{'eps': 0, 'objective/kl': 59.21004867553711, 'objective/entropy': 6.7368974685668945, 'objective/non_score_reward': -2.9605021476745605, 'objective/rlhf_reward': -5.367110252380371, 'objective/scores': -2.4066081047058105, 'policy/approxkl_avg': 0.005173748824745417, 'policy/clipfrac_avg': 0.02004716917872429, 'loss/policy_avg': -0.010085995309054852, 'loss/value_avg': 0.04722389951348305, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1606369912624359, 'val/ratio': 0.9926877021789551, 'val/ratio_var': 3.837138137896545e-05, 'val/num_eos_tokens': 0, 'lr': 1.9720725134737874e-05, 'episode': 4948, 'epoch': 0.61}



                                                                         
 61%|██████    | 1238/2041 [1:47:31<1:08:46,  5.14s/it]

{'eps': 0, 'objective/kl': 52.98394775390625, 'objective/entropy': 5.929290294647217, 'objective/non_score_reward': -2.6491973400115967, 'objective/rlhf_reward': -4.898159980773926, 'objective/scores': -2.248962879180908, 'policy/approxkl_avg': 0.004796233959496021, 'policy/clipfrac_avg': 0.02712264284491539, 'loss/policy_avg': -0.009251043200492859, 'loss/value_avg': 0.045130349695682526, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1443115919828415, 'val/ratio': 1.011134147644043, 'val/ratio_var': 9.572383714839816e-05, 'val/num_eos_tokens': 0, 'lr': 1.9696227339539442e-05, 'episode': 4952, 'epoch': 0.61}



                                                                         
 61%|██████    | 1239/2041 [1:47:36<1:08:46,  5.14s/it]

{'eps': 0, 'objective/kl': 45.65812683105469, 'objective/entropy': 4.3040313720703125, 'objective/non_score_reward': -2.2829062938690186, 'objective/rlhf_reward': -4.517520904541016, 'objective/scores': -2.234614372253418, 'policy/approxkl_avg': 0.0008531223866157234, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.00395045755431056, 'loss/value_avg': 0.046896886080503464, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.10952456295490265, 'val/ratio': 0.9979338049888611, 'val/ratio_var': 3.4067252272507176e-06, 'val/num_eos_tokens': 0, 'lr': 1.967172954434101e-05, 'episode': 4956, 'epoch': 0.61}



                                                                         
 61%|██████    | 1240/2041 [1:47:42<1:08:45,  5.15s/it]

{'eps': 0, 'objective/kl': 50.18090057373047, 'objective/entropy': 32.855628967285156, 'objective/non_score_reward': -2.509044885635376, 'objective/rlhf_reward': -4.768056869506836, 'objective/scores': -2.259012222290039, 'policy/approxkl_avg': 0.015281646512448788, 'policy/clipfrac_avg': 0.06603773683309555, 'loss/policy_avg': -0.025515716522932053, 'loss/value_avg': 0.05574977025389671, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5594145655632019, 'val/ratio': 0.983826220035553, 'val/ratio_var': 0.00018559694581199437, 'val/num_eos_tokens': 0, 'lr': 1.9647231749142578e-05, 'episode': 4960, 'epoch': 0.61}



                                                                         
 61%|██████    | 1241/2041 [1:47:47<1:08:43,  5.15s/it]

{'eps': 0, 'objective/kl': 57.393394470214844, 'objective/entropy': 22.85663604736328, 'objective/non_score_reward': -2.8696701526641846, 'objective/rlhf_reward': -5.478475570678711, 'objective/scores': -2.6088056564331055, 'policy/approxkl_avg': 0.0159857589751482, 'policy/clipfrac_avg': 0.06485849618911743, 'loss/policy_avg': -0.02336246147751808, 'loss/value_avg': 0.12774300575256348, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.45902764797210693, 'val/ratio': 0.9827913641929626, 'val/ratio_var': 0.00017581689462531358, 'val/num_eos_tokens': 0, 'lr': 1.9622733953944146e-05, 'episode': 4964, 'epoch': 0.61}



                                                                         
 61%|██████    | 1242/2041 [1:47:52<1:08:32,  5.15s/it]

{'eps': 0, 'objective/kl': 54.71715545654297, 'objective/entropy': 64.55714416503906, 'objective/non_score_reward': -2.7358579635620117, 'objective/rlhf_reward': -5.48622465133667, 'objective/scores': -2.750366687774658, 'policy/approxkl_avg': 0.01737976260483265, 'policy/clipfrac_avg': 0.10495282709598541, 'loss/policy_avg': -0.03572680801153183, 'loss/value_avg': 0.21693503856658936, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.24540376663208, 'val/ratio': 0.9873359799385071, 'val/ratio_var': 9.209082782035694e-05, 'val/num_eos_tokens': 0, 'lr': 1.9598236158745714e-05, 'episode': 4968, 'epoch': 0.61}



                                                                         
 61%|██████    | 1243/2041 [1:47:57<1:08:07,  5.12s/it]

{'eps': 0, 'objective/kl': 76.58511352539062, 'objective/entropy': 79.51675415039062, 'objective/non_score_reward': -3.8292555809020996, 'objective/rlhf_reward': -6.582554340362549, 'objective/scores': -2.753298759460449, 'policy/approxkl_avg': 0.06662670522928238, 'policy/clipfrac_avg': 0.1391509473323822, 'loss/policy_avg': -0.05028432607650757, 'loss/value_avg': 0.47512808442115784, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.507153034210205, 'val/ratio': 1.0035425424575806, 'val/ratio_var': 0.00025482350611127913, 'val/num_eos_tokens': 0, 'lr': 1.9573738363547282e-05, 'episode': 4972, 'epoch': 0.61}



                                                                         
 61%|██████    | 1244/2041 [1:48:02<1:07:47,  5.10s/it]

{'eps': 0, 'objective/kl': 57.38682174682617, 'objective/entropy': 59.08680725097656, 'objective/non_score_reward': -2.8693408966064453, 'objective/rlhf_reward': -5.361645698547363, 'objective/scores': -2.492305040359497, 'policy/approxkl_avg': 0.021387510001659393, 'policy/clipfrac_avg': 0.09198112785816193, 'loss/policy_avg': -0.033462025225162506, 'loss/value_avg': 0.17702503502368927, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1936734914779663, 'val/ratio': 0.9605162143707275, 'val/ratio_var': 0.001174471341073513, 'val/num_eos_tokens': 0, 'lr': 1.954924056834885e-05, 'episode': 4976, 'epoch': 0.61}



                                                                         
 61%|██████    | 1245/2041 [1:48:07<1:07:37,  5.10s/it]

{'eps': 0, 'objective/kl': 69.90751647949219, 'objective/entropy': 49.335662841796875, 'objective/non_score_reward': -3.4953761100769043, 'objective/rlhf_reward': -6.225778579711914, 'objective/scores': -2.730402708053589, 'policy/approxkl_avg': 0.01793469302356243, 'policy/clipfrac_avg': 0.08962263911962509, 'loss/policy_avg': -0.03564415127038956, 'loss/value_avg': 0.26699098944664, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8221002817153931, 'val/ratio': 0.9723175764083862, 'val/ratio_var': 0.0004974423791281879, 'val/num_eos_tokens': 0, 'lr': 1.9524742773150418e-05, 'episode': 4980, 'epoch': 0.61}



                                                                         
 61%|██████    | 1246/2041 [1:48:12<1:07:43,  5.11s/it]

{'eps': 0, 'objective/kl': 71.17029571533203, 'objective/entropy': 77.6658706665039, 'objective/non_score_reward': -3.5585145950317383, 'objective/rlhf_reward': -6.307843208312988, 'objective/scores': -2.74932861328125, 'policy/approxkl_avg': 0.025621967390179634, 'policy/clipfrac_avg': 0.13443395495414734, 'loss/policy_avg': -0.044991135597229004, 'loss/value_avg': 0.3152123689651489, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4482829570770264, 'val/ratio': 0.9994062781333923, 'val/ratio_var': 6.487900827778503e-05, 'val/num_eos_tokens': 0, 'lr': 1.9500244977951983e-05, 'episode': 4984, 'epoch': 0.61}



                                                                         
 61%|██████    | 1247/2041 [1:48:17<1:08:00,  5.14s/it]

{'eps': 0, 'objective/kl': 61.7808837890625, 'objective/entropy': 37.46237564086914, 'objective/non_score_reward': -3.0890443325042725, 'objective/rlhf_reward': -5.586941719055176, 'objective/scores': -2.4978976249694824, 'policy/approxkl_avg': 0.013509836979210377, 'policy/clipfrac_avg': 0.049528300762176514, 'loss/policy_avg': -0.023684075102210045, 'loss/value_avg': 0.15832795202732086, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7261088490486145, 'val/ratio': 0.9740592837333679, 'val/ratio_var': 0.0004976558848284185, 'val/num_eos_tokens': 0, 'lr': 1.9475747182753554e-05, 'episode': 4988, 'epoch': 0.61}



                                                                         
 61%|██████    | 1248/2041 [1:48:23<1:07:45,  5.13s/it]

{'eps': 0, 'objective/kl': 50.22657775878906, 'objective/entropy': 28.497278213500977, 'objective/non_score_reward': -2.511329174041748, 'objective/rlhf_reward': -5.031694412231445, 'objective/scores': -2.5203654766082764, 'policy/approxkl_avg': 0.00995239894837141, 'policy/clipfrac_avg': 0.0554245300590992, 'loss/policy_avg': -0.025081926956772804, 'loss/value_avg': 0.24988235533237457, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6156936883926392, 'val/ratio': 0.9904943704605103, 'val/ratio_var': 6.392548675648868e-05, 'val/num_eos_tokens': 0, 'lr': 1.945124938755512e-05, 'episode': 4992, 'epoch': 0.61}



                                                                         
 61%|██████    | 1249/2041 [1:48:28<1:08:01,  5.15s/it]

{'eps': 0, 'objective/kl': 45.322906494140625, 'objective/entropy': 33.85306167602539, 'objective/non_score_reward': -2.2661452293395996, 'objective/rlhf_reward': -5.0848846435546875, 'objective/scores': -2.818739175796509, 'policy/approxkl_avg': 0.00556223513558507, 'policy/clipfrac_avg': 0.03183962404727936, 'loss/policy_avg': -0.015000496990978718, 'loss/value_avg': 0.22526496648788452, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5378302931785583, 'val/ratio': 0.9982622861862183, 'val/ratio_var': 1.4897514120093547e-06, 'val/num_eos_tokens': 0, 'lr': 1.942675159235669e-05, 'episode': 4996, 'epoch': 0.61}



                                                                         
 61%|██████    | 1250/2041 [1:48:33<1:08:26,  5.19s/it]

{'eps': 0, 'objective/kl': 63.30007553100586, 'objective/entropy': 43.9725341796875, 'objective/non_score_reward': -3.165004014968872, 'objective/rlhf_reward': -5.194418907165527, 'objective/scores': -2.0294151306152344, 'policy/approxkl_avg': 0.03354512155056, 'policy/clipfrac_avg': 0.08726415783166885, 'loss/policy_avg': -0.036890119314193726, 'loss/value_avg': 0.14708225429058075, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8316966891288757, 'val/ratio': 1.0667394399642944, 'val/ratio_var': 0.0039214142598211765, 'val/num_eos_tokens': 0, 'lr': 1.940225379715826e-05, 'episode': 5000, 'epoch': 0.61}



                                                                         
 61%|██████▏   | 1251/2041 [1:48:38<1:08:11,  5.18s/it]

{'eps': 0, 'objective/kl': 37.291046142578125, 'objective/entropy': 3.6659128665924072, 'objective/non_score_reward': -1.8645524978637695, 'objective/rlhf_reward': -3.9924588203430176, 'objective/scores': -2.127906322479248, 'policy/approxkl_avg': 0.0005058776005171239, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.0067334650084376335, 'loss/value_avg': 0.05234111472964287, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.15225014090538025, 'val/ratio': 1.0040026903152466, 'val/ratio_var': 8.032494406506885e-06, 'val/num_eos_tokens': 0, 'lr': 1.9377756001959823e-05, 'episode': 5004, 'epoch': 0.61}



                                                                         
 61%|██████▏   | 1252/2041 [1:48:43<1:08:11,  5.19s/it]

{'eps': 0, 'objective/kl': 53.4803352355957, 'objective/entropy': 46.03202438354492, 'objective/non_score_reward': -2.6740169525146484, 'objective/rlhf_reward': -5.194532871246338, 'objective/scores': -2.5205159187316895, 'policy/approxkl_avg': 0.013121441937983036, 'policy/clipfrac_avg': 0.06485848873853683, 'loss/policy_avg': -0.031962450593709946, 'loss/value_avg': 0.16572052240371704, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8726063370704651, 'val/ratio': 0.9777051210403442, 'val/ratio_var': 0.0003472271200735122, 'val/num_eos_tokens': 0, 'lr': 1.9353258206761395e-05, 'episode': 5008, 'epoch': 0.61}



                                                                         
 61%|██████▏   | 1253/2041 [1:48:49<1:08:04,  5.18s/it]

{'eps': 0, 'objective/kl': 71.94670104980469, 'objective/entropy': 50.02059555053711, 'objective/non_score_reward': -3.597334861755371, 'objective/rlhf_reward': -5.937077522277832, 'objective/scores': -2.339742660522461, 'policy/approxkl_avg': 0.023222189396619797, 'policy/clipfrac_avg': 0.08018868416547775, 'loss/policy_avg': -0.03368214890360832, 'loss/value_avg': 0.3559320867061615, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9102615118026733, 'val/ratio': 0.9929966926574707, 'val/ratio_var': 2.4809733076835983e-05, 'val/num_eos_tokens': 0, 'lr': 1.932876041156296e-05, 'episode': 5012, 'epoch': 0.61}



                                                                         
 61%|██████▏   | 1254/2041 [1:48:54<1:07:58,  5.18s/it]

{'eps': 0, 'objective/kl': 67.4696044921875, 'objective/entropy': 73.86289978027344, 'objective/non_score_reward': -3.3734803199768066, 'objective/rlhf_reward': -5.619314193725586, 'objective/scores': -2.2458338737487793, 'policy/approxkl_avg': 0.03157567232847214, 'policy/clipfrac_avg': 0.1179245263338089, 'loss/policy_avg': -0.045284610241651535, 'loss/value_avg': 0.1439257711172104, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3521138429641724, 'val/ratio': 0.9824603796005249, 'val/ratio_var': 0.00017484881391283125, 'val/num_eos_tokens': 0, 'lr': 1.9304262616364527e-05, 'episode': 5016, 'epoch': 0.61}



                                                                         
 61%|██████▏   | 1255/2041 [1:48:59<1:07:55,  5.19s/it]

{'eps': 0, 'objective/kl': 43.23210144042969, 'objective/entropy': 22.337799072265625, 'objective/non_score_reward': -2.1616051197052, 'objective/rlhf_reward': -4.207817077636719, 'objective/scores': -2.0462119579315186, 'policy/approxkl_avg': 0.020632304251194, 'policy/clipfrac_avg': 0.06603773683309555, 'loss/policy_avg': -0.025750629603862762, 'loss/value_avg': 0.2327548861503601, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.4814334213733673, 'val/ratio': 0.9884334802627563, 'val/ratio_var': 7.723532326053828e-05, 'val/num_eos_tokens': 0, 'lr': 1.9279764821166095e-05, 'episode': 5020, 'epoch': 0.61}



                                                                         
 62%|██████▏   | 1256/2041 [1:49:04<1:07:37,  5.17s/it]

{'eps': 0, 'objective/kl': 55.99063491821289, 'objective/entropy': 26.430015563964844, 'objective/non_score_reward': -2.7995314598083496, 'objective/rlhf_reward': -4.730648994445801, 'objective/scores': -1.9311174154281616, 'policy/approxkl_avg': 0.016229208558797836, 'policy/clipfrac_avg': 0.05306604132056236, 'loss/policy_avg': -0.02149968035519123, 'loss/value_avg': 0.19308793544769287, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.42614734172821045, 'val/ratio': 1.0055328607559204, 'val/ratio_var': 9.177724859910086e-05, 'val/num_eos_tokens': 0, 'lr': 1.9255267025967663e-05, 'episode': 5024, 'epoch': 0.62}



                                                                         
 62%|██████▏   | 1257/2041 [1:49:09<1:07:35,  5.17s/it]

{'eps': 0, 'objective/kl': 63.97780990600586, 'objective/entropy': 68.47112274169922, 'objective/non_score_reward': -3.1988906860351562, 'objective/rlhf_reward': -4.542683124542236, 'objective/scores': -1.3437923192977905, 'policy/approxkl_avg': 0.025335073471069336, 'policy/clipfrac_avg': 0.09198113530874252, 'loss/policy_avg': -0.039213426411151886, 'loss/value_avg': 0.3164101243019104, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3099642992019653, 'val/ratio': 0.9938105344772339, 'val/ratio_var': 1.8405053197056986e-05, 'val/num_eos_tokens': 0, 'lr': 1.923076923076923e-05, 'episode': 5028, 'epoch': 0.62}



                                                                         
 62%|██████▏   | 1258/2041 [1:49:14<1:07:13,  5.15s/it]

{'eps': 0, 'objective/kl': 43.781883239746094, 'objective/entropy': 9.801203727722168, 'objective/non_score_reward': -2.189094066619873, 'objective/rlhf_reward': -3.956324338912964, 'objective/scores': -1.7672302722930908, 'policy/approxkl_avg': 0.4605923295021057, 'policy/clipfrac_avg': 0.05070754885673523, 'loss/policy_avg': -0.02024776116013527, 'loss/value_avg': 0.3611046373844147, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.1877826601266861, 'val/ratio': 0.9870429635047913, 'val/ratio_var': 8.676268043927848e-05, 'val/num_eos_tokens': 0, 'lr': 1.92062714355708e-05, 'episode': 5032, 'epoch': 0.62}



                                                                         
 62%|██████▏   | 1259/2041 [1:49:20<1:07:19,  5.17s/it]

{'eps': 0, 'objective/kl': 42.45014953613281, 'objective/entropy': 0.435957670211792, 'objective/non_score_reward': -2.122507333755493, 'objective/rlhf_reward': -4.02931022644043, 'objective/scores': -1.9068026542663574, 'policy/approxkl_avg': 4.832401464227587e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0005006003193557262, 'loss/value_avg': 0.0428880974650383, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.04247744753956795, 'val/ratio': 1.0019036531448364, 'val/ratio_var': 2.1668613499059575e-06, 'val/num_eos_tokens': 0, 'lr': 1.9181773640372368e-05, 'episode': 5036, 'epoch': 0.62}



                                                                         
 62%|██████▏   | 1260/2041 [1:49:25<1:07:12,  5.16s/it]

{'eps': 0, 'objective/kl': 42.615692138671875, 'objective/entropy': 0.19498205184936523, 'objective/non_score_reward': -2.130784511566162, 'objective/rlhf_reward': -3.8960537910461426, 'objective/scores': -1.765269160270691, 'policy/approxkl_avg': 1.263341800950002e-06, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0001757195423124358, 'loss/value_avg': 0.045546647161245346, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.025567451491951942, 'val/ratio': 1.000304937362671, 'val/ratio_var': 5.3874000371934017e-08, 'val/num_eos_tokens': 0, 'lr': 1.9157275845173936e-05, 'episode': 5040, 'epoch': 0.62}



                                                                         
 62%|██████▏   | 1261/2041 [1:49:30<1:07:12,  5.17s/it]

{'eps': 0, 'objective/kl': 39.20220947265625, 'objective/entropy': 1.4456062316894531, 'objective/non_score_reward': -1.9601106643676758, 'objective/rlhf_reward': -3.9650278091430664, 'objective/scores': -2.0049171447753906, 'policy/approxkl_avg': 0.001695616520009935, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.0005218024016357958, 'loss/value_avg': 0.04192294180393219, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.021910736337304115, 'val/ratio': 1.0056589841842651, 'val/ratio_var': 2.8613249014597386e-05, 'val/num_eos_tokens': 0, 'lr': 1.9132778049975504e-05, 'episode': 5044, 'epoch': 0.62}



                                                                         
 62%|██████▏   | 1262/2041 [1:49:35<1:06:53,  5.15s/it]

{'eps': 0, 'objective/kl': 45.4630126953125, 'objective/entropy': 2.052086353302002, 'objective/non_score_reward': -2.2731504440307617, 'objective/rlhf_reward': -3.8389334678649902, 'objective/scores': -1.565782904624939, 'policy/approxkl_avg': 0.050033386796712875, 'policy/clipfrac_avg': 0.024764152243733406, 'loss/policy_avg': -0.012186167761683464, 'loss/value_avg': 0.061052627861499786, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.09067581593990326, 'val/ratio': 0.9994394183158875, 'val/ratio_var': 3.6142691897111945e-06, 'val/num_eos_tokens': 0, 'lr': 1.910828025477707e-05, 'episode': 5048, 'epoch': 0.62}



                                                                         
 62%|██████▏   | 1263/2041 [1:49:40<1:06:37,  5.14s/it]

{'eps': 0, 'objective/kl': 73.81649780273438, 'objective/entropy': 33.04316329956055, 'objective/non_score_reward': -3.6908247470855713, 'objective/rlhf_reward': -5.133637428283691, 'objective/scores': -1.4428126811981201, 'policy/approxkl_avg': 0.04881307855248451, 'policy/clipfrac_avg': 0.1108490526676178, 'loss/policy_avg': -0.027619237080216408, 'loss/value_avg': 0.3615995943546295, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5580989122390747, 'val/ratio': 0.9735285639762878, 'val/ratio_var': 0.00042411230970174074, 'val/num_eos_tokens': 0, 'lr': 1.908378245957864e-05, 'episode': 5052, 'epoch': 0.62}



                                                                         
 62%|██████▏   | 1264/2041 [1:49:45<1:06:18,  5.12s/it]

{'eps': 0, 'objective/kl': 75.05570983886719, 'objective/entropy': 48.54792785644531, 'objective/non_score_reward': -3.7527856826782227, 'objective/rlhf_reward': -4.49415397644043, 'objective/scores': -0.7413681745529175, 'policy/approxkl_avg': 0.029126228764653206, 'policy/clipfrac_avg': 0.11320754885673523, 'loss/policy_avg': -0.031163888052105904, 'loss/value_avg': 0.7030338644981384, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8811472058296204, 'val/ratio': 1.0130876302719116, 'val/ratio_var': 0.0001874882582342252, 'val/num_eos_tokens': 0, 'lr': 1.9059284664380204e-05, 'episode': 5056, 'epoch': 0.62}



                                                                         
 62%|██████▏   | 1265/2041 [1:49:50<1:06:00,  5.10s/it]

{'eps': 0, 'objective/kl': 71.68196105957031, 'objective/entropy': 21.761211395263672, 'objective/non_score_reward': -3.5840983390808105, 'objective/rlhf_reward': -4.938554763793945, 'objective/scores': -1.3544561862945557, 'policy/approxkl_avg': 0.07263708114624023, 'policy/clipfrac_avg': 0.09080188721418381, 'loss/policy_avg': -0.03328807279467583, 'loss/value_avg': 0.33184629678726196, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.4896639585494995, 'val/ratio': 0.9807283878326416, 'val/ratio_var': 0.000272994366241619, 'val/num_eos_tokens': 0, 'lr': 1.9034786869181776e-05, 'episode': 5060, 'epoch': 0.62}



                                                                         
 62%|██████▏   | 1266/2041 [1:49:55<1:06:06,  5.12s/it]

{'eps': 0, 'objective/kl': 59.606414794921875, 'objective/entropy': 21.116477966308594, 'objective/non_score_reward': -2.980320692062378, 'objective/rlhf_reward': -4.437819480895996, 'objective/scores': -1.4574990272521973, 'policy/approxkl_avg': 0.031139837577939034, 'policy/clipfrac_avg': 0.056603774428367615, 'loss/policy_avg': -0.020927395671606064, 'loss/value_avg': 0.18081273138523102, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.39782077074050903, 'val/ratio': 1.006742238998413, 'val/ratio_var': 4.7957957576727495e-05, 'val/num_eos_tokens': 0, 'lr': 1.9010289073983344e-05, 'episode': 5064, 'epoch': 0.62}



                                                                         
 62%|██████▏   | 1267/2041 [1:50:01<1:06:31,  5.16s/it]

{'eps': 0, 'objective/kl': 53.87797546386719, 'objective/entropy': 13.08605670928955, 'objective/non_score_reward': -2.6938986778259277, 'objective/rlhf_reward': -3.6444225311279297, 'objective/scores': -0.9505237936973572, 'policy/approxkl_avg': 0.017900187522172928, 'policy/clipfrac_avg': 0.044811323285102844, 'loss/policy_avg': -0.019737716764211655, 'loss/value_avg': 0.12469476461410522, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.3158189058303833, 'val/ratio': 0.9930813908576965, 'val/ratio_var': 2.8108384867664427e-05, 'val/num_eos_tokens': 0, 'lr': 1.898579127878491e-05, 'episode': 5068, 'epoch': 0.62}



                                                                         
 62%|██████▏   | 1268/2041 [1:50:06<1:06:40,  5.17s/it]

{'eps': 0, 'objective/kl': 58.14967727661133, 'objective/entropy': 23.655555725097656, 'objective/non_score_reward': -2.9074835777282715, 'objective/rlhf_reward': -3.5184695720672607, 'objective/scores': -0.610986053943634, 'policy/approxkl_avg': 0.04294343292713165, 'policy/clipfrac_avg': 0.0766509473323822, 'loss/policy_avg': -0.024531058967113495, 'loss/value_avg': 0.3329320549964905, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.3595663905143738, 'val/ratio': 1.0087776184082031, 'val/ratio_var': 6.946227949811146e-05, 'val/num_eos_tokens': 0, 'lr': 1.896129348358648e-05, 'episode': 5072, 'epoch': 0.62}



                                                                         
 62%|██████▏   | 1269/2041 [1:50:11<1:06:42,  5.18s/it]

{'eps': 0, 'objective/kl': 91.34689331054688, 'objective/entropy': 36.803043365478516, 'objective/non_score_reward': -4.567344665527344, 'objective/rlhf_reward': -5.652379989624023, 'objective/scores': -1.0850353240966797, 'policy/approxkl_avg': 0.034785058349370956, 'policy/clipfrac_avg': 0.08726415783166885, 'loss/policy_avg': -0.02412334829568863, 'loss/value_avg': 0.4284884035587311, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7557935118675232, 'val/ratio': 1.0061883926391602, 'val/ratio_var': 5.552511356654577e-05, 'val/num_eos_tokens': 0, 'lr': 1.8936795688388045e-05, 'episode': 5076, 'epoch': 0.62}



                                                                         
 62%|██████▏   | 1270/2041 [1:50:16<1:06:19,  5.16s/it]

{'eps': 0, 'objective/kl': 92.4927978515625, 'objective/entropy': 55.83599090576172, 'objective/non_score_reward': -4.624639511108398, 'objective/rlhf_reward': -4.334697723388672, 'objective/scores': 0.28994160890579224, 'policy/approxkl_avg': 0.011559304781258106, 'policy/clipfrac_avg': 0.09198112785816193, 'loss/policy_avg': -0.029897017404437065, 'loss/value_avg': 0.5992496013641357, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0187989473342896, 'val/ratio': 0.9926817417144775, 'val/ratio_var': 4.577252911985852e-05, 'val/num_eos_tokens': 0, 'lr': 1.8912297893189616e-05, 'episode': 5080, 'epoch': 0.62}



                                                                         
 62%|██████▏   | 1271/2041 [1:50:21<1:06:19,  5.17s/it]

{'eps': 0, 'objective/kl': 83.06503295898438, 'objective/entropy': 38.29534912109375, 'objective/non_score_reward': -4.153252124786377, 'objective/rlhf_reward': -5.26894474029541, 'objective/scores': -1.1156924962997437, 'policy/approxkl_avg': 0.012094239704310894, 'policy/clipfrac_avg': 0.08726415038108826, 'loss/policy_avg': -0.03152330964803696, 'loss/value_avg': 0.4392580986022949, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8642679452896118, 'val/ratio': 1.0002315044403076, 'val/ratio_var': 4.5461464992513356e-07, 'val/num_eos_tokens': 0, 'lr': 1.888780009799118e-05, 'episode': 5084, 'epoch': 0.62}



                                                                         
 62%|██████▏   | 1272/2041 [1:50:27<1:06:27,  5.18s/it]

{'eps': 0, 'objective/kl': 94.35968017578125, 'objective/entropy': 49.10477828979492, 'objective/non_score_reward': -4.717984199523926, 'objective/rlhf_reward': -4.577163219451904, 'objective/scores': 0.14082108438014984, 'policy/approxkl_avg': 0.013492260128259659, 'policy/clipfrac_avg': 0.09551887214183807, 'loss/policy_avg': -0.030704574659466743, 'loss/value_avg': 0.5724079608917236, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.994350254535675, 'val/ratio': 1.0002987384796143, 'val/ratio_var': 8.741249644117488e-07, 'val/num_eos_tokens': 0, 'lr': 1.886330230279275e-05, 'episode': 5088, 'epoch': 0.62}



                                                                         
 62%|██████▏   | 1273/2041 [1:50:32<1:06:38,  5.21s/it]

{'eps': 0, 'objective/kl': 86.7281265258789, 'objective/entropy': 56.6611213684082, 'objective/non_score_reward': -4.336406230926514, 'objective/rlhf_reward': -4.91341495513916, 'objective/scores': -0.577008843421936, 'policy/approxkl_avg': 0.022939415648579597, 'policy/clipfrac_avg': 0.10613207519054413, 'loss/policy_avg': -0.033578451722860336, 'loss/value_avg': 0.7620497345924377, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.098933458328247, 'val/ratio': 1.0066721439361572, 'val/ratio_var': 4.7783581976545975e-05, 'val/num_eos_tokens': 0, 'lr': 1.8838804507594317e-05, 'episode': 5092, 'epoch': 0.62}



                                                                         
 62%|██████▏   | 1274/2041 [1:50:37<1:06:16,  5.18s/it]

{'eps': 0, 'objective/kl': 90.71263122558594, 'objective/entropy': 61.167537689208984, 'objective/non_score_reward': -4.5356316566467285, 'objective/rlhf_reward': -5.310068130493164, 'objective/scores': -0.7744367122650146, 'policy/approxkl_avg': 0.015450790524482727, 'policy/clipfrac_avg': 0.09551886469125748, 'loss/policy_avg': -0.03730364143848419, 'loss/value_avg': 0.6654324531555176, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1872520446777344, 'val/ratio': 0.9842823147773743, 'val/ratio_var': 0.00016649709141347557, 'val/num_eos_tokens': 0, 'lr': 1.8814306712395885e-05, 'episode': 5096, 'epoch': 0.62}



                                                                         
 62%|██████▏   | 1275/2041 [1:50:42<1:06:06,  5.18s/it]

{'eps': 0, 'objective/kl': 94.21855163574219, 'objective/entropy': 69.55259704589844, 'objective/non_score_reward': -4.710927486419678, 'objective/rlhf_reward': -5.111149787902832, 'objective/scores': -0.40022215247154236, 'policy/approxkl_avg': 0.014943130314350128, 'policy/clipfrac_avg': 0.10495283454656601, 'loss/policy_avg': -0.03805746138095856, 'loss/value_avg': 0.4922635555267334, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2408504486083984, 'val/ratio': 0.9839975833892822, 'val/ratio_var': 0.00018081713642459363, 'val/num_eos_tokens': 0, 'lr': 1.8789808917197453e-05, 'episode': 5100, 'epoch': 0.62}



                                                                         
 63%|██████▎   | 1276/2041 [1:50:47<1:06:03,  5.18s/it]

{'eps': 0, 'objective/kl': 100.23965454101562, 'objective/entropy': 72.50928497314453, 'objective/non_score_reward': -5.011981964111328, 'objective/rlhf_reward': -5.7559943199157715, 'objective/scores': -0.744012176990509, 'policy/approxkl_avg': 0.054510291665792465, 'policy/clipfrac_avg': 0.13089622557163239, 'loss/policy_avg': -0.04463617876172066, 'loss/value_avg': 0.6525251269340515, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.324995517730713, 'val/ratio': 1.1411938667297363, 'val/ratio_var': 0.009821695275604725, 'val/num_eos_tokens': 0, 'lr': 1.876531112199902e-05, 'episode': 5104, 'epoch': 0.63}



                                                                         
 63%|██████▎   | 1277/2041 [1:50:52<1:06:02,  5.19s/it]

{'eps': 0, 'objective/kl': 90.14987182617188, 'objective/entropy': 76.63118743896484, 'objective/non_score_reward': -4.50749397277832, 'objective/rlhf_reward': -4.619797229766846, 'objective/scores': -0.11230331659317017, 'policy/approxkl_avg': 0.028569035232067108, 'policy/clipfrac_avg': 0.12971697747707367, 'loss/policy_avg': -0.03898858278989792, 'loss/value_avg': 0.8900232315063477, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2769145965576172, 'val/ratio': 1.0215034484863281, 'val/ratio_var': 0.00020673377730417997, 'val/num_eos_tokens': 0, 'lr': 1.874081332680059e-05, 'episode': 5108, 'epoch': 0.63}



                                                                         
 63%|██████▎   | 1278/2041 [1:50:58<1:05:52,  5.18s/it]

{'eps': 0, 'objective/kl': 88.96492004394531, 'objective/entropy': 67.94110107421875, 'objective/non_score_reward': -4.448246002197266, 'objective/rlhf_reward': -4.570328712463379, 'objective/scores': -0.12208284437656403, 'policy/approxkl_avg': 0.028094759210944176, 'policy/clipfrac_avg': 0.09669811278581619, 'loss/policy_avg': -0.03141508251428604, 'loss/value_avg': 0.6374073624610901, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.287844181060791, 'val/ratio': 1.0409231185913086, 'val/ratio_var': 0.0007792197284288704, 'val/num_eos_tokens': 0, 'lr': 1.8716315531602157e-05, 'episode': 5112, 'epoch': 0.63}



                                                                         
 63%|██████▎   | 1279/2041 [1:51:03<1:05:45,  5.18s/it]

{'eps': 0, 'objective/kl': 90.49906158447266, 'objective/entropy': 64.92021942138672, 'objective/non_score_reward': -4.5249528884887695, 'objective/rlhf_reward': -5.294102191925049, 'objective/scores': -0.7691494226455688, 'policy/approxkl_avg': 0.013228525407612324, 'policy/clipfrac_avg': 0.09669811278581619, 'loss/policy_avg': -0.035714928060770035, 'loss/value_avg': 0.619451642036438, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2334098815917969, 'val/ratio': 0.9820035099983215, 'val/ratio_var': 0.0002106975734932348, 'val/num_eos_tokens': 0, 'lr': 1.8691817736403725e-05, 'episode': 5116, 'epoch': 0.63}



                                                                         
 63%|██████▎   | 1280/2041 [1:51:08<1:05:33,  5.17s/it]

{'eps': 0, 'objective/kl': 89.53121948242188, 'objective/entropy': 72.35905456542969, 'objective/non_score_reward': -4.476560592651367, 'objective/rlhf_reward': -4.7234039306640625, 'objective/scores': -0.24684351682662964, 'policy/approxkl_avg': 0.013275008648633957, 'policy/clipfrac_avg': 0.08726415038108826, 'loss/policy_avg': -0.03661423176527023, 'loss/value_avg': 0.4882321357727051, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.371716022491455, 'val/ratio': 0.9843120574951172, 'val/ratio_var': 0.00017032142204698175, 'val/num_eos_tokens': 0, 'lr': 1.866731994120529e-05, 'episode': 5120, 'epoch': 0.63}



                                                                         
 63%|██████▎   | 1281/2041 [1:51:13<1:05:14,  5.15s/it]

{'eps': 0, 'objective/kl': 87.79759979248047, 'objective/entropy': 61.949066162109375, 'objective/non_score_reward': -4.389880180358887, 'objective/rlhf_reward': -4.086517810821533, 'objective/scores': 0.3033621609210968, 'policy/approxkl_avg': 0.014412990771234035, 'policy/clipfrac_avg': 0.11556603759527206, 'loss/policy_avg': -0.038541279733181, 'loss/value_avg': 0.6201947331428528, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1681079864501953, 'val/ratio': 0.987891435623169, 'val/ratio_var': 9.056463750312105e-05, 'val/num_eos_tokens': 0, 'lr': 1.864282214600686e-05, 'episode': 5124, 'epoch': 0.63}



                                                                         
 63%|██████▎   | 1282/2041 [1:51:18<1:04:59,  5.14s/it]

{'eps': 0, 'objective/kl': 96.19244384765625, 'objective/entropy': 62.82865524291992, 'objective/non_score_reward': -4.809621810913086, 'objective/rlhf_reward': -3.9620440006256104, 'objective/scores': 0.8475777506828308, 'policy/approxkl_avg': 0.012553911656141281, 'policy/clipfrac_avg': 0.09905660897493362, 'loss/policy_avg': -0.03713627904653549, 'loss/value_avg': 0.611237645149231, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1348092555999756, 'val/ratio': 0.9800059199333191, 'val/ratio_var': 0.00028003775514662266, 'val/num_eos_tokens': 0, 'lr': 1.861832435080843e-05, 'episode': 5128, 'epoch': 0.63}



                                                                         
 63%|██████▎   | 1283/2041 [1:51:23<1:04:53,  5.14s/it]

{'eps': 0, 'objective/kl': 93.44355773925781, 'objective/entropy': 69.32870483398438, 'objective/non_score_reward': -4.672177314758301, 'objective/rlhf_reward': -4.4589152336120605, 'objective/scores': 0.21326199173927307, 'policy/approxkl_avg': 0.027194710448384285, 'policy/clipfrac_avg': 0.1108490526676178, 'loss/policy_avg': -0.036648306995630264, 'loss/value_avg': 0.4939712882041931, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2796710729599, 'val/ratio': 1.0477311611175537, 'val/ratio_var': 0.003681678557768464, 'val/num_eos_tokens': 0, 'lr': 1.8593826555609997e-05, 'episode': 5132, 'epoch': 0.63}



                                                                         
 63%|██████▎   | 1284/2041 [1:51:29<1:04:58,  5.15s/it]

{'eps': 0, 'objective/kl': 113.07034301757812, 'objective/entropy': 70.58779907226562, 'objective/non_score_reward': -5.653517246246338, 'objective/rlhf_reward': -5.39156436920166, 'objective/scores': 0.26195287704467773, 'policy/approxkl_avg': 0.01653752475976944, 'policy/clipfrac_avg': 0.10495282709598541, 'loss/policy_avg': -0.03776257857680321, 'loss/value_avg': 0.5959087610244751, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4388657808303833, 'val/ratio': 0.9817984104156494, 'val/ratio_var': 0.00021275108156260103, 'val/num_eos_tokens': 0, 'lr': 1.8569328760411565e-05, 'episode': 5136, 'epoch': 0.63}



                                                                         
 63%|██████▎   | 1285/2041 [1:51:34<1:04:59,  5.16s/it]

{'eps': 0, 'objective/kl': 104.52560424804688, 'objective/entropy': 85.34585571289062, 'objective/non_score_reward': -5.226280212402344, 'objective/rlhf_reward': -5.747279167175293, 'objective/scores': -0.5209987759590149, 'policy/approxkl_avg': 0.040542349219322205, 'policy/clipfrac_avg': 0.1603773534297943, 'loss/policy_avg': -0.050325945019721985, 'loss/value_avg': 0.44027459621429443, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.6341638565063477, 'val/ratio': 1.006842851638794, 'val/ratio_var': 0.00010750161163741723, 'val/num_eos_tokens': 0, 'lr': 1.854483096521313e-05, 'episode': 5140, 'epoch': 0.63}



                                                                         
 63%|██████▎   | 1286/2041 [1:51:39<1:04:52,  5.16s/it]

{'eps': 0, 'objective/kl': 98.0870590209961, 'objective/entropy': 73.38822937011719, 'objective/non_score_reward': -4.904353141784668, 'objective/rlhf_reward': -5.134218215942383, 'objective/scores': -0.22986510396003723, 'policy/approxkl_avg': 0.01907806284725666, 'policy/clipfrac_avg': 0.1179245263338089, 'loss/policy_avg': -0.04637894406914711, 'loss/value_avg': 0.7121899127960205, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4511100053787231, 'val/ratio': 0.980850100517273, 'val/ratio_var': 0.00025071913842111826, 'val/num_eos_tokens': 0, 'lr': 1.85203331700147e-05, 'episode': 5144, 'epoch': 0.63}



                                                                         
 63%|██████▎   | 1287/2041 [1:51:44<1:04:43,  5.15s/it]

{'eps': 0, 'objective/kl': 104.5677490234375, 'objective/entropy': 85.41056060791016, 'objective/non_score_reward': -5.228387832641602, 'objective/rlhf_reward': -5.637699127197266, 'objective/scores': -0.40931129455566406, 'policy/approxkl_avg': 0.016374723985791206, 'policy/clipfrac_avg': 0.12028301507234573, 'loss/policy_avg': -0.0434633269906044, 'loss/value_avg': 0.6399686336517334, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.8147176504135132, 'val/ratio': 0.9680253267288208, 'val/ratio_var': 0.000759120739530772, 'val/num_eos_tokens': 0, 'lr': 1.8495835374816266e-05, 'episode': 5148, 'epoch': 0.63}



                                                                         
 63%|██████▎   | 1288/2041 [1:51:49<1:04:31,  5.14s/it]

{'eps': 0, 'objective/kl': 106.92951202392578, 'objective/entropy': 87.63081359863281, 'objective/non_score_reward': -5.346476078033447, 'objective/rlhf_reward': -5.002567291259766, 'objective/scores': 0.34390899538993835, 'policy/approxkl_avg': 0.02120370976626873, 'policy/clipfrac_avg': 0.16155660152435303, 'loss/policy_avg': -0.051171302795410156, 'loss/value_avg': 0.6134567856788635, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 2.0182251930236816, 'val/ratio': 0.9738852977752686, 'val/ratio_var': 0.0004412243142724037, 'val/num_eos_tokens': 0, 'lr': 1.8471337579617834e-05, 'episode': 5152, 'epoch': 0.63}



                                                                         
 63%|██████▎   | 1289/2041 [1:51:54<1:04:10,  5.12s/it]

{'eps': 0, 'objective/kl': 118.0210952758789, 'objective/entropy': 114.69723510742188, 'objective/non_score_reward': -5.901055335998535, 'objective/rlhf_reward': -5.725209712982178, 'objective/scores': 0.1758454144001007, 'policy/approxkl_avg': 0.016652394086122513, 'policy/clipfrac_avg': 0.14150942862033844, 'loss/policy_avg': -0.04831106215715408, 'loss/value_avg': 0.5955145359039307, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 2.0873823165893555, 'val/ratio': 0.982749342918396, 'val/ratio_var': 0.00021508285135496408, 'val/num_eos_tokens': 0, 'lr': 1.8446839784419402e-05, 'episode': 5156, 'epoch': 0.63}



                                                                         
 63%|██████▎   | 1290/2041 [1:51:59<1:04:17,  5.14s/it]

{'eps': 0, 'objective/kl': 113.09648132324219, 'objective/entropy': 96.49796295166016, 'objective/non_score_reward': -5.654824256896973, 'objective/rlhf_reward': -5.940552234649658, 'objective/scores': -0.28572797775268555, 'policy/approxkl_avg': 0.023737799376249313, 'policy/clipfrac_avg': 0.1521226465702057, 'loss/policy_avg': -0.049195852130651474, 'loss/value_avg': 0.6878112554550171, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.86765718460083, 'val/ratio': 1.0096336603164673, 'val/ratio_var': 0.00022751705546397716, 'val/num_eos_tokens': 0, 'lr': 1.842234198922097e-05, 'episode': 5160, 'epoch': 0.63}



                                                                         
 63%|██████▎   | 1291/2041 [1:52:05<1:04:35,  5.17s/it]

{'eps': 0, 'objective/kl': 109.72630310058594, 'objective/entropy': 100.83071899414062, 'objective/non_score_reward': -5.486315727233887, 'objective/rlhf_reward': -5.876753330230713, 'objective/scores': -0.39043760299682617, 'policy/approxkl_avg': 0.012396078556776047, 'policy/clipfrac_avg': 0.11910376697778702, 'loss/policy_avg': -0.04644783213734627, 'loss/value_avg': 0.5203942060470581, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 2.0570919513702393, 'val/ratio': 0.9879612326622009, 'val/ratio_var': 0.00010367009235778823, 'val/num_eos_tokens': 0, 'lr': 1.839784419402254e-05, 'episode': 5164, 'epoch': 0.63}



                                                                         
 63%|██████▎   | 1292/2041 [1:52:10<1:04:34,  5.17s/it]

{'eps': 0, 'objective/kl': 105.20799255371094, 'objective/entropy': 95.01832580566406, 'objective/non_score_reward': -5.26039981842041, 'objective/rlhf_reward': -5.726202487945557, 'objective/scores': -0.46580255031585693, 'policy/approxkl_avg': 0.01856827922165394, 'policy/clipfrac_avg': 0.13089622557163239, 'loss/policy_avg': -0.039238739758729935, 'loss/value_avg': 0.45027679204940796, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.7757225036621094, 'val/ratio': 0.9926283359527588, 'val/ratio_var': 2.9153079594834708e-05, 'val/num_eos_tokens': 0, 'lr': 1.8373346398824106e-05, 'episode': 5168, 'epoch': 0.63}



                                                                         
 63%|██████▎   | 1293/2041 [1:52:15<1:04:19,  5.16s/it]

{'eps': 0, 'objective/kl': 109.97615051269531, 'objective/entropy': 101.02412414550781, 'objective/non_score_reward': -5.498807430267334, 'objective/rlhf_reward': -5.5542449951171875, 'objective/scores': -0.05543772876262665, 'policy/approxkl_avg': 0.0152046550065279, 'policy/clipfrac_avg': 0.13561320304870605, 'loss/policy_avg': -0.0432482548058033, 'loss/value_avg': 0.5853439569473267, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.8816676139831543, 'val/ratio': 0.9751722812652588, 'val/ratio_var': 0.00042624110938049853, 'val/num_eos_tokens': 0, 'lr': 1.8348848603625674e-05, 'episode': 5172, 'epoch': 0.63}



                                                                         
 63%|██████▎   | 1294/2041 [1:52:20<1:04:19,  5.17s/it]

{'eps': 0, 'objective/kl': 113.78575134277344, 'objective/entropy': 89.456298828125, 'objective/non_score_reward': -5.689288139343262, 'objective/rlhf_reward': -5.7147603034973145, 'objective/scores': -0.025472283363342285, 'policy/approxkl_avg': 0.013293354772031307, 'policy/clipfrac_avg': 0.1179245263338089, 'loss/policy_avg': -0.04018115624785423, 'loss/value_avg': 0.6070617437362671, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.6844220161437988, 'val/ratio': 0.9967644214630127, 'val/ratio_var': 6.0339662013575435e-06, 'val/num_eos_tokens': 0, 'lr': 1.8324350808427242e-05, 'episode': 5176, 'epoch': 0.63}



                                                                         
 63%|██████▎   | 1295/2041 [1:52:25<1:04:05,  5.15s/it]

{'eps': 0, 'objective/kl': 99.01715087890625, 'objective/entropy': 66.0793228149414, 'objective/non_score_reward': -4.950857162475586, 'objective/rlhf_reward': -5.827682018280029, 'objective/scores': -0.8768250346183777, 'policy/approxkl_avg': 0.010825409553945065, 'policy/clipfrac_avg': 0.09669812023639679, 'loss/policy_avg': -0.0330890528857708, 'loss/value_avg': 0.5314237475395203, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5298171043395996, 'val/ratio': 0.9815984964370728, 'val/ratio_var': 0.0002722958743106574, 'val/num_eos_tokens': 0, 'lr': 1.829985301322881e-05, 'episode': 5180, 'epoch': 0.63}



                                                                         
 63%|██████▎   | 1296/2041 [1:52:30<1:03:49,  5.14s/it]

{'eps': 0, 'objective/kl': 105.94853210449219, 'objective/entropy': 73.11920166015625, 'objective/non_score_reward': -5.297426700592041, 'objective/rlhf_reward': -5.335356712341309, 'objective/scores': -0.037929974496364594, 'policy/approxkl_avg': 0.01073635183274746, 'policy/clipfrac_avg': 0.09905660897493362, 'loss/policy_avg': -0.03298947587609291, 'loss/value_avg': 0.314365029335022, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5783557891845703, 'val/ratio': 0.9900970458984375, 'val/ratio_var': 6.875085091451183e-05, 'val/num_eos_tokens': 0, 'lr': 1.827535521803038e-05, 'episode': 5184, 'epoch': 0.64}



                                                                         
 64%|██████▎   | 1297/2041 [1:52:36<1:04:07,  5.17s/it]

{'eps': 0, 'objective/kl': 111.91876220703125, 'objective/entropy': 88.82797241210938, 'objective/non_score_reward': -5.595938205718994, 'objective/rlhf_reward': -5.5363383293151855, 'objective/scores': 0.05959978699684143, 'policy/approxkl_avg': 0.025139465928077698, 'policy/clipfrac_avg': 0.1320754736661911, 'loss/policy_avg': -0.043295539915561676, 'loss/value_avg': 0.7246517539024353, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4947922229766846, 'val/ratio': 0.9688116312026978, 'val/ratio_var': 0.0006792408530600369, 'val/num_eos_tokens': 0, 'lr': 1.8250857422831946e-05, 'episode': 5188, 'epoch': 0.64}



                                                                         
 64%|██████▎   | 1298/2041 [1:52:41<1:03:48,  5.15s/it]

{'eps': 0, 'objective/kl': 118.80282592773438, 'objective/entropy': 100.00873565673828, 'objective/non_score_reward': -5.940140724182129, 'objective/rlhf_reward': -6.798395156860352, 'objective/scores': -0.8582542538642883, 'policy/approxkl_avg': 0.011983967386186123, 'policy/clipfrac_avg': 0.10495282709598541, 'loss/policy_avg': -0.038003310561180115, 'loss/value_avg': 0.5972832441329956, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.8927831649780273, 'val/ratio': 0.9953104257583618, 'val/ratio_var': 1.1079414434789214e-05, 'val/num_eos_tokens': 0, 'lr': 1.8226359627633514e-05, 'episode': 5192, 'epoch': 0.64}



                                                                         
 64%|██████▎   | 1299/2041 [1:52:46<1:03:30,  5.13s/it]

{'eps': 0, 'objective/kl': 112.53067016601562, 'objective/entropy': 82.83412170410156, 'objective/non_score_reward': -5.626533508300781, 'objective/rlhf_reward': -6.107151985168457, 'objective/scores': -0.48061859607696533, 'policy/approxkl_avg': 0.010186681523919106, 'policy/clipfrac_avg': 0.09551887214183807, 'loss/policy_avg': -0.03848861902952194, 'loss/value_avg': 0.6234458684921265, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.6969780921936035, 'val/ratio': 0.9812751412391663, 'val/ratio_var': 0.0002977409167215228, 'val/num_eos_tokens': 0, 'lr': 1.8201861832435082e-05, 'episode': 5196, 'epoch': 0.64}



                                                                         
 64%|██████▎   | 1300/2041 [1:52:51<1:03:30,  5.14s/it]

{'eps': 0, 'objective/kl': 104.34712219238281, 'objective/entropy': 101.8741683959961, 'objective/non_score_reward': -5.2173566818237305, 'objective/rlhf_reward': -5.661600589752197, 'objective/scores': -0.44424405694007874, 'policy/approxkl_avg': 0.014682694338262081, 'policy/clipfrac_avg': 0.10849056392908096, 'loss/policy_avg': -0.043500956147909164, 'loss/value_avg': 0.6961035132408142, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.8845640420913696, 'val/ratio': 0.9961169362068176, 'val/ratio_var': 1.0046331226476468e-05, 'val/num_eos_tokens': 0, 'lr': 1.817736403723665e-05, 'episode': 5200, 'epoch': 0.64}



                                                                         
 64%|██████▎   | 1301/2041 [1:52:56<1:03:37,  5.16s/it]

{'eps': 0, 'objective/kl': 110.92454528808594, 'objective/entropy': 101.47034454345703, 'objective/non_score_reward': -5.54622745513916, 'objective/rlhf_reward': -6.951318740844727, 'objective/scores': -1.4050910472869873, 'policy/approxkl_avg': 0.013739294372498989, 'policy/clipfrac_avg': 0.09905660152435303, 'loss/policy_avg': -0.03712315112352371, 'loss/value_avg': 0.8191156387329102, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.9172545671463013, 'val/ratio': 0.9671761989593506, 'val/ratio_var': 0.0008693470736034214, 'val/num_eos_tokens': 0, 'lr': 1.8152866242038215e-05, 'episode': 5204, 'epoch': 0.64}



                                                                         
 64%|██████▍   | 1302/2041 [1:53:01<1:03:39,  5.17s/it]

{'eps': 0, 'objective/kl': 124.25989532470703, 'objective/entropy': 108.78972625732422, 'objective/non_score_reward': -6.212994575500488, 'objective/rlhf_reward': -5.972836971282959, 'objective/scores': 0.24015776813030243, 'policy/approxkl_avg': 0.01349168922752142, 'policy/clipfrac_avg': 0.12264151871204376, 'loss/policy_avg': -0.044565316289663315, 'loss/value_avg': 0.6302199363708496, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.9471849203109741, 'val/ratio': 0.986488938331604, 'val/ratio_var': 0.0001276008115382865, 'val/num_eos_tokens': 0, 'lr': 1.8128368446839787e-05, 'episode': 5208, 'epoch': 0.64}



                                                                         
 64%|██████▍   | 1303/2041 [1:53:06<1:03:23,  5.15s/it]

{'eps': 0, 'objective/kl': 120.296630859375, 'objective/entropy': 99.03883361816406, 'objective/non_score_reward': -6.01483154296875, 'objective/rlhf_reward': -6.563812732696533, 'objective/scores': -0.548981249332428, 'policy/approxkl_avg': 0.014069904573261738, 'policy/clipfrac_avg': 0.12028302252292633, 'loss/policy_avg': -0.04440158233046532, 'loss/value_avg': 0.7094223499298096, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.9034080505371094, 'val/ratio': 0.985235333442688, 'val/ratio_var': 0.00013714212400373071, 'val/num_eos_tokens': 0, 'lr': 1.810387065164135e-05, 'episode': 5212, 'epoch': 0.64}



                                                                         
 64%|██████▍   | 1304/2041 [1:53:12<1:03:08,  5.14s/it]

{'eps': 0, 'objective/kl': 115.3800048828125, 'objective/entropy': 86.12057495117188, 'objective/non_score_reward': -5.76900053024292, 'objective/rlhf_reward': -6.006115913391113, 'objective/scores': -0.23711523413658142, 'policy/approxkl_avg': 0.015820877626538277, 'policy/clipfrac_avg': 0.10613207519054413, 'loss/policy_avg': -0.039855167269706726, 'loss/value_avg': 0.6929212212562561, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.506122350692749, 'val/ratio': 0.9874950647354126, 'val/ratio_var': 0.00010302224109182134, 'val/num_eos_tokens': 0, 'lr': 1.8079372856442923e-05, 'episode': 5216, 'epoch': 0.64}



                                                                         
 64%|██████▍   | 1305/2041 [1:53:17<1:02:48,  5.12s/it]

{'eps': 0, 'objective/kl': 122.98799133300781, 'objective/entropy': 85.47669982910156, 'objective/non_score_reward': -6.149399757385254, 'objective/rlhf_reward': -6.798438549041748, 'objective/scores': -0.6490388512611389, 'policy/approxkl_avg': 0.017355844378471375, 'policy/clipfrac_avg': 0.14268867671489716, 'loss/policy_avg': -0.046012748032808304, 'loss/value_avg': 0.6697384119033813, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.6941453218460083, 'val/ratio': 0.9659173488616943, 'val/ratio_var': 0.0009689561557024717, 'val/num_eos_tokens': 0, 'lr': 1.8054875061244487e-05, 'episode': 5220, 'epoch': 0.64}



                                                                         
 64%|██████▍   | 1306/2041 [1:53:22<1:02:44,  5.12s/it]

{'eps': 0, 'objective/kl': 108.15260314941406, 'objective/entropy': 78.16316223144531, 'objective/non_score_reward': -5.407630920410156, 'objective/rlhf_reward': -6.613620758056641, 'objective/scores': -1.2059895992279053, 'policy/approxkl_avg': 0.01224728673696518, 'policy/clipfrac_avg': 0.10966981202363968, 'loss/policy_avg': -0.04002871736884117, 'loss/value_avg': 0.5898908376693726, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5487830638885498, 'val/ratio': 0.9883084297180176, 'val/ratio_var': 0.00010288392513757572, 'val/num_eos_tokens': 0, 'lr': 1.8030377266046055e-05, 'episode': 5224, 'epoch': 0.64}



                                                                         
 64%|██████▍   | 1307/2041 [1:53:27<1:02:35,  5.12s/it]

{'eps': 0, 'objective/kl': 96.65373229980469, 'objective/entropy': 69.52507019042969, 'objective/non_score_reward': -4.832686901092529, 'objective/rlhf_reward': -6.332846164703369, 'objective/scores': -1.5001592636108398, 'policy/approxkl_avg': 0.019090011715888977, 'policy/clipfrac_avg': 0.12971699237823486, 'loss/policy_avg': -0.042660485953092575, 'loss/value_avg': 0.506998598575592, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2311742305755615, 'val/ratio': 1.0107245445251465, 'val/ratio_var': 0.00025910366093739867, 'val/num_eos_tokens': 0, 'lr': 1.8005879470847627e-05, 'episode': 5228, 'epoch': 0.64}



                                                                         
 64%|██████▍   | 1308/2041 [1:53:32<1:02:27,  5.11s/it]

{'eps': 0, 'objective/kl': 97.5289306640625, 'objective/entropy': 86.30567932128906, 'objective/non_score_reward': -4.876446723937988, 'objective/rlhf_reward': -4.879185676574707, 'objective/scores': -0.002738863229751587, 'policy/approxkl_avg': 0.011483919806778431, 'policy/clipfrac_avg': 0.0966981053352356, 'loss/policy_avg': -0.03545606881380081, 'loss/value_avg': 0.5278863906860352, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.680345058441162, 'val/ratio': 0.9973044991493225, 'val/ratio_var': 2.2807564164395444e-05, 'val/num_eos_tokens': 0, 'lr': 1.798138167564919e-05, 'episode': 5232, 'epoch': 0.64}



                                                                         
 64%|██████▍   | 1309/2041 [1:53:37<1:02:17,  5.11s/it]

{'eps': 0, 'objective/kl': 102.28672790527344, 'objective/entropy': 122.00020599365234, 'objective/non_score_reward': -5.114336967468262, 'objective/rlhf_reward': -6.560742378234863, 'objective/scores': -1.4464056491851807, 'policy/approxkl_avg': 0.009864065796136856, 'policy/clipfrac_avg': 0.12735848128795624, 'loss/policy_avg': -0.04393000900745392, 'loss/value_avg': 0.5820285081863403, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 2.059760570526123, 'val/ratio': 0.9899904727935791, 'val/ratio_var': 7.019419717835262e-05, 'val/num_eos_tokens': 0, 'lr': 1.7956883880450763e-05, 'episode': 5236, 'epoch': 0.64}



                                                                         
 64%|██████▍   | 1310/2041 [1:53:42<1:02:15,  5.11s/it]

{'eps': 0, 'objective/kl': 111.47210693359375, 'objective/entropy': 75.401123046875, 'objective/non_score_reward': -5.573605537414551, 'objective/rlhf_reward': -6.186696529388428, 'objective/scores': -0.613090991973877, 'policy/approxkl_avg': 0.04591792821884155, 'policy/clipfrac_avg': 0.09316038340330124, 'loss/policy_avg': -0.03222619369626045, 'loss/value_avg': 0.6081357598304749, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4854762554168701, 'val/ratio': 1.464462161064148, 'val/ratio_var': 0.34976303577423096, 'val/num_eos_tokens': 0, 'lr': 1.7932386085252328e-05, 'episode': 5240, 'epoch': 0.64}



                                                                         
 64%|██████▍   | 1311/2041 [1:53:47<1:02:29,  5.14s/it]

{'eps': 0, 'objective/kl': 106.80770874023438, 'objective/entropy': 87.58003234863281, 'objective/non_score_reward': -5.340385437011719, 'objective/rlhf_reward': -5.551604747772217, 'objective/scores': -0.21121934056282043, 'policy/approxkl_avg': 0.017842020839452744, 'policy/clipfrac_avg': 0.13679245114326477, 'loss/policy_avg': -0.04484603554010391, 'loss/value_avg': 0.4926932752132416, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5294923782348633, 'val/ratio': 1.0007929801940918, 'val/ratio_var': 2.170059588024742e-06, 'val/num_eos_tokens': 0, 'lr': 1.7907888290053896e-05, 'episode': 5244, 'epoch': 0.64}



                                                                         
 64%|██████▍   | 1312/2041 [1:53:52<1:02:14,  5.12s/it]

{'eps': 0, 'objective/kl': 104.01167297363281, 'objective/entropy': 87.76712799072266, 'objective/non_score_reward': -5.2005839347839355, 'objective/rlhf_reward': -6.178896903991699, 'objective/scores': -0.9783127307891846, 'policy/approxkl_avg': 0.012086557224392891, 'policy/clipfrac_avg': 0.11320754885673523, 'loss/policy_avg': -0.0414603054523468, 'loss/value_avg': 0.5215829014778137, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.7486662864685059, 'val/ratio': 0.9853100180625916, 'val/ratio_var': 0.00015363232523668557, 'val/num_eos_tokens': 0, 'lr': 1.7883390494855464e-05, 'episode': 5248, 'epoch': 0.64}



                                                                         
 64%|██████▍   | 1313/2041 [1:53:58<1:02:09,  5.12s/it]

{'eps': 0, 'objective/kl': 131.22634887695312, 'objective/entropy': 100.70423889160156, 'objective/non_score_reward': -6.561318397521973, 'objective/rlhf_reward': -6.914671897888184, 'objective/scores': -0.35335367918014526, 'policy/approxkl_avg': 0.0089269345626235, 'policy/clipfrac_avg': 0.0966981053352356, 'loss/policy_avg': -0.03842191770672798, 'loss/value_avg': 0.8030917644500732, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.9089887142181396, 'val/ratio': 0.9859793782234192, 'val/ratio_var': 0.00017596909310668707, 'val/num_eos_tokens': 0, 'lr': 1.7858892699657032e-05, 'episode': 5252, 'epoch': 0.64}



                                                                         
 64%|██████▍   | 1314/2041 [1:54:03<1:02:15,  5.14s/it]

{'eps': 0, 'objective/kl': 117.88699340820312, 'objective/entropy': 85.2488784790039, 'objective/non_score_reward': -5.894350051879883, 'objective/rlhf_reward': -7.310885429382324, 'objective/scores': -1.4165356159210205, 'policy/approxkl_avg': 0.013937756419181824, 'policy/clipfrac_avg': 0.1108490601181984, 'loss/policy_avg': -0.04321656376123428, 'loss/value_avg': 0.677330732345581, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.7713040113449097, 'val/ratio': 0.9922254085540771, 'val/ratio_var': 0.00014121974527370185, 'val/num_eos_tokens': 0, 'lr': 1.78343949044586e-05, 'episode': 5256, 'epoch': 0.64}



                                                                         
 64%|██████▍   | 1315/2041 [1:54:08<1:01:54,  5.12s/it]

{'eps': 0, 'objective/kl': 118.0020751953125, 'objective/entropy': 87.76168823242188, 'objective/non_score_reward': -5.900103569030762, 'objective/rlhf_reward': -6.767049789428711, 'objective/scores': -0.8669461011886597, 'policy/approxkl_avg': 0.013556966558098793, 'policy/clipfrac_avg': 0.10495282709598541, 'loss/policy_avg': -0.03744968771934509, 'loss/value_avg': 0.6661449074745178, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.6896520853042603, 'val/ratio': 0.9640810489654541, 'val/ratio_var': 0.0010665240697562695, 'val/num_eos_tokens': 0, 'lr': 1.7809897109260168e-05, 'episode': 5260, 'epoch': 0.64}



                                                                         
 64%|██████▍   | 1316/2041 [1:54:13<1:01:48,  5.12s/it]

{'eps': 0, 'objective/kl': 91.53773498535156, 'objective/entropy': 58.1990966796875, 'objective/non_score_reward': -4.5768866539001465, 'objective/rlhf_reward': -5.226961612701416, 'objective/scores': -0.6500749588012695, 'policy/approxkl_avg': 0.023426949977874756, 'policy/clipfrac_avg': 0.12382075190544128, 'loss/policy_avg': -0.04238532483577728, 'loss/value_avg': 0.5935549139976501, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1491286754608154, 'val/ratio': 1.0048716068267822, 'val/ratio_var': 1.277795308851637e-05, 'val/num_eos_tokens': 0, 'lr': 1.7785399314061736e-05, 'episode': 5264, 'epoch': 0.64}



                                                                         
 65%|██████▍   | 1317/2041 [1:54:18<1:01:40,  5.11s/it]

{'eps': 0, 'objective/kl': 95.23353576660156, 'objective/entropy': 72.9651870727539, 'objective/non_score_reward': -4.761676788330078, 'objective/rlhf_reward': -6.51597785949707, 'objective/scores': -1.754300832748413, 'policy/approxkl_avg': 0.02468068338930607, 'policy/clipfrac_avg': 0.11202830821275711, 'loss/policy_avg': -0.04185650497674942, 'loss/value_avg': 0.518273115158081, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3113113641738892, 'val/ratio': 1.0331634283065796, 'val/ratio_var': 0.0008149389177560806, 'val/num_eos_tokens': 0, 'lr': 1.7760901518863304e-05, 'episode': 5268, 'epoch': 0.65}



                                                                         
 65%|██████▍   | 1318/2041 [1:54:23<1:01:41,  5.12s/it]

{'eps': 0, 'objective/kl': 112.73876953125, 'objective/entropy': 103.82376098632812, 'objective/non_score_reward': -5.636938571929932, 'objective/rlhf_reward': -6.8194661140441895, 'objective/scores': -1.1825275421142578, 'policy/approxkl_avg': 0.020550737157464027, 'policy/clipfrac_avg': 0.12735849618911743, 'loss/policy_avg': -0.042765695601701736, 'loss/value_avg': 1.5201389789581299, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.603527545928955, 'val/ratio': 0.9956324100494385, 'val/ratio_var': 1.3590183698397595e-05, 'val/num_eos_tokens': 0, 'lr': 1.7736403723664872e-05, 'episode': 5272, 'epoch': 0.65}



                                                                         
 65%|██████▍   | 1319/2041 [1:54:28<1:01:40,  5.13s/it]

{'eps': 0, 'objective/kl': 117.02584838867188, 'objective/entropy': 93.38520812988281, 'objective/non_score_reward': -5.851292610168457, 'objective/rlhf_reward': -6.334842681884766, 'objective/scores': -0.4835503101348877, 'policy/approxkl_avg': 0.011748668737709522, 'policy/clipfrac_avg': 0.11556603759527206, 'loss/policy_avg': -0.04552172124385834, 'loss/value_avg': 0.5125593543052673, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.6676050424575806, 'val/ratio': 0.9798429608345032, 'val/ratio_var': 0.00032845139503479004, 'val/num_eos_tokens': 0, 'lr': 1.7711905928466437e-05, 'episode': 5276, 'epoch': 0.65}



                                                                         
 65%|██████▍   | 1320/2041 [1:54:34<1:02:08,  5.17s/it]

{'eps': 0, 'objective/kl': 114.48329162597656, 'objective/entropy': 81.14794921875, 'objective/non_score_reward': -5.724164962768555, 'objective/rlhf_reward': -6.9045305252075195, 'objective/scores': -1.1803653240203857, 'policy/approxkl_avg': 0.012525921687483788, 'policy/clipfrac_avg': 0.11438679695129395, 'loss/policy_avg': -0.044878698885440826, 'loss/value_avg': 0.6257712244987488, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4101271629333496, 'val/ratio': 0.9885478019714355, 'val/ratio_var': 8.443315891781822e-05, 'val/num_eos_tokens': 0, 'lr': 1.7687408133268008e-05, 'episode': 5280, 'epoch': 0.65}



                                                                         
 65%|██████▍   | 1321/2041 [1:54:39<1:02:04,  5.17s/it]

{'eps': 0, 'objective/kl': 103.69770050048828, 'objective/entropy': 52.32494354248047, 'objective/non_score_reward': -5.184885025024414, 'objective/rlhf_reward': -6.3595194816589355, 'objective/scores': -1.174634337425232, 'policy/approxkl_avg': 0.009190227836370468, 'policy/clipfrac_avg': 0.08608490228652954, 'loss/policy_avg': -0.03629765659570694, 'loss/value_avg': 0.4134035110473633, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1145323514938354, 'val/ratio': 0.9839272499084473, 'val/ratio_var': 0.00018465628090780228, 'val/num_eos_tokens': 0, 'lr': 1.7662910338069573e-05, 'episode': 5284, 'epoch': 0.65}



                                                                         
 65%|██████▍   | 1322/2041 [1:54:44<1:01:44,  5.15s/it]

{'eps': 0, 'objective/kl': 118.1858901977539, 'objective/entropy': 91.34346008300781, 'objective/non_score_reward': -5.909295082092285, 'objective/rlhf_reward': -7.332281112670898, 'objective/scores': -1.4229862689971924, 'policy/approxkl_avg': 0.015079742297530174, 'policy/clipfrac_avg': 0.10023584961891174, 'loss/policy_avg': -0.037848617881536484, 'loss/value_avg': 0.8216269016265869, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.6669950485229492, 'val/ratio': 0.9858301877975464, 'val/ratio_var': 0.00015085386985447258, 'val/num_eos_tokens': 0, 'lr': 1.763841254287114e-05, 'episode': 5288, 'epoch': 0.65}



                                                                         
 65%|██████▍   | 1323/2041 [1:54:49<1:01:45,  5.16s/it]

{'eps': 0, 'objective/kl': 87.05683898925781, 'objective/entropy': 39.2130012512207, 'objective/non_score_reward': -4.352841854095459, 'objective/rlhf_reward': -5.955268859863281, 'objective/scores': -1.6024270057678223, 'policy/approxkl_avg': 0.00978703610599041, 'policy/clipfrac_avg': 0.06603773683309555, 'loss/policy_avg': -0.02971000038087368, 'loss/value_avg': 0.43788963556289673, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7858039140701294, 'val/ratio': 1.0007984638214111, 'val/ratio_var': 6.091153750276135e-07, 'val/num_eos_tokens': 0, 'lr': 1.7613914747672712e-05, 'episode': 5292, 'epoch': 0.65}



                                                                         
 65%|██████▍   | 1324/2041 [1:54:54<1:01:50,  5.17s/it]

{'eps': 0, 'objective/kl': 115.30686950683594, 'objective/entropy': 65.2392578125, 'objective/non_score_reward': -5.76534366607666, 'objective/rlhf_reward': -6.226842880249023, 'objective/scores': -0.4614989757537842, 'policy/approxkl_avg': 0.012816566973924637, 'policy/clipfrac_avg': 0.08844339847564697, 'loss/policy_avg': -0.034900277853012085, 'loss/value_avg': 0.68727707862854, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2117935419082642, 'val/ratio': 0.9960805177688599, 'val/ratio_var': 1.0245520570606459e-05, 'val/num_eos_tokens': 0, 'lr': 1.7589416952474277e-05, 'episode': 5296, 'epoch': 0.65}



                                                                         
 65%|██████▍   | 1325/2041 [1:54:59<1:01:43,  5.17s/it]

{'eps': 0, 'objective/kl': 113.88884735107422, 'objective/entropy': 66.57719421386719, 'objective/non_score_reward': -5.694442272186279, 'objective/rlhf_reward': -6.650899410247803, 'objective/scores': -0.956457257270813, 'policy/approxkl_avg': 0.013201066292822361, 'policy/clipfrac_avg': 0.11202830821275711, 'loss/policy_avg': -0.0420086532831192, 'loss/value_avg': 0.7049229741096497, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3445780277252197, 'val/ratio': 0.9769889116287231, 'val/ratio_var': 0.00042591188685037196, 'val/num_eos_tokens': 0, 'lr': 1.7564919157275848e-05, 'episode': 5300, 'epoch': 0.65}



                                                                         
 65%|██████▍   | 1326/2041 [1:55:05<1:01:52,  5.19s/it]

{'eps': 0, 'objective/kl': 103.35208129882812, 'objective/entropy': 52.44633483886719, 'objective/non_score_reward': -5.167604446411133, 'objective/rlhf_reward': -5.917450904846191, 'objective/scores': -0.7498466968536377, 'policy/approxkl_avg': 0.013997518457472324, 'policy/clipfrac_avg': 0.10495283454656601, 'loss/policy_avg': -0.03705650568008423, 'loss/value_avg': 0.47145766019821167, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0684845447540283, 'val/ratio': 1.008845329284668, 'val/ratio_var': 7.22259355825372e-05, 'val/num_eos_tokens': 0, 'lr': 1.7540421362077413e-05, 'episode': 5304, 'epoch': 0.65}



                                                                         
 65%|██████▌   | 1327/2041 [1:55:10<1:01:31,  5.17s/it]

{'eps': 0, 'objective/kl': 97.04814147949219, 'objective/entropy': 60.3553466796875, 'objective/non_score_reward': -4.852407455444336, 'objective/rlhf_reward': -5.882528305053711, 'objective/scores': -1.0301209688186646, 'policy/approxkl_avg': 0.01535749901086092, 'policy/clipfrac_avg': 0.09787736088037491, 'loss/policy_avg': -0.036425165832042694, 'loss/value_avg': 0.47597575187683105, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.059988260269165, 'val/ratio': 1.0260798931121826, 'val/ratio_var': 0.0005929340841248631, 'val/num_eos_tokens': 0, 'lr': 1.751592356687898e-05, 'episode': 5308, 'epoch': 0.65}



                                                                         
 65%|██████▌   | 1328/2041 [1:55:15<1:01:06,  5.14s/it]

{'eps': 0, 'objective/kl': 106.69768524169922, 'objective/entropy': 59.93891906738281, 'objective/non_score_reward': -5.3348846435546875, 'objective/rlhf_reward': -5.931244373321533, 'objective/scores': -0.5963597297668457, 'policy/approxkl_avg': 0.017299054190516472, 'policy/clipfrac_avg': 0.09080188721418381, 'loss/policy_avg': -0.033624060451984406, 'loss/value_avg': 0.5666517019271851, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1202670335769653, 'val/ratio': 1.009556531906128, 'val/ratio_var': 0.00012213816808070987, 'val/num_eos_tokens': 0, 'lr': 1.749142577168055e-05, 'episode': 5312, 'epoch': 0.65}



                                                                         
 65%|██████▌   | 1329/2041 [1:55:20<1:00:58,  5.14s/it]

{'eps': 0, 'objective/kl': 104.53721618652344, 'objective/entropy': 65.31440734863281, 'objective/non_score_reward': -5.226861000061035, 'objective/rlhf_reward': -6.808521270751953, 'objective/scores': -1.581660270690918, 'policy/approxkl_avg': 0.010098202154040337, 'policy/clipfrac_avg': 0.08608490228652954, 'loss/policy_avg': -0.03952775150537491, 'loss/value_avg': 0.5680478811264038, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2843244075775146, 'val/ratio': 0.998072624206543, 'val/ratio_var': 2.7641233373287832e-06, 'val/num_eos_tokens': 0, 'lr': 1.7466927976482117e-05, 'episode': 5316, 'epoch': 0.65}



                                                                         
 65%|██████▌   | 1330/2041 [1:55:25<1:00:57,  5.14s/it]

{'eps': 0, 'objective/kl': 98.81124114990234, 'objective/entropy': 59.89771270751953, 'objective/non_score_reward': -4.9405622482299805, 'objective/rlhf_reward': -6.114418983459473, 'objective/scores': -1.1738569736480713, 'policy/approxkl_avg': 0.016492709517478943, 'policy/clipfrac_avg': 0.08962263911962509, 'loss/policy_avg': -0.03660924732685089, 'loss/value_avg': 0.47799280285835266, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1258139610290527, 'val/ratio': 1.0014781951904297, 'val/ratio_var': 7.3178948696295265e-06, 'val/num_eos_tokens': 0, 'lr': 1.7442430181283685e-05, 'episode': 5320, 'epoch': 0.65}



                                                                         
 65%|██████▌   | 1331/2041 [1:55:30<1:01:16,  5.18s/it]

{'eps': 0, 'objective/kl': 121.09489440917969, 'objective/entropy': 76.06073760986328, 'objective/non_score_reward': -6.054745197296143, 'objective/rlhf_reward': -7.237513542175293, 'objective/scores': -1.1827683448791504, 'policy/approxkl_avg': 0.01629965752363205, 'policy/clipfrac_avg': 0.11320754885673523, 'loss/policy_avg': -0.03798213228583336, 'loss/value_avg': 0.8039270043373108, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.326097011566162, 'val/ratio': 0.9921375513076782, 'val/ratio_var': 3.93281843571458e-05, 'val/num_eos_tokens': 0, 'lr': 1.7417932386085253e-05, 'episode': 5324, 'epoch': 0.65}



                                                                         
 65%|██████▌   | 1332/2041 [1:55:36<1:01:04,  5.17s/it]

{'eps': 0, 'objective/kl': 106.61592102050781, 'objective/entropy': 55.28429412841797, 'objective/non_score_reward': -5.330796241760254, 'objective/rlhf_reward': -6.702248573303223, 'objective/scores': -1.3714520931243896, 'policy/approxkl_avg': 0.015249457210302353, 'policy/clipfrac_avg': 0.08136792480945587, 'loss/policy_avg': -0.0359484888613224, 'loss/value_avg': 0.6480429768562317, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1104626655578613, 'val/ratio': 0.9815815687179565, 'val/ratio_var': 0.0002317954640602693, 'val/num_eos_tokens': 0, 'lr': 1.739343459088682e-05, 'episode': 5328, 'epoch': 0.65}



                                                                         
 65%|██████▌   | 1333/2041 [1:55:41<1:00:44,  5.15s/it]

{'eps': 0, 'objective/kl': 107.97349548339844, 'objective/entropy': 66.47698211669922, 'objective/non_score_reward': -5.398674964904785, 'objective/rlhf_reward': -6.356324195861816, 'objective/scores': -0.9576489925384521, 'policy/approxkl_avg': 0.01440166961401701, 'policy/clipfrac_avg': 0.1037735864520073, 'loss/policy_avg': -0.03985781967639923, 'loss/value_avg': 0.5826128721237183, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2799427509307861, 'val/ratio': 1.0047593116760254, 'val/ratio_var': 4.3379510316299275e-05, 'val/num_eos_tokens': 0, 'lr': 1.736893679568839e-05, 'episode': 5332, 'epoch': 0.65}



                                                                         
 65%|██████▌   | 1334/2041 [1:55:46<1:00:27,  5.13s/it]

{'eps': 0, 'objective/kl': 118.4248046875, 'objective/entropy': 84.1117935180664, 'objective/non_score_reward': -5.92124080657959, 'objective/rlhf_reward': -7.05666446685791, 'objective/scores': -1.1354234218597412, 'policy/approxkl_avg': 0.012390101328492165, 'policy/clipfrac_avg': 0.11202830076217651, 'loss/policy_avg': -0.041209422051906586, 'loss/value_avg': 0.6572486162185669, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.6148350238800049, 'val/ratio': 0.9842198491096497, 'val/ratio_var': 0.00019594495825003833, 'val/num_eos_tokens': 0, 'lr': 1.7344439000489957e-05, 'episode': 5336, 'epoch': 0.65}



                                                                         
 65%|██████▌   | 1335/2041 [1:55:51<1:00:12,  5.12s/it]

{'eps': 0, 'objective/kl': 119.60794067382812, 'objective/entropy': 76.1688232421875, 'objective/non_score_reward': -5.980396270751953, 'objective/rlhf_reward': -6.79352331161499, 'objective/scores': -0.8131270408630371, 'policy/approxkl_avg': 0.013758345507085323, 'policy/clipfrac_avg': 0.10495282709598541, 'loss/policy_avg': -0.036228056997060776, 'loss/value_avg': 0.7188869714736938, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3215422630310059, 'val/ratio': 0.9958452582359314, 'val/ratio_var': 9.400483577337582e-06, 'val/num_eos_tokens': 0, 'lr': 1.7319941205291522e-05, 'episode': 5340, 'epoch': 0.65}



                                                                         
 65%|██████▌   | 1336/2041 [1:55:56<1:00:22,  5.14s/it]

{'eps': 0, 'objective/kl': 119.21247100830078, 'objective/entropy': 62.38688278198242, 'objective/non_score_reward': -5.960623741149902, 'objective/rlhf_reward': -6.987912178039551, 'objective/scores': -1.0272884368896484, 'policy/approxkl_avg': 0.011316200718283653, 'policy/clipfrac_avg': 0.07311321049928665, 'loss/policy_avg': -0.029256528243422508, 'loss/value_avg': 0.8218062520027161, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.314835548400879, 'val/ratio': 0.9849176406860352, 'val/ratio_var': 0.0001641119015403092, 'val/num_eos_tokens': 0, 'lr': 1.7295443410093093e-05, 'episode': 5344, 'epoch': 0.65}



                                                                         
 66%|██████▌   | 1337/2041 [1:56:01<1:00:37,  5.17s/it]

{'eps': 0, 'objective/kl': 113.53972625732422, 'objective/entropy': 49.33570098876953, 'objective/non_score_reward': -5.6769866943359375, 'objective/rlhf_reward': -6.1779680252075195, 'objective/scores': -0.500981330871582, 'policy/approxkl_avg': 0.12416593730449677, 'policy/clipfrac_avg': 0.09433962404727936, 'loss/policy_avg': -0.03204404562711716, 'loss/value_avg': 0.6746784448623657, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9230546951293945, 'val/ratio': 0.9860540628433228, 'val/ratio_var': 9.665390098234639e-05, 'val/num_eos_tokens': 0, 'lr': 1.727094561489466e-05, 'episode': 5348, 'epoch': 0.66}



                                                                         
 66%|██████▌   | 1338/2041 [1:56:06<1:00:25,  5.16s/it]

{'eps': 0, 'objective/kl': 94.90655517578125, 'objective/entropy': 42.95745849609375, 'objective/non_score_reward': -4.745327472686768, 'objective/rlhf_reward': -6.1823601722717285, 'objective/scores': -1.437032699584961, 'policy/approxkl_avg': 0.013750887475907803, 'policy/clipfrac_avg': 0.08136791735887527, 'loss/policy_avg': -0.034530192613601685, 'loss/value_avg': 0.5456635355949402, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8253754377365112, 'val/ratio': 0.9758716225624084, 'val/ratio_var': 0.00042899264371953905, 'val/num_eos_tokens': 0, 'lr': 1.724644781969623e-05, 'episode': 5352, 'epoch': 0.66}



                                                                         
 66%|██████▌   | 1339/2041 [1:56:11<1:00:15,  5.15s/it]

{'eps': 0, 'objective/kl': 119.88510131835938, 'objective/entropy': 85.39799499511719, 'objective/non_score_reward': -5.994255542755127, 'objective/rlhf_reward': -7.853689193725586, 'objective/scores': -1.859433889389038, 'policy/approxkl_avg': 0.010638769716024399, 'policy/clipfrac_avg': 0.09080188721418381, 'loss/policy_avg': -0.04035598784685135, 'loss/value_avg': 0.924789309501648, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.766817331314087, 'val/ratio': 0.9835488200187683, 'val/ratio_var': 0.00021029422350693494, 'val/num_eos_tokens': 0, 'lr': 1.7221950024497797e-05, 'episode': 5356, 'epoch': 0.66}



                                                                         
 66%|██████▌   | 1340/2041 [1:56:17<1:00:19,  5.16s/it]

{'eps': 0, 'objective/kl': 100.61363220214844, 'objective/entropy': 67.556396484375, 'objective/non_score_reward': -5.030681610107422, 'objective/rlhf_reward': -7.213186264038086, 'objective/scores': -2.182504653930664, 'policy/approxkl_avg': 0.010304052382707596, 'policy/clipfrac_avg': 0.09080188721418381, 'loss/policy_avg': -0.03418906033039093, 'loss/value_avg': 0.5944774150848389, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4026634693145752, 'val/ratio': 0.9843029975891113, 'val/ratio_var': 0.00016473012510687113, 'val/num_eos_tokens': 0, 'lr': 1.7197452229299362e-05, 'episode': 5360, 'epoch': 0.66}



                                                                         
 66%|██████▌   | 1341/2041 [1:56:22<59:48,  5.13s/it]

{'eps': 0, 'objective/kl': 102.08688354492188, 'objective/entropy': 85.41943359375, 'objective/non_score_reward': -5.104344367980957, 'objective/rlhf_reward': -6.34017276763916, 'objective/scores': -1.235828161239624, 'policy/approxkl_avg': 0.0109977126121521, 'policy/clipfrac_avg': 0.06721697747707367, 'loss/policy_avg': -0.03403305262327194, 'loss/value_avg': 0.5209251642227173, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5653960704803467, 'val/ratio': 0.9875476360321045, 'val/ratio_var': 0.00010356673010392115, 'val/num_eos_tokens': 0, 'lr': 1.7172954434100934e-05, 'episode': 5364, 'epoch': 0.66}



                                                                         
 66%|██████▌   | 1342/2041 [1:56:27<59:39,  5.12s/it]

{'eps': 0, 'objective/kl': 109.31393432617188, 'objective/entropy': 116.14115905761719, 'objective/non_score_reward': -5.465696811676025, 'objective/rlhf_reward': -7.536691665649414, 'objective/scores': -2.0709948539733887, 'policy/approxkl_avg': 0.008059355430305004, 'policy/clipfrac_avg': 0.07547169923782349, 'loss/policy_avg': -0.035159122198820114, 'loss/value_avg': 0.6446869373321533, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 2.1228599548339844, 'val/ratio': 0.9850696921348572, 'val/ratio_var': 0.00019990315195173025, 'val/num_eos_tokens': 0, 'lr': 1.7148456638902498e-05, 'episode': 5368, 'epoch': 0.66}



                                                                         
 66%|██████▌   | 1343/2041 [1:56:32<59:50,  5.14s/it]

{'eps': 0, 'objective/kl': 115.6260757446289, 'objective/entropy': 93.53573608398438, 'objective/non_score_reward': -5.781303882598877, 'objective/rlhf_reward': -7.411735534667969, 'objective/scores': -1.6304316520690918, 'policy/approxkl_avg': 0.008424543775618076, 'policy/clipfrac_avg': 0.08254717290401459, 'loss/policy_avg': -0.03594280779361725, 'loss/value_avg': 0.8704560399055481, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.7023181915283203, 'val/ratio': 0.9879418611526489, 'val/ratio_var': 0.0001046967227011919, 'val/num_eos_tokens': 0, 'lr': 1.712395884370407e-05, 'episode': 5372, 'epoch': 0.66}



                                                                         
 66%|██████▌   | 1344/2041 [1:56:37<59:48,  5.15s/it]

{'eps': 0, 'objective/kl': 107.32626342773438, 'objective/entropy': 79.59455871582031, 'objective/non_score_reward': -5.366313457489014, 'objective/rlhf_reward': -6.756896495819092, 'objective/scores': -1.3905831575393677, 'policy/approxkl_avg': 0.009425831027328968, 'policy/clipfrac_avg': 0.07075471431016922, 'loss/policy_avg': -0.03395689278841019, 'loss/value_avg': 0.5919890999794006, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4828221797943115, 'val/ratio': 0.9782907962799072, 'val/ratio_var': 0.0003557282325346023, 'val/num_eos_tokens': 0, 'lr': 1.7099461048505634e-05, 'episode': 5376, 'epoch': 0.66}



                                                                         
 66%|██████▌   | 1345/2041 [1:56:42<59:47,  5.16s/it]

{'eps': 0, 'objective/kl': 107.49827575683594, 'objective/entropy': 85.30596923828125, 'objective/non_score_reward': -5.374913692474365, 'objective/rlhf_reward': -7.310901641845703, 'objective/scores': -1.9359880685806274, 'policy/approxkl_avg': 0.01490617636591196, 'policy/clipfrac_avg': 0.10377359390258789, 'loss/policy_avg': -0.04225374385714531, 'loss/value_avg': 0.7796576023101807, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.6133267879486084, 'val/ratio': 0.995115339756012, 'val/ratio_var': 1.3467789358401205e-05, 'val/num_eos_tokens': 0, 'lr': 1.7074963253307202e-05, 'episode': 5380, 'epoch': 0.66}



                                                                         
 66%|██████▌   | 1346/2041 [1:56:47<59:36,  5.15s/it]

{'eps': 0, 'objective/kl': 96.87162780761719, 'objective/entropy': 70.19558715820312, 'objective/non_score_reward': -4.843581676483154, 'objective/rlhf_reward': -5.546645641326904, 'objective/scores': -0.7030638456344604, 'policy/approxkl_avg': 0.017887677997350693, 'policy/clipfrac_avg': 0.08018867671489716, 'loss/policy_avg': -0.03477960824966431, 'loss/value_avg': 0.5075346827507019, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3109138011932373, 'val/ratio': 1.0146616697311401, 'val/ratio_var': 9.94280562736094e-05, 'val/num_eos_tokens': 0, 'lr': 1.705046545810877e-05, 'episode': 5384, 'epoch': 0.66}



                                                                         
 66%|██████▌   | 1347/2041 [1:56:53<59:32,  5.15s/it]

{'eps': 0, 'objective/kl': 100.07440948486328, 'objective/entropy': 53.728206634521484, 'objective/non_score_reward': -5.003720283508301, 'objective/rlhf_reward': -7.013616561889648, 'objective/scores': -2.0098965167999268, 'policy/approxkl_avg': 0.01033259741961956, 'policy/clipfrac_avg': 0.08254717290401459, 'loss/policy_avg': -0.03087124042212963, 'loss/value_avg': 0.6191447377204895, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1307504177093506, 'val/ratio': 0.9814305305480957, 'val/ratio_var': 0.0002576553088147193, 'val/num_eos_tokens': 0, 'lr': 1.702596766291034e-05, 'episode': 5388, 'epoch': 0.66}



                                                                         
 66%|██████▌   | 1348/2041 [1:56:58<59:40,  5.17s/it]

{'eps': 0, 'objective/kl': 107.32164001464844, 'objective/entropy': 64.4974365234375, 'objective/non_score_reward': -5.366082191467285, 'objective/rlhf_reward': -5.785536289215088, 'objective/scores': -0.4194539189338684, 'policy/approxkl_avg': 0.011657634750008583, 'policy/clipfrac_avg': 0.08608490228652954, 'loss/policy_avg': -0.034179817885160446, 'loss/value_avg': 0.38270699977874756, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0965728759765625, 'val/ratio': 0.9810860753059387, 'val/ratio_var': 0.0002970611094497144, 'val/num_eos_tokens': 0, 'lr': 1.7001469867711906e-05, 'episode': 5392, 'epoch': 0.66}



                                                                         
 66%|██████▌   | 1349/2041 [1:57:03<59:49,  5.19s/it]

{'eps': 0, 'objective/kl': 113.63558959960938, 'objective/entropy': 56.12543487548828, 'objective/non_score_reward': -5.681779861450195, 'objective/rlhf_reward': -6.916592597961426, 'objective/scores': -1.23481285572052, 'policy/approxkl_avg': 0.011337668634951115, 'policy/clipfrac_avg': 0.09433962404727936, 'loss/policy_avg': -0.03229563683271408, 'loss/value_avg': 0.6542822122573853, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3088696002960205, 'val/ratio': 0.9795180559158325, 'val/ratio_var': 0.0002826356212608516, 'val/num_eos_tokens': 0, 'lr': 1.6976972072513475e-05, 'episode': 5396, 'epoch': 0.66}



                                                                         
 66%|██████▌   | 1350/2041 [1:57:08<59:40,  5.18s/it]

{'eps': 0, 'objective/kl': 87.95752716064453, 'objective/entropy': 82.50688171386719, 'objective/non_score_reward': -4.397876739501953, 'objective/rlhf_reward': -6.16829252243042, 'objective/scores': -1.7704157829284668, 'policy/approxkl_avg': 0.012129899114370346, 'policy/clipfrac_avg': 0.08608490228652954, 'loss/policy_avg': -0.0352824367582798, 'loss/value_avg': 0.3277105689048767, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5106301307678223, 'val/ratio': 1.0092390775680542, 'val/ratio_var': 7.260058919200674e-05, 'val/num_eos_tokens': 0, 'lr': 1.6952474277315043e-05, 'episode': 5400, 'epoch': 0.66}



                                                                         
 66%|██████▌   | 1351/2041 [1:57:13<59:34,  5.18s/it]

{'eps': 0, 'objective/kl': 89.74183654785156, 'objective/entropy': 52.05670928955078, 'objective/non_score_reward': -4.487092018127441, 'objective/rlhf_reward': -6.0636725425720215, 'objective/scores': -1.5765806436538696, 'policy/approxkl_avg': 0.019149640575051308, 'policy/clipfrac_avg': 0.06721698492765427, 'loss/policy_avg': -0.028367985039949417, 'loss/value_avg': 0.47242066264152527, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0022940635681152, 'val/ratio': 0.994274377822876, 'val/ratio_var': 3.505739368847571e-05, 'val/num_eos_tokens': 0, 'lr': 1.692797648211661e-05, 'episode': 5404, 'epoch': 0.66}



                                                                         
 66%|██████▌   | 1352/2041 [1:57:19<59:32,  5.19s/it]

{'eps': 0, 'objective/kl': 106.60319519042969, 'objective/entropy': 59.036338806152344, 'objective/non_score_reward': -5.330160140991211, 'objective/rlhf_reward': -6.3257880210876465, 'objective/scores': -0.9956278204917908, 'policy/approxkl_avg': 0.028309348970651627, 'policy/clipfrac_avg': 0.10259434580802917, 'loss/policy_avg': -0.035507142543792725, 'loss/value_avg': 0.49623042345046997, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1075363159179688, 'val/ratio': 1.0224977731704712, 'val/ratio_var': 0.000650926202069968, 'val/num_eos_tokens': 0, 'lr': 1.690347868691818e-05, 'episode': 5408, 'epoch': 0.66}



                                                                         
 66%|██████▋   | 1353/2041 [1:57:24<59:20,  5.18s/it]

{'eps': 0, 'objective/kl': 100.48333740234375, 'objective/entropy': 82.58393859863281, 'objective/non_score_reward': -5.024166584014893, 'objective/rlhf_reward': -7.010817527770996, 'objective/scores': -1.986651062965393, 'policy/approxkl_avg': 0.030040061101317406, 'policy/clipfrac_avg': 0.09433961659669876, 'loss/policy_avg': -0.04298136755824089, 'loss/value_avg': 0.6211323738098145, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5636961460113525, 'val/ratio': 0.9992465376853943, 'val/ratio_var': 3.8783637137385085e-06, 'val/num_eos_tokens': 0, 'lr': 1.6878980891719747e-05, 'episode': 5412, 'epoch': 0.66}



                                                                         
 66%|██████▋   | 1354/2041 [1:57:29<59:18,  5.18s/it]

{'eps': 0, 'objective/kl': 91.6767578125, 'objective/entropy': 47.602142333984375, 'objective/non_score_reward': -4.583837985992432, 'objective/rlhf_reward': -6.264278411865234, 'objective/scores': -1.6804406642913818, 'policy/approxkl_avg': 0.010899091139435768, 'policy/clipfrac_avg': 0.0766509398818016, 'loss/policy_avg': -0.029390649870038033, 'loss/value_avg': 0.360381543636322, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0474905967712402, 'val/ratio': 0.9984862804412842, 'val/ratio_var': 1.2749527513733483e-06, 'val/num_eos_tokens': 0, 'lr': 1.6854483096521315e-05, 'episode': 5416, 'epoch': 0.66}



                                                                         
 66%|██████▋   | 1355/2041 [1:57:34<59:28,  5.20s/it]

{'eps': 0, 'objective/kl': 94.76106262207031, 'objective/entropy': 56.74311065673828, 'objective/non_score_reward': -4.738053321838379, 'objective/rlhf_reward': -6.883877754211426, 'objective/scores': -2.145824670791626, 'policy/approxkl_avg': 0.011949678882956505, 'policy/clipfrac_avg': 0.08254717290401459, 'loss/policy_avg': -0.03238609805703163, 'loss/value_avg': 0.631946325302124, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9921172857284546, 'val/ratio': 0.9863168001174927, 'val/ratio_var': 0.00012941418390255421, 'val/num_eos_tokens': 0, 'lr': 1.6829985301322883e-05, 'episode': 5420, 'epoch': 0.66}



                                                                         
 66%|██████▋   | 1356/2041 [1:57:39<59:28,  5.21s/it]

{'eps': 0, 'objective/kl': 88.91917419433594, 'objective/entropy': 56.77793884277344, 'objective/non_score_reward': -4.445959091186523, 'objective/rlhf_reward': -6.573943614959717, 'objective/scores': -2.1279845237731934, 'policy/approxkl_avg': 0.01061338558793068, 'policy/clipfrac_avg': 0.0766509473323822, 'loss/policy_avg': -0.02938436158001423, 'loss/value_avg': 0.35544973611831665, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0700315237045288, 'val/ratio': 0.9884508848190308, 'val/ratio_var': 9.042935562320054e-05, 'val/num_eos_tokens': 0, 'lr': 1.6805487506124447e-05, 'episode': 5424, 'epoch': 0.66}



                                                                         
 66%|██████▋   | 1357/2041 [1:57:45<59:24,  5.21s/it]

{'eps': 0, 'objective/kl': 93.7757568359375, 'objective/entropy': 48.76234817504883, 'objective/non_score_reward': -4.688788414001465, 'objective/rlhf_reward': -6.309811592102051, 'objective/scores': -1.621023178100586, 'policy/approxkl_avg': 0.008132304064929485, 'policy/clipfrac_avg': 0.07900942862033844, 'loss/policy_avg': -0.030078697949647903, 'loss/value_avg': 0.3540385961532593, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8650282621383667, 'val/ratio': 0.9894755482673645, 'val/ratio_var': 8.745183004066348e-05, 'val/num_eos_tokens': 0, 'lr': 1.678098971092602e-05, 'episode': 5428, 'epoch': 0.66}



                                                                         
 67%|██████▋   | 1358/2041 [1:57:50<59:07,  5.19s/it]

{'eps': 0, 'objective/kl': 112.62693786621094, 'objective/entropy': 56.37909698486328, 'objective/non_score_reward': -5.63134765625, 'objective/rlhf_reward': -7.074625015258789, 'objective/scores': -1.4432775974273682, 'policy/approxkl_avg': 0.01278565265238285, 'policy/clipfrac_avg': 0.08608490973711014, 'loss/policy_avg': -0.03640181943774223, 'loss/value_avg': 0.6160749197006226, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0559449195861816, 'val/ratio': 0.9959385395050049, 'val/ratio_var': 9.84802045422839e-06, 'val/num_eos_tokens': 0, 'lr': 1.6756491915727584e-05, 'episode': 5432, 'epoch': 0.67}



                                                                         
 67%|██████▋   | 1359/2041 [1:57:55<58:55,  5.18s/it]

{'eps': 0, 'objective/kl': 108.71837615966797, 'objective/entropy': 42.77732849121094, 'objective/non_score_reward': -5.435918807983398, 'objective/rlhf_reward': -6.508358955383301, 'objective/scores': -1.0724399089813232, 'policy/approxkl_avg': 0.011591208167374134, 'policy/clipfrac_avg': 0.07547169923782349, 'loss/policy_avg': -0.03609124943614006, 'loss/value_avg': 0.608245313167572, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8957345485687256, 'val/ratio': 1.0033915042877197, 'val/ratio_var': 2.002922883548308e-05, 'val/num_eos_tokens': 0, 'lr': 1.6731994120529155e-05, 'episode': 5436, 'epoch': 0.67}



                                                                         
 67%|██████▋   | 1360/2041 [1:58:00<59:11,  5.22s/it]

{'eps': 0, 'objective/kl': 104.38103485107422, 'objective/entropy': 42.75943374633789, 'objective/non_score_reward': -5.219051837921143, 'objective/rlhf_reward': -6.778861045837402, 'objective/scores': -1.5598093271255493, 'policy/approxkl_avg': 0.015597120858728886, 'policy/clipfrac_avg': 0.05778301879763603, 'loss/policy_avg': -0.022911641746759415, 'loss/value_avg': 0.4896520972251892, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7930977940559387, 'val/ratio': 0.9832192659378052, 'val/ratio_var': 0.000208078243304044, 'val/num_eos_tokens': 0, 'lr': 1.670749632533072e-05, 'episode': 5440, 'epoch': 0.67}



                                                                         
 67%|██████▋   | 1361/2041 [1:58:05<58:45,  5.19s/it]

{'eps': 0, 'objective/kl': 106.56871032714844, 'objective/entropy': 51.970428466796875, 'objective/non_score_reward': -5.328435897827148, 'objective/rlhf_reward': -7.01657772064209, 'objective/scores': -1.6881415843963623, 'policy/approxkl_avg': 0.009365230798721313, 'policy/clipfrac_avg': 0.08136792480945587, 'loss/policy_avg': -0.03038979321718216, 'loss/value_avg': 0.6173462867736816, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0104830265045166, 'val/ratio': 0.9826081395149231, 'val/ratio_var': 0.00024445561575703323, 'val/num_eos_tokens': 0, 'lr': 1.6682998530132288e-05, 'episode': 5444, 'epoch': 0.67}



                                                                         
 67%|██████▋   | 1362/2041 [1:58:11<58:38,  5.18s/it]

{'eps': 0, 'objective/kl': 99.1473388671875, 'objective/entropy': 48.795921325683594, 'objective/non_score_reward': -4.957366943359375, 'objective/rlhf_reward': -5.982824325561523, 'objective/scores': -1.0254571437835693, 'policy/approxkl_avg': 0.012041481211781502, 'policy/clipfrac_avg': 0.08608490228652954, 'loss/policy_avg': -0.03253135085105896, 'loss/value_avg': 0.44829195737838745, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8568501472473145, 'val/ratio': 1.0043964385986328, 'val/ratio_var': 1.0579358786344528e-05, 'val/num_eos_tokens': 0, 'lr': 1.6658500734933856e-05, 'episode': 5448, 'epoch': 0.67}



                                                                         
 67%|██████▋   | 1363/2041 [1:58:16<58:26,  5.17s/it]

{'eps': 0, 'objective/kl': 93.80694580078125, 'objective/entropy': 47.6314697265625, 'objective/non_score_reward': -4.690347194671631, 'objective/rlhf_reward': -5.635879993438721, 'objective/scores': -0.9455329775810242, 'policy/approxkl_avg': 0.010621201246976852, 'policy/clipfrac_avg': 0.07783018797636032, 'loss/policy_avg': -0.03162635862827301, 'loss/value_avg': 0.43381914496421814, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8630021810531616, 'val/ratio': 0.9857569932937622, 'val/ratio_var': 0.00016230774053838104, 'val/num_eos_tokens': 0, 'lr': 1.6634002939735424e-05, 'episode': 5452, 'epoch': 0.67}



                                                                         
 67%|██████▋   | 1364/2041 [1:58:21<58:34,  5.19s/it]

{'eps': 0, 'objective/kl': 93.99080657958984, 'objective/entropy': 40.95915222167969, 'objective/non_score_reward': -4.699540615081787, 'objective/rlhf_reward': -6.774731159210205, 'objective/scores': -2.075190544128418, 'policy/approxkl_avg': 0.008002398535609245, 'policy/clipfrac_avg': 0.06721697747707367, 'loss/policy_avg': -0.029634598642587662, 'loss/value_avg': 0.5804431438446045, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8185478448867798, 'val/ratio': 0.984290361404419, 'val/ratio_var': 0.00019196500943508, 'val/num_eos_tokens': 0, 'lr': 1.6609505144536995e-05, 'episode': 5456, 'epoch': 0.67}



                                                                         
 67%|██████▋   | 1365/2041 [1:58:26<58:39,  5.21s/it]

{'eps': 0, 'objective/kl': 115.5786361694336, 'objective/entropy': 51.78044128417969, 'objective/non_score_reward': -5.778931617736816, 'objective/rlhf_reward': -7.700189590454102, 'objective/scores': -1.9212582111358643, 'policy/approxkl_avg': 0.17685724794864655, 'policy/clipfrac_avg': 0.09551886469125748, 'loss/policy_avg': -0.037352003157138824, 'loss/value_avg': 0.7170037627220154, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1332061290740967, 'val/ratio': 0.9759283661842346, 'val/ratio_var': 0.0004439706972334534, 'val/num_eos_tokens': 0, 'lr': 1.658500734933856e-05, 'episode': 5460, 'epoch': 0.67}



                                                                         
 67%|██████▋   | 1366/2041 [1:58:31<58:28,  5.20s/it]

{'eps': 0, 'objective/kl': 88.91673278808594, 'objective/entropy': 59.89636993408203, 'objective/non_score_reward': -4.445836544036865, 'objective/rlhf_reward': -5.217990875244141, 'objective/scores': -0.7721545696258545, 'policy/approxkl_avg': 0.013795244507491589, 'policy/clipfrac_avg': 0.11202830076217651, 'loss/policy_avg': -0.04094338417053223, 'loss/value_avg': 0.29825326800346375, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0212392807006836, 'val/ratio': 0.9918941855430603, 'val/ratio_var': 3.8630412745987996e-05, 'val/num_eos_tokens': 0, 'lr': 1.6560509554140128e-05, 'episode': 5464, 'epoch': 0.67}



                                                                         
 67%|██████▋   | 1367/2041 [1:58:37<58:24,  5.20s/it]

{'eps': 0, 'objective/kl': 94.36015319824219, 'objective/entropy': 40.541542053222656, 'objective/non_score_reward': -4.718008041381836, 'objective/rlhf_reward': -6.459308624267578, 'objective/scores': -1.7413007020950317, 'policy/approxkl_avg': 0.01019328460097313, 'policy/clipfrac_avg': 0.08136792480945587, 'loss/policy_avg': -0.029415275901556015, 'loss/value_avg': 0.41046735644340515, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8552066087722778, 'val/ratio': 0.9962799549102783, 'val/ratio_var': 1.158340455731377e-05, 'val/num_eos_tokens': 0, 'lr': 1.6536011758941696e-05, 'episode': 5468, 'epoch': 0.67}



                                                                         
 67%|██████▋   | 1368/2041 [1:58:42<57:53,  5.16s/it]

{'eps': 0, 'objective/kl': 98.32380676269531, 'objective/entropy': 31.367189407348633, 'objective/non_score_reward': -4.916190147399902, 'objective/rlhf_reward': -6.583263874053955, 'objective/scores': -1.6670736074447632, 'policy/approxkl_avg': 0.008199202828109264, 'policy/clipfrac_avg': 0.0625, 'loss/policy_avg': -0.025053109973669052, 'loss/value_avg': 0.40572354197502136, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6888734102249146, 'val/ratio': 0.98360276222229, 'val/ratio_var': 0.00021319161169230938, 'val/num_eos_tokens': 0, 'lr': 1.6511513963743264e-05, 'episode': 5472, 'epoch': 0.67}



                                                                         
 67%|██████▋   | 1369/2041 [1:58:47<57:53,  5.17s/it]

{'eps': 0, 'objective/kl': 105.50515747070312, 'objective/entropy': 39.07493591308594, 'objective/non_score_reward': -5.2752580642700195, 'objective/rlhf_reward': -6.571245193481445, 'objective/scores': -1.2959873676300049, 'policy/approxkl_avg': 0.009688305668532848, 'policy/clipfrac_avg': 0.07311320304870605, 'loss/policy_avg': -0.028132786974310875, 'loss/value_avg': 0.47248679399490356, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7419213056564331, 'val/ratio': 0.9805215001106262, 'val/ratio_var': 0.0002751376014202833, 'val/num_eos_tokens': 0, 'lr': 1.6487016168544832e-05, 'episode': 5476, 'epoch': 0.67}



                                                                         
 67%|██████▋   | 1370/2041 [1:58:52<57:48,  5.17s/it]

{'eps': 0, 'objective/kl': 97.53115844726562, 'objective/entropy': 34.209228515625, 'objective/non_score_reward': -4.876558303833008, 'objective/rlhf_reward': -6.123482704162598, 'objective/scores': -1.2469245195388794, 'policy/approxkl_avg': 0.013692712411284447, 'policy/clipfrac_avg': 0.08136792480945587, 'loss/policy_avg': -0.03152379393577576, 'loss/value_avg': 0.4080069959163666, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7665589451789856, 'val/ratio': 1.0015498399734497, 'val/ratio_var': 1.186408189823851e-05, 'val/num_eos_tokens': 0, 'lr': 1.64625183733464e-05, 'episode': 5480, 'epoch': 0.67}



                                                                         
 67%|██████▋   | 1371/2041 [1:58:57<57:44,  5.17s/it]

{'eps': 0, 'objective/kl': 79.35212707519531, 'objective/entropy': 25.175886154174805, 'objective/non_score_reward': -3.96760630607605, 'objective/rlhf_reward': -5.514753341674805, 'objective/scores': -1.5471469163894653, 'policy/approxkl_avg': 0.010725302621722221, 'policy/clipfrac_avg': 0.07193396240472794, 'loss/policy_avg': -0.028780536726117134, 'loss/value_avg': 0.4356275200843811, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5955499410629272, 'val/ratio': 0.9959232211112976, 'val/ratio_var': 2.1890251446166076e-05, 'val/num_eos_tokens': 0, 'lr': 1.6438020578147968e-05, 'episode': 5484, 'epoch': 0.67}



                                                                         
 67%|██████▋   | 1372/2041 [1:59:02<57:39,  5.17s/it]

{'eps': 0, 'objective/kl': 91.83868408203125, 'objective/entropy': 43.39491271972656, 'objective/non_score_reward': -4.5919342041015625, 'objective/rlhf_reward': -6.485381126403809, 'objective/scores': -1.8934470415115356, 'policy/approxkl_avg': 0.012391616590321064, 'policy/clipfrac_avg': 0.09316036850214005, 'loss/policy_avg': -0.03827500343322754, 'loss/value_avg': 0.5159585475921631, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8521610498428345, 'val/ratio': 0.9886622428894043, 'val/ratio_var': 8.165025064954534e-05, 'val/num_eos_tokens': 0, 'lr': 1.6413522782949536e-05, 'episode': 5488, 'epoch': 0.67}



                                                                         
 67%|██████▋   | 1373/2041 [1:59:08<57:32,  5.17s/it]

{'eps': 0, 'objective/kl': 92.67471313476562, 'objective/entropy': 46.04689025878906, 'objective/non_score_reward': -4.633735656738281, 'objective/rlhf_reward': -5.454195022583008, 'objective/scores': -0.8204594254493713, 'policy/approxkl_avg': 0.05927795544266701, 'policy/clipfrac_avg': 0.08490566909313202, 'loss/policy_avg': -0.03139916807413101, 'loss/value_avg': 0.34759825468063354, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7402429580688477, 'val/ratio': 0.9989144802093506, 'val/ratio_var': 4.94511732540559e-06, 'val/num_eos_tokens': 0, 'lr': 1.6389024987751104e-05, 'episode': 5492, 'epoch': 0.67}



                                                                         
 67%|██████▋   | 1374/2041 [1:59:13<57:05,  5.14s/it]

{'eps': 0, 'objective/kl': 106.58982849121094, 'objective/entropy': 62.988685607910156, 'objective/non_score_reward': -5.32949161529541, 'objective/rlhf_reward': -6.478498458862305, 'objective/scores': -1.1490070819854736, 'policy/approxkl_avg': 0.00846037920564413, 'policy/clipfrac_avg': 0.08608490973711014, 'loss/policy_avg': -0.035515353083610535, 'loss/value_avg': 0.49213922023773193, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1241847276687622, 'val/ratio': 0.9888135194778442, 'val/ratio_var': 9.411609062226489e-05, 'val/num_eos_tokens': 0, 'lr': 1.636452719255267e-05, 'episode': 5496, 'epoch': 0.67}



                                                                         
 67%|██████▋   | 1375/2041 [1:59:18<56:56,  5.13s/it]

{'eps': 0, 'objective/kl': 100.46038818359375, 'objective/entropy': 43.261714935302734, 'objective/non_score_reward': -5.023019790649414, 'objective/rlhf_reward': -6.05756139755249, 'objective/scores': -1.0345417261123657, 'policy/approxkl_avg': 0.006746148224920034, 'policy/clipfrac_avg': 0.0554245263338089, 'loss/policy_avg': -0.023455994203686714, 'loss/value_avg': 0.5077787637710571, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.798785924911499, 'val/ratio': 0.9779829978942871, 'val/ratio_var': 0.0004299446882214397, 'val/num_eos_tokens': 0, 'lr': 1.634002939735424e-05, 'episode': 5500, 'epoch': 0.67}



                                                                         
 67%|██████▋   | 1376/2041 [1:59:23<56:57,  5.14s/it]

{'eps': 0, 'objective/kl': 92.29684448242188, 'objective/entropy': 35.99989318847656, 'objective/non_score_reward': -4.614842414855957, 'objective/rlhf_reward': -6.507416725158691, 'objective/scores': -1.8925743103027344, 'policy/approxkl_avg': 0.008065344765782356, 'policy/clipfrac_avg': 0.056603774428367615, 'loss/policy_avg': -0.026076894253492355, 'loss/value_avg': 0.41745615005493164, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7955959439277649, 'val/ratio': 0.9810238480567932, 'val/ratio_var': 0.00028988640406168997, 'val/num_eos_tokens': 0, 'lr': 1.6315531602155805e-05, 'episode': 5504, 'epoch': 0.67}



                                                                         
 67%|██████▋   | 1377/2041 [1:59:28<56:49,  5.13s/it]

{'eps': 0, 'objective/kl': 96.57244873046875, 'objective/entropy': 45.63275146484375, 'objective/non_score_reward': -4.828622817993164, 'objective/rlhf_reward': -6.684389114379883, 'objective/scores': -1.8557664155960083, 'policy/approxkl_avg': 0.01001005619764328, 'policy/clipfrac_avg': 0.07311321049928665, 'loss/policy_avg': -0.028139542788267136, 'loss/value_avg': 0.4065985679626465, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7858300805091858, 'val/ratio': 0.9878045916557312, 'val/ratio_var': 0.00011399939103284851, 'val/num_eos_tokens': 0, 'lr': 1.6291033806957376e-05, 'episode': 5508, 'epoch': 0.67}



                                                                         
 68%|██████▊   | 1378/2041 [1:59:33<56:40,  5.13s/it]

{'eps': 0, 'objective/kl': 98.92294311523438, 'objective/entropy': 52.82444763183594, 'objective/non_score_reward': -4.9461469650268555, 'objective/rlhf_reward': -6.246661186218262, 'objective/scores': -1.3005142211914062, 'policy/approxkl_avg': 0.012882493436336517, 'policy/clipfrac_avg': 0.0695754736661911, 'loss/policy_avg': -0.034362830221652985, 'loss/value_avg': 0.43153923749923706, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0284132957458496, 'val/ratio': 0.9893590211868286, 'val/ratio_var': 7.94570951256901e-05, 'val/num_eos_tokens': 0, 'lr': 1.6266536011758944e-05, 'episode': 5512, 'epoch': 0.68}



                                                                         
 68%|██████▊   | 1379/2041 [1:59:38<56:45,  5.14s/it]

{'eps': 0, 'objective/kl': 99.58637237548828, 'objective/entropy': 44.3339958190918, 'objective/non_score_reward': -4.979318618774414, 'objective/rlhf_reward': -6.2758636474609375, 'objective/scores': -1.2965452671051025, 'policy/approxkl_avg': 0.023929843679070473, 'policy/clipfrac_avg': 0.06721697747707367, 'loss/policy_avg': -0.02772827260196209, 'loss/value_avg': 0.4675890803337097, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.751258909702301, 'val/ratio': 1.0115031003952026, 'val/ratio_var': 0.00015275915211532265, 'val/num_eos_tokens': 0, 'lr': 1.624203821656051e-05, 'episode': 5516, 'epoch': 0.68}



                                                                         
 68%|██████▊   | 1380/2041 [1:59:43<56:54,  5.17s/it]

{'eps': 0, 'objective/kl': 102.71697235107422, 'objective/entropy': 44.46025848388672, 'objective/non_score_reward': -5.1358489990234375, 'objective/rlhf_reward': -6.800063610076904, 'objective/scores': -1.6642146110534668, 'policy/approxkl_avg': 0.02184157632291317, 'policy/clipfrac_avg': 0.0766509473323822, 'loss/policy_avg': -0.03119797818362713, 'loss/value_avg': 0.5024530291557312, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7816320657730103, 'val/ratio': 0.9753066301345825, 'val/ratio_var': 0.00040722990524955094, 'val/num_eos_tokens': 0, 'lr': 1.621754042136208e-05, 'episode': 5520, 'epoch': 0.68}



                                                                         
 68%|██████▊   | 1381/2041 [1:59:49<56:37,  5.15s/it]

{'eps': 0, 'objective/kl': 106.81604766845703, 'objective/entropy': 28.281539916992188, 'objective/non_score_reward': -5.3408026695251465, 'objective/rlhf_reward': -6.574089050292969, 'objective/scores': -1.2332863807678223, 'policy/approxkl_avg': 0.019756179302930832, 'policy/clipfrac_avg': 0.06367924064397812, 'loss/policy_avg': -0.0272066630423069, 'loss/value_avg': 0.6192449927330017, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6410561203956604, 'val/ratio': 0.9802131652832031, 'val/ratio_var': 0.00030696194153279066, 'val/num_eos_tokens': 0, 'lr': 1.6193042626163645e-05, 'episode': 5524, 'epoch': 0.68}



                                                                         
 68%|██████▊   | 1382/2041 [1:59:54<56:32,  5.15s/it]

{'eps': 0, 'objective/kl': 95.788818359375, 'objective/entropy': 36.0601921081543, 'objective/non_score_reward': -4.789441108703613, 'objective/rlhf_reward': -5.853324890136719, 'objective/scores': -1.0638837814331055, 'policy/approxkl_avg': 0.016827832907438278, 'policy/clipfrac_avg': 0.07900943607091904, 'loss/policy_avg': -0.02311268262565136, 'loss/value_avg': 0.31188803911209106, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5976210832595825, 'val/ratio': 0.9760580062866211, 'val/ratio_var': 0.0004330937808845192, 'val/num_eos_tokens': 0, 'lr': 1.6168544830965213e-05, 'episode': 5528, 'epoch': 0.68}



                                                                         
 68%|██████▊   | 1383/2041 [1:59:59<56:25,  5.14s/it]

{'eps': 0, 'objective/kl': 91.95404815673828, 'objective/entropy': 18.536205291748047, 'objective/non_score_reward': -4.597702503204346, 'objective/rlhf_reward': -7.232677459716797, 'objective/scores': -2.634974956512451, 'policy/approxkl_avg': 0.05676070973277092, 'policy/clipfrac_avg': 0.0695754736661911, 'loss/policy_avg': -0.02003135159611702, 'loss/value_avg': 0.46781498193740845, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.4313311278820038, 'val/ratio': 0.9838442206382751, 'val/ratio_var': 0.0001981353125302121, 'val/num_eos_tokens': 0, 'lr': 1.614404703576678e-05, 'episode': 5532, 'epoch': 0.68}



                                                                         
 68%|██████▊   | 1384/2041 [2:00:04<56:11,  5.13s/it]

{'eps': 0, 'objective/kl': 95.80400085449219, 'objective/entropy': 30.461153030395508, 'objective/non_score_reward': -4.790200233459473, 'objective/rlhf_reward': -5.736166954040527, 'objective/scores': -0.9459665417671204, 'policy/approxkl_avg': 0.014739502221345901, 'policy/clipfrac_avg': 0.09316037595272064, 'loss/policy_avg': -0.032259251922369, 'loss/value_avg': 0.39794719219207764, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7258644104003906, 'val/ratio': 0.9756540656089783, 'val/ratio_var': 0.00047808539238758385, 'val/num_eos_tokens': 0, 'lr': 1.611954924056835e-05, 'episode': 5536, 'epoch': 0.68}



                                                                         
 68%|██████▊   | 1385/2041 [2:00:09<56:14,  5.14s/it]

{'eps': 0, 'objective/kl': 115.90917205810547, 'objective/entropy': 38.56264877319336, 'objective/non_score_reward': -5.795458793640137, 'objective/rlhf_reward': -7.2064433097839355, 'objective/scores': -1.4109845161437988, 'policy/approxkl_avg': 0.016772884875535965, 'policy/clipfrac_avg': 0.08018867671489716, 'loss/policy_avg': -0.03190533444285393, 'loss/value_avg': 0.5972033143043518, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7167019248008728, 'val/ratio': 0.976658046245575, 'val/ratio_var': 0.00035527642467059195, 'val/num_eos_tokens': 0, 'lr': 1.6095051445369917e-05, 'episode': 5540, 'epoch': 0.68}



                                                                         
 68%|██████▊   | 1386/2041 [2:00:14<56:27,  5.17s/it]

{'eps': 0, 'objective/kl': 104.80821228027344, 'objective/entropy': 38.82145690917969, 'objective/non_score_reward': -5.240410804748535, 'objective/rlhf_reward': -6.961574077606201, 'objective/scores': -1.721163272857666, 'policy/approxkl_avg': 0.008858929388225079, 'policy/clipfrac_avg': 0.0625, 'loss/policy_avg': -0.022491060197353363, 'loss/value_avg': 0.5194757580757141, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7830519080162048, 'val/ratio': 0.9813128709793091, 'val/ratio_var': 0.0002851911121979356, 'val/num_eos_tokens': 0, 'lr': 1.6070553650171485e-05, 'episode': 5544, 'epoch': 0.68}



                                                                         
 68%|██████▊   | 1387/2041 [2:00:19<56:10,  5.15s/it]

{'eps': 0, 'objective/kl': 109.71632385253906, 'objective/entropy': 46.035858154296875, 'objective/non_score_reward': -5.48581600189209, 'objective/rlhf_reward': -7.723056793212891, 'objective/scores': -2.237240791320801, 'policy/approxkl_avg': 0.017422936856746674, 'policy/clipfrac_avg': 0.08608490228652954, 'loss/policy_avg': -0.0347520150244236, 'loss/value_avg': 0.6776732206344604, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8151999711990356, 'val/ratio': 0.9786368608474731, 'val/ratio_var': 0.0003303073171991855, 'val/num_eos_tokens': 0, 'lr': 1.6046055854973053e-05, 'episode': 5548, 'epoch': 0.68}



                                                                         
 68%|██████▊   | 1388/2041 [2:00:25<56:29,  5.19s/it]

{'eps': 0, 'objective/kl': 95.41041564941406, 'objective/entropy': 41.670249938964844, 'objective/non_score_reward': -4.77052116394043, 'objective/rlhf_reward': -6.501233100891113, 'objective/scores': -1.7307119369506836, 'policy/approxkl_avg': 0.007613226305693388, 'policy/clipfrac_avg': 0.07311321049928665, 'loss/policy_avg': -0.028728634119033813, 'loss/value_avg': 0.37790346145629883, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6512085795402527, 'val/ratio': 0.9835728406906128, 'val/ratio_var': 0.0002266232477268204, 'val/num_eos_tokens': 0, 'lr': 1.602155805977462e-05, 'episode': 5552, 'epoch': 0.68}



                                                                         
 68%|██████▊   | 1389/2041 [2:00:30<56:16,  5.18s/it]

{'eps': 0, 'objective/kl': 115.94966125488281, 'objective/entropy': 47.21314239501953, 'objective/non_score_reward': -5.797483444213867, 'objective/rlhf_reward': -7.211389541625977, 'objective/scores': -1.413906216621399, 'policy/approxkl_avg': 0.016840558499097824, 'policy/clipfrac_avg': 0.09198112785816193, 'loss/policy_avg': -0.032327841967344284, 'loss/value_avg': 0.7843233942985535, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9162063598632812, 'val/ratio': 0.9998762011528015, 'val/ratio_var': 5.444288490252802e-06, 'val/num_eos_tokens': 0, 'lr': 1.599706026457619e-05, 'episode': 5556, 'epoch': 0.68}



                                                                         
 68%|██████▊   | 1390/2041 [2:00:35<55:58,  5.16s/it]

{'eps': 0, 'objective/kl': 84.73165893554688, 'objective/entropy': 27.727222442626953, 'objective/non_score_reward': -4.2365827560424805, 'objective/rlhf_reward': -5.526972770690918, 'objective/scores': -1.2903897762298584, 'policy/approxkl_avg': 0.010147374123334885, 'policy/clipfrac_avg': 0.06721697747707367, 'loss/policy_avg': -0.02374522015452385, 'loss/value_avg': 0.37527596950531006, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5956119298934937, 'val/ratio': 0.9968617558479309, 'val/ratio_var': 8.217131835408509e-06, 'val/num_eos_tokens': 0, 'lr': 1.5972562469377758e-05, 'episode': 5560, 'epoch': 0.68}



                                                                         
 68%|██████▊   | 1391/2041 [2:00:40<55:44,  5.14s/it]

{'eps': 0, 'objective/kl': 89.56837463378906, 'objective/entropy': 36.74779510498047, 'objective/non_score_reward': -4.478418827056885, 'objective/rlhf_reward': -6.474281311035156, 'objective/scores': -1.9958627223968506, 'policy/approxkl_avg': 0.0051713790744543076, 'policy/clipfrac_avg': 0.05424528568983078, 'loss/policy_avg': -0.022266477346420288, 'loss/value_avg': 0.38553327322006226, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7631956338882446, 'val/ratio': 0.991007387638092, 'val/ratio_var': 6.733313057338819e-05, 'val/num_eos_tokens': 0, 'lr': 1.5948064674179326e-05, 'episode': 5564, 'epoch': 0.68}



                                                                         
 68%|██████▊   | 1392/2041 [2:00:45<55:38,  5.14s/it]

{'eps': 0, 'objective/kl': 95.83528137207031, 'objective/entropy': 31.151687622070312, 'objective/non_score_reward': -4.791763782501221, 'objective/rlhf_reward': -6.460654258728027, 'objective/scores': -1.6688905954360962, 'policy/approxkl_avg': 0.019962860271334648, 'policy/clipfrac_avg': 0.09080188721418381, 'loss/policy_avg': -0.031113026663661003, 'loss/value_avg': 0.36195123195648193, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5983846187591553, 'val/ratio': 0.9910247921943665, 'val/ratio_var': 4.038342376588844e-05, 'val/num_eos_tokens': 0, 'lr': 1.592356687898089e-05, 'episode': 5568, 'epoch': 0.68}



                                                                         
 68%|██████▊   | 1393/2041 [2:00:50<55:19,  5.12s/it]

{'eps': 0, 'objective/kl': 104.84750366210938, 'objective/entropy': 35.144065856933594, 'objective/non_score_reward': -5.242375373840332, 'objective/rlhf_reward': -6.766871929168701, 'objective/scores': -1.5244966745376587, 'policy/approxkl_avg': 0.01170880813151598, 'policy/clipfrac_avg': 0.06485848873853683, 'loss/policy_avg': -0.02573956921696663, 'loss/value_avg': 0.5680620074272156, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6641371250152588, 'val/ratio': 0.9930402040481567, 'val/ratio_var': 2.757435322564561e-05, 'val/num_eos_tokens': 0, 'lr': 1.589906908378246e-05, 'episode': 5572, 'epoch': 0.68}



                                                                         
 68%|██████▊   | 1394/2041 [2:00:56<55:23,  5.14s/it]

{'eps': 0, 'objective/kl': 112.8210220336914, 'objective/entropy': 47.638526916503906, 'objective/non_score_reward': -5.641051292419434, 'objective/rlhf_reward': -8.097402572631836, 'objective/scores': -2.4563517570495605, 'policy/approxkl_avg': 0.03210677206516266, 'policy/clipfrac_avg': 0.07900943607091904, 'loss/policy_avg': -0.027697576209902763, 'loss/value_avg': 0.8152596354484558, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7814774513244629, 'val/ratio': 0.9808816909790039, 'val/ratio_var': 0.0002294042642461136, 'val/num_eos_tokens': 0, 'lr': 1.587457128858403e-05, 'episode': 5576, 'epoch': 0.68}



                                                                         
 68%|██████▊   | 1395/2041 [2:01:01<55:32,  5.16s/it]

{'eps': 0, 'objective/kl': 99.04846954345703, 'objective/entropy': 45.938453674316406, 'objective/non_score_reward': -4.952423095703125, 'objective/rlhf_reward': -6.684007167816162, 'objective/scores': -1.7315841913223267, 'policy/approxkl_avg': 0.010119530372321606, 'policy/clipfrac_avg': 0.06721697747707367, 'loss/policy_avg': -0.027831465005874634, 'loss/value_avg': 0.5141393542289734, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7985624074935913, 'val/ratio': 0.9855564832687378, 'val/ratio_var': 0.00014012871542945504, 'val/num_eos_tokens': 0, 'lr': 1.5850073493385594e-05, 'episode': 5580, 'epoch': 0.68}



                                                                         
 68%|██████▊   | 1396/2041 [2:01:06<55:17,  5.14s/it]

{'eps': 0, 'objective/kl': 102.40534973144531, 'objective/entropy': 44.16730499267578, 'objective/non_score_reward': -5.120267868041992, 'objective/rlhf_reward': -7.0560126304626465, 'objective/scores': -1.9357446432113647, 'policy/approxkl_avg': 0.012824518606066704, 'policy/clipfrac_avg': 0.08844339847564697, 'loss/policy_avg': -0.03376869112253189, 'loss/value_avg': 0.5720828771591187, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7731642723083496, 'val/ratio': 0.9949047565460205, 'val/ratio_var': 1.2988413800485432e-05, 'val/num_eos_tokens': 0, 'lr': 1.5825575698187166e-05, 'episode': 5584, 'epoch': 0.68}



                                                                         
 68%|██████▊   | 1397/2041 [2:01:11<55:10,  5.14s/it]

{'eps': 0, 'objective/kl': 98.40234375, 'objective/entropy': 37.48772430419922, 'objective/non_score_reward': -4.920117378234863, 'objective/rlhf_reward': -6.61219596862793, 'objective/scores': -1.6920785903930664, 'policy/approxkl_avg': 0.00847725197672844, 'policy/clipfrac_avg': 0.0624999962747097, 'loss/policy_avg': -0.025363147258758545, 'loss/value_avg': 0.4134131968021393, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6122019290924072, 'val/ratio': 1.0039567947387695, 'val/ratio_var': 2.2367312340065837e-05, 'val/num_eos_tokens': 0, 'lr': 1.580107790298873e-05, 'episode': 5588, 'epoch': 0.68}



                                                                         
 68%|██████▊   | 1398/2041 [2:01:16<55:12,  5.15s/it]

{'eps': 0, 'objective/kl': 96.67652893066406, 'objective/entropy': 35.25492858886719, 'objective/non_score_reward': -4.833826065063477, 'objective/rlhf_reward': -6.833040237426758, 'objective/scores': -1.9992144107818604, 'policy/approxkl_avg': 0.006443138234317303, 'policy/clipfrac_avg': 0.06132075935602188, 'loss/policy_avg': -0.024042299017310143, 'loss/value_avg': 0.46368879079818726, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6653763055801392, 'val/ratio': 0.9880547523498535, 'val/ratio_var': 0.00010458409087732434, 'val/num_eos_tokens': 0, 'lr': 1.5776580107790302e-05, 'episode': 5592, 'epoch': 0.69}



                                                                         
 69%|██████▊   | 1399/2041 [2:01:21<55:27,  5.18s/it]

{'eps': 0, 'objective/kl': 91.89204406738281, 'objective/entropy': 29.640811920166016, 'objective/non_score_reward': -4.594603061676025, 'objective/rlhf_reward': -5.638060569763184, 'objective/scores': -1.0434577465057373, 'policy/approxkl_avg': 0.011343983002007008, 'policy/clipfrac_avg': 0.05188679322600365, 'loss/policy_avg': -0.0268312469124794, 'loss/value_avg': 0.3180568516254425, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6526601910591125, 'val/ratio': 0.9987558722496033, 'val/ratio_var': 1.437999617337482e-05, 'val/num_eos_tokens': 0, 'lr': 1.5752082312591867e-05, 'episode': 5596, 'epoch': 0.69}



                                                                         
 69%|██████▊   | 1400/2041 [2:01:27<55:18,  5.18s/it]

{'eps': 0, 'objective/kl': 97.25950622558594, 'objective/entropy': 39.45909118652344, 'objective/non_score_reward': -4.862975120544434, 'objective/rlhf_reward': -6.7613654136657715, 'objective/scores': -1.898390293121338, 'policy/approxkl_avg': 0.01408049650490284, 'policy/clipfrac_avg': 0.0766509473323822, 'loss/policy_avg': -0.031084662303328514, 'loss/value_avg': 0.5846782922744751, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6661405563354492, 'val/ratio': 0.9813127517700195, 'val/ratio_var': 0.00021319669031072408, 'val/num_eos_tokens': 0, 'lr': 1.5727584517393435e-05, 'episode': 5600, 'epoch': 0.69}



                                                                         
 69%|██████▊   | 1401/2041 [2:01:32<55:01,  5.16s/it]

{'eps': 0, 'objective/kl': 97.67433166503906, 'objective/entropy': 34.91541290283203, 'objective/non_score_reward': -4.883716583251953, 'objective/rlhf_reward': -6.428988456726074, 'objective/scores': -1.5452719926834106, 'policy/approxkl_avg': 0.008919917978346348, 'policy/clipfrac_avg': 0.07311320304870605, 'loss/policy_avg': -0.027851343154907227, 'loss/value_avg': 0.35898521542549133, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6338192820549011, 'val/ratio': 0.9861462116241455, 'val/ratio_var': 0.00014893178013153374, 'val/num_eos_tokens': 0, 'lr': 1.5703086722195003e-05, 'episode': 5604, 'epoch': 0.69}



                                                                         
 69%|██████▊   | 1402/2041 [2:01:37<55:23,  5.20s/it]

{'eps': 0, 'objective/kl': 72.25654602050781, 'objective/entropy': 24.269556045532227, 'objective/non_score_reward': -3.6128273010253906, 'objective/rlhf_reward': -5.415584564208984, 'objective/scores': -1.8027572631835938, 'policy/approxkl_avg': 0.015649063512682915, 'policy/clipfrac_avg': 0.0625, 'loss/policy_avg': -0.02251235768198967, 'loss/value_avg': 0.2676597833633423, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.479529470205307, 'val/ratio': 0.9731423258781433, 'val/ratio_var': 0.0005059303366579115, 'val/num_eos_tokens': 0, 'lr': 1.567858892699657e-05, 'episode': 5608, 'epoch': 0.69}



                                                                         
 69%|██████▊   | 1403/2041 [2:01:42<55:17,  5.20s/it]

{'eps': 0, 'objective/kl': 93.18032836914062, 'objective/entropy': 27.902193069458008, 'objective/non_score_reward': -4.6590166091918945, 'objective/rlhf_reward': -6.711480140686035, 'objective/scores': -2.0524637699127197, 'policy/approxkl_avg': 0.03340771794319153, 'policy/clipfrac_avg': 0.0554245300590992, 'loss/policy_avg': -0.02197471633553505, 'loss/value_avg': 0.49333834648132324, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6224055290222168, 'val/ratio': 0.9785444140434265, 'val/ratio_var': 0.00033477856777608395, 'val/num_eos_tokens': 0, 'lr': 1.565409113179814e-05, 'episode': 5612, 'epoch': 0.69}



                                                                         
 69%|██████▉   | 1404/2041 [2:01:47<54:56,  5.18s/it]

{'eps': 0, 'objective/kl': 110.38825988769531, 'objective/entropy': 55.846466064453125, 'objective/non_score_reward': -5.519413471221924, 'objective/rlhf_reward': -7.772218227386475, 'objective/scores': -2.252804756164551, 'policy/approxkl_avg': 0.011002001352608204, 'policy/clipfrac_avg': 0.07783018797636032, 'loss/policy_avg': -0.03701487183570862, 'loss/value_avg': 0.6497557759284973, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.976783812046051, 'val/ratio': 0.9766157269477844, 'val/ratio_var': 0.0004233574727550149, 'val/num_eos_tokens': 0, 'lr': 1.5629593336599707e-05, 'episode': 5616, 'epoch': 0.69}



                                                                         
 69%|██████▉   | 1405/2041 [2:01:52<54:42,  5.16s/it]

{'eps': 0, 'objective/kl': 99.94129943847656, 'objective/entropy': 35.99882507324219, 'objective/non_score_reward': -4.99706506729126, 'objective/rlhf_reward': -7.012041091918945, 'objective/scores': -2.0149762630462646, 'policy/approxkl_avg': 0.005303112789988518, 'policy/clipfrac_avg': 0.04599056392908096, 'loss/policy_avg': -0.020230602473020554, 'loss/value_avg': 0.4811907410621643, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6754235029220581, 'val/ratio': 0.9886941313743591, 'val/ratio_var': 0.00011729044490493834, 'val/num_eos_tokens': 0, 'lr': 1.5605095541401275e-05, 'episode': 5620, 'epoch': 0.69}



                                                                         
 69%|██████▉   | 1406/2041 [2:01:58<54:32,  5.15s/it]

{'eps': 0, 'objective/kl': 95.3848876953125, 'objective/entropy': 44.84783172607422, 'objective/non_score_reward': -4.76924467086792, 'objective/rlhf_reward': -7.118585109710693, 'objective/scores': -2.3493404388427734, 'policy/approxkl_avg': 0.024712519720196724, 'policy/clipfrac_avg': 0.053066037595272064, 'loss/policy_avg': -0.02690805494785309, 'loss/value_avg': 0.5031975507736206, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8447160720825195, 'val/ratio': 0.9831705689430237, 'val/ratio_var': 0.00020092121849302202, 'val/num_eos_tokens': 0, 'lr': 1.5580597746202843e-05, 'episode': 5624, 'epoch': 0.69}



                                                                         
 69%|██████▉   | 1407/2041 [2:02:03<54:25,  5.15s/it]

{'eps': 0, 'objective/kl': 90.67890930175781, 'objective/entropy': 31.973346710205078, 'objective/non_score_reward': -4.5339460372924805, 'objective/rlhf_reward': -6.832444190979004, 'objective/scores': -2.2984981536865234, 'policy/approxkl_avg': 0.007430812809616327, 'policy/clipfrac_avg': 0.049528300762176514, 'loss/policy_avg': -0.024146540090441704, 'loss/value_avg': 0.4300156831741333, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.634318470954895, 'val/ratio': 0.9836264848709106, 'val/ratio_var': 0.0002049307367997244, 'val/num_eos_tokens': 0, 'lr': 1.555609995100441e-05, 'episode': 5628, 'epoch': 0.69}



                                                                         
 69%|██████▉   | 1408/2041 [2:02:08<54:26,  5.16s/it]

{'eps': 0, 'objective/kl': 92.04336547851562, 'objective/entropy': 34.83723068237305, 'objective/non_score_reward': -4.602168083190918, 'objective/rlhf_reward': -6.496818542480469, 'objective/scores': -1.8946503400802612, 'policy/approxkl_avg': 0.020532870665192604, 'policy/clipfrac_avg': 0.0837264135479927, 'loss/policy_avg': -0.027875225991010666, 'loss/value_avg': 0.3362066447734833, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6911121606826782, 'val/ratio': 0.9781707525253296, 'val/ratio_var': 0.00031091575510799885, 'val/num_eos_tokens': 0, 'lr': 1.5531602155805976e-05, 'episode': 5632, 'epoch': 0.69}



                                                                         
 69%|██████▉   | 1409/2041 [2:02:13<54:23,  5.16s/it]

{'eps': 0, 'objective/kl': 95.66537475585938, 'objective/entropy': 27.497520446777344, 'objective/non_score_reward': -4.783268928527832, 'objective/rlhf_reward': -6.307538986206055, 'objective/scores': -1.524269938468933, 'policy/approxkl_avg': 0.007939985021948814, 'policy/clipfrac_avg': 0.04009433835744858, 'loss/policy_avg': -0.01890357956290245, 'loss/value_avg': 0.41184595227241516, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5825399160385132, 'val/ratio': 0.9858876466751099, 'val/ratio_var': 0.0001517034397693351, 'val/num_eos_tokens': 0, 'lr': 1.5507104360607547e-05, 'episode': 5636, 'epoch': 0.69}



                                                                         
 69%|██████▉   | 1410/2041 [2:02:18<54:12,  5.15s/it]

{'eps': 0, 'objective/kl': 98.19914245605469, 'objective/entropy': 28.24335479736328, 'objective/non_score_reward': -4.909956932067871, 'objective/rlhf_reward': -7.319745063781738, 'objective/scores': -2.409787893295288, 'policy/approxkl_avg': 0.011449739336967468, 'policy/clipfrac_avg': 0.04716981202363968, 'loss/policy_avg': -0.02123960852622986, 'loss/value_avg': 0.4885328710079193, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.4938206076622009, 'val/ratio': 0.9831418991088867, 'val/ratio_var': 0.00021237651526462287, 'val/num_eos_tokens': 0, 'lr': 1.5482606565409115e-05, 'episode': 5640, 'epoch': 0.69}



                                                                         
 69%|██████▉   | 1411/2041 [2:02:23<54:22,  5.18s/it]

{'eps': 0, 'objective/kl': 107.82246398925781, 'objective/entropy': 40.66490936279297, 'objective/non_score_reward': -5.391123294830322, 'objective/rlhf_reward': -6.907737731933594, 'objective/scores': -1.5166144371032715, 'policy/approxkl_avg': 0.01525143627077341, 'policy/clipfrac_avg': 0.056603774428367615, 'loss/policy_avg': -0.02220323495566845, 'loss/value_avg': 0.5177849531173706, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5670963525772095, 'val/ratio': 1.0204973220825195, 'val/ratio_var': 0.0005930177285335958, 'val/num_eos_tokens': 0, 'lr': 1.5458108770210683e-05, 'episode': 5644, 'epoch': 0.69}



                                                                         
 69%|██████▉   | 1412/2041 [2:02:29<54:25,  5.19s/it]

{'eps': 0, 'objective/kl': 69.78936767578125, 'objective/entropy': 38.5327033996582, 'objective/non_score_reward': -3.4894680976867676, 'objective/rlhf_reward': -6.214864730834961, 'objective/scores': -2.7253966331481934, 'policy/approxkl_avg': 0.014379705302417278, 'policy/clipfrac_avg': 0.07193396240472794, 'loss/policy_avg': -0.028832998126745224, 'loss/value_avg': 0.3510872423648834, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6978826522827148, 'val/ratio': 1.0041677951812744, 'val/ratio_var': 1.7198146451846696e-05, 'val/num_eos_tokens': 0, 'lr': 1.543361097501225e-05, 'episode': 5648, 'epoch': 0.69}



                                                                         
 69%|██████▉   | 1413/2041 [2:02:34<54:06,  5.17s/it]

{'eps': 0, 'objective/kl': 90.16923522949219, 'objective/entropy': 21.50592041015625, 'objective/non_score_reward': -4.508461952209473, 'objective/rlhf_reward': -7.039647579193115, 'objective/scores': -2.5311856269836426, 'policy/approxkl_avg': 0.010043364018201828, 'policy/clipfrac_avg': 0.05778301879763603, 'loss/policy_avg': -0.023497752845287323, 'loss/value_avg': 0.3595491945743561, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.4877181351184845, 'val/ratio': 0.9769856929779053, 'val/ratio_var': 0.00041362762567587197, 'val/num_eos_tokens': 0, 'lr': 1.5409113179813816e-05, 'episode': 5652, 'epoch': 0.69}



                                                                         
 69%|██████▉   | 1414/2041 [2:02:39<53:45,  5.14s/it]

{'eps': 0, 'objective/kl': 83.97496795654297, 'objective/entropy': 22.277437210083008, 'objective/non_score_reward': -4.198748588562012, 'objective/rlhf_reward': -5.660150527954102, 'objective/scores': -1.4614018201828003, 'policy/approxkl_avg': 0.009551106952130795, 'policy/clipfrac_avg': 0.06132075563073158, 'loss/policy_avg': -0.02142118476331234, 'loss/value_avg': 0.4207853674888611, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.49101078510284424, 'val/ratio': 0.9869413375854492, 'val/ratio_var': 0.00014134470256976783, 'val/num_eos_tokens': 0, 'lr': 1.5384615384615387e-05, 'episode': 5656, 'epoch': 0.69}



                                                                         
 69%|██████▉   | 1415/2041 [2:02:44<54:07,  5.19s/it]

{'eps': 0, 'objective/kl': 85.68732452392578, 'objective/entropy': 29.763195037841797, 'objective/non_score_reward': -4.284366607666016, 'objective/rlhf_reward': -6.984928607940674, 'objective/scores': -2.700562000274658, 'policy/approxkl_avg': 0.008142241276800632, 'policy/clipfrac_avg': 0.049528300762176514, 'loss/policy_avg': -0.02041388303041458, 'loss/value_avg': 0.479474812746048, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5674171447753906, 'val/ratio': 0.9972226023674011, 'val/ratio_var': 7.377127531071892e-06, 'val/num_eos_tokens': 0, 'lr': 1.5360117589416952e-05, 'episode': 5660, 'epoch': 0.69}



                                                                         
 69%|██████▉   | 1416/2041 [2:02:49<53:47,  5.16s/it]

{'eps': 0, 'objective/kl': 91.90689086914062, 'objective/entropy': 28.307897567749023, 'objective/non_score_reward': -4.5953450202941895, 'objective/rlhf_reward': -7.1141180992126465, 'objective/scores': -2.518773078918457, 'policy/approxkl_avg': 0.011216184124350548, 'policy/clipfrac_avg': 0.07075472176074982, 'loss/policy_avg': -0.02797584980726242, 'loss/value_avg': 0.4978879988193512, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6410291790962219, 'val/ratio': 0.9832406640052795, 'val/ratio_var': 0.00020213871903251857, 'val/num_eos_tokens': 0, 'lr': 1.533561979421852e-05, 'episode': 5664, 'epoch': 0.69}



                                                                         
 69%|██████▉   | 1417/2041 [2:02:54<53:45,  5.17s/it]

{'eps': 0, 'objective/kl': 79.30819702148438, 'objective/entropy': 32.454833984375, 'objective/non_score_reward': -3.965409994125366, 'objective/rlhf_reward': -6.457746982574463, 'objective/scores': -2.4923369884490967, 'policy/approxkl_avg': 0.010459646582603455, 'policy/clipfrac_avg': 0.07311320304870605, 'loss/policy_avg': -0.022733192890882492, 'loss/value_avg': 0.2901536226272583, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5284276604652405, 'val/ratio': 0.9975391030311584, 'val/ratio_var': 3.847997049888363e-06, 'val/num_eos_tokens': 0, 'lr': 1.5311121999020088e-05, 'episode': 5668, 'epoch': 0.69}



                                                                         
 69%|██████▉   | 1418/2041 [2:03:00<53:47,  5.18s/it]

{'eps': 0, 'objective/kl': 90.94818115234375, 'objective/entropy': 38.031253814697266, 'objective/non_score_reward': -4.5474090576171875, 'objective/rlhf_reward': -7.135385513305664, 'objective/scores': -2.5879764556884766, 'policy/approxkl_avg': 0.010603470727801323, 'policy/clipfrac_avg': 0.0695754736661911, 'loss/policy_avg': -0.023728840053081512, 'loss/value_avg': 0.37367159128189087, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6406583189964294, 'val/ratio': 0.9911059737205505, 'val/ratio_var': 4.815158899873495e-05, 'val/num_eos_tokens': 0, 'lr': 1.5286624203821656e-05, 'episode': 5672, 'epoch': 0.69}



                                                                         
 70%|██████▉   | 1419/2041 [2:03:05<53:26,  5.16s/it]

{'eps': 0, 'objective/kl': 100.01739501953125, 'objective/entropy': 40.19075012207031, 'objective/non_score_reward': -5.0008697509765625, 'objective/rlhf_reward': -6.9803667068481445, 'objective/scores': -1.9794971942901611, 'policy/approxkl_avg': 0.010713471099734306, 'policy/clipfrac_avg': 0.08844339102506638, 'loss/policy_avg': -0.03429177775979042, 'loss/value_avg': 0.4481419324874878, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7275186777114868, 'val/ratio': 0.995272159576416, 'val/ratio_var': 1.666878779360559e-05, 'val/num_eos_tokens': 0, 'lr': 1.5262126408623227e-05, 'episode': 5676, 'epoch': 0.7}



                                                                         
 70%|██████▉   | 1420/2041 [2:03:10<53:10,  5.14s/it]

{'eps': 0, 'objective/kl': 93.04594421386719, 'objective/entropy': 46.766422271728516, 'objective/non_score_reward': -4.6522979736328125, 'objective/rlhf_reward': -6.9235615730285645, 'objective/scores': -2.271263599395752, 'policy/approxkl_avg': 0.010624322108924389, 'policy/clipfrac_avg': 0.08018868416547775, 'loss/policy_avg': -0.029611993581056595, 'loss/value_avg': 0.41903144121170044, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.783456563949585, 'val/ratio': 0.9962698221206665, 'val/ratio_var': 1.244855684490176e-05, 'val/num_eos_tokens': 0, 'lr': 1.5237628613424792e-05, 'episode': 5680, 'epoch': 0.7}



                                                                         
 70%|██████▉   | 1421/2041 [2:03:15<53:05,  5.14s/it]

{'eps': 0, 'objective/kl': 105.29273986816406, 'objective/entropy': 32.388145446777344, 'objective/non_score_reward': -5.264636993408203, 'objective/rlhf_reward': -7.216529369354248, 'objective/scores': -1.9518924951553345, 'policy/approxkl_avg': 0.008722884580492973, 'policy/clipfrac_avg': 0.057783015072345734, 'loss/policy_avg': -0.02027846686542034, 'loss/value_avg': 0.4638849198818207, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6406180262565613, 'val/ratio': 0.9967528581619263, 'val/ratio_var': 4.806021024705842e-06, 'val/num_eos_tokens': 0, 'lr': 1.521313081822636e-05, 'episode': 5684, 'epoch': 0.7}



                                                                         
 70%|██████▉   | 1422/2041 [2:03:20<53:11,  5.16s/it]

{'eps': 0, 'objective/kl': 92.8985595703125, 'objective/entropy': 40.467140197753906, 'objective/non_score_reward': -4.644927978515625, 'objective/rlhf_reward': -6.475350379943848, 'objective/scores': -1.8304226398468018, 'policy/approxkl_avg': 0.00721120135858655, 'policy/clipfrac_avg': 0.06603772938251495, 'loss/policy_avg': -0.024825017899274826, 'loss/value_avg': 0.40962541103363037, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7552231550216675, 'val/ratio': 0.9896878004074097, 'val/ratio_var': 7.511310832342133e-05, 'val/num_eos_tokens': 0, 'lr': 1.5188633023027928e-05, 'episode': 5688, 'epoch': 0.7}



                                                                         
 70%|██████▉   | 1423/2041 [2:03:25<53:08,  5.16s/it]

{'eps': 0, 'objective/kl': 92.79339599609375, 'objective/entropy': 40.01969909667969, 'objective/non_score_reward': -4.639669418334961, 'objective/rlhf_reward': -7.190676212310791, 'objective/scores': -2.55100679397583, 'policy/approxkl_avg': 0.018191827461123466, 'policy/clipfrac_avg': 0.07429245859384537, 'loss/policy_avg': -0.030138010159134865, 'loss/value_avg': 0.4129115343093872, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6915897130966187, 'val/ratio': 0.9782724380493164, 'val/ratio_var': 0.0003170623676851392, 'val/num_eos_tokens': 0, 'lr': 1.5164135227829496e-05, 'episode': 5692, 'epoch': 0.7}



                                                                         
 70%|██████▉   | 1424/2041 [2:03:31<53:12,  5.17s/it]

{'eps': 0, 'objective/kl': 93.57998657226562, 'objective/entropy': 38.98699951171875, 'objective/non_score_reward': -4.678999423980713, 'objective/rlhf_reward': -6.9435625076293945, 'objective/scores': -2.2645630836486816, 'policy/approxkl_avg': 0.012007452547550201, 'policy/clipfrac_avg': 0.0695754736661911, 'loss/policy_avg': -0.02719816565513611, 'loss/value_avg': 0.5574267506599426, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6972351670265198, 'val/ratio': 0.9965386986732483, 'val/ratio_var': 1.721489206829574e-05, 'val/num_eos_tokens': 0, 'lr': 1.5139637432631066e-05, 'episode': 5696, 'epoch': 0.7}



                                                                         
 70%|██████▉   | 1425/2041 [2:03:36<53:00,  5.16s/it]

{'eps': 0, 'objective/kl': 112.6122055053711, 'objective/entropy': 39.81077194213867, 'objective/non_score_reward': -5.630610466003418, 'objective/rlhf_reward': -8.400495529174805, 'objective/scores': -2.7698850631713867, 'policy/approxkl_avg': 0.010813710279762745, 'policy/clipfrac_avg': 0.08372640609741211, 'loss/policy_avg': -0.032954830676317215, 'loss/value_avg': 0.6520885825157166, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7437006235122681, 'val/ratio': 1.003035068511963, 'val/ratio_var': 4.572598754748469e-06, 'val/num_eos_tokens': 0, 'lr': 1.5115139637432632e-05, 'episode': 5700, 'epoch': 0.7}



                                                                         
 70%|██████▉   | 1426/2041 [2:03:41<53:02,  5.18s/it]

{'eps': 0, 'objective/kl': 118.33853149414062, 'objective/entropy': 49.703590393066406, 'objective/non_score_reward': -5.916926860809326, 'objective/rlhf_reward': -8.375703811645508, 'objective/scores': -2.4587767124176025, 'policy/approxkl_avg': 0.012998068705201149, 'policy/clipfrac_avg': 0.0966981053352356, 'loss/policy_avg': -0.036251507699489594, 'loss/value_avg': 0.7906873822212219, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8243322372436523, 'val/ratio': 0.9783584475517273, 'val/ratio_var': 0.0003642095543909818, 'val/num_eos_tokens': 0, 'lr': 1.5090641842234199e-05, 'episode': 5704, 'epoch': 0.7}



                                                                         
 70%|██████▉   | 1427/2041 [2:03:46<52:55,  5.17s/it]

{'eps': 0, 'objective/kl': 88.15965270996094, 'objective/entropy': 34.99971008300781, 'objective/non_score_reward': -4.40798282623291, 'objective/rlhf_reward': -6.3153252601623535, 'objective/scores': -1.907342553138733, 'policy/approxkl_avg': 0.009062510915100574, 'policy/clipfrac_avg': 0.060141511261463165, 'loss/policy_avg': -0.022959738969802856, 'loss/value_avg': 0.31119298934936523, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5195585489273071, 'val/ratio': 0.994066596031189, 'val/ratio_var': 2.3363454602076672e-05, 'val/num_eos_tokens': 0, 'lr': 1.5066144047035768e-05, 'episode': 5708, 'epoch': 0.7}



                                                                         
 70%|██████▉   | 1428/2041 [2:03:51<52:39,  5.15s/it]

{'eps': 0, 'objective/kl': 98.05242156982422, 'objective/entropy': 40.45610046386719, 'objective/non_score_reward': -4.902621269226074, 'objective/rlhf_reward': -6.1776838302612305, 'objective/scores': -1.2750625610351562, 'policy/approxkl_avg': 0.02180001139640808, 'policy/clipfrac_avg': 0.08608490973711014, 'loss/policy_avg': -0.028371360152959824, 'loss/value_avg': 0.5200560092926025, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6646420955657959, 'val/ratio': 1.0290193557739258, 'val/ratio_var': 0.0009609712287783623, 'val/num_eos_tokens': 0, 'lr': 1.5041646251837335e-05, 'episode': 5712, 'epoch': 0.7}



                                                                         
 70%|███████   | 1429/2041 [2:03:56<52:45,  5.17s/it]

{'eps': 0, 'objective/kl': 95.12078857421875, 'objective/entropy': 28.886335372924805, 'objective/non_score_reward': -4.756039619445801, 'objective/rlhf_reward': -6.864898681640625, 'objective/scores': -2.1088593006134033, 'policy/approxkl_avg': 0.0065807136707007885, 'policy/clipfrac_avg': 0.05070754513144493, 'loss/policy_avg': -0.021114211529493332, 'loss/value_avg': 0.4308391213417053, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5248264670372009, 'val/ratio': 0.9874186515808105, 'val/ratio_var': 0.00013295600365381688, 'val/num_eos_tokens': 0, 'lr': 1.5017148456638903e-05, 'episode': 5716, 'epoch': 0.7}



                                                                         
 70%|███████   | 1430/2041 [2:04:05<1:02:30,  6.14s/it]

{'eps': 0, 'objective/kl': 104.25712585449219, 'objective/entropy': 34.46951675415039, 'objective/non_score_reward': -5.212856292724609, 'objective/rlhf_reward': -7.380937099456787, 'objective/scores': -2.1680808067321777, 'policy/approxkl_avg': 0.009515631943941116, 'policy/clipfrac_avg': 0.07311321049928665, 'loss/policy_avg': -0.028659433126449585, 'loss/value_avg': 0.48578566312789917, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6158179044723511, 'val/ratio': 0.9803123474121094, 'val/ratio_var': 0.00030289628193713725, 'val/num_eos_tokens': 0, 'lr': 1.499265066144047e-05, 'episode': 5720, 'epoch': 0.7}



                                                                         
 70%|███████   | 1431/2041 [2:04:10<59:16,  5.83s/it]

{'eps': 0, 'objective/kl': 96.64686584472656, 'objective/entropy': 43.947975158691406, 'objective/non_score_reward': -4.832343101501465, 'objective/rlhf_reward': -7.114134311676025, 'objective/scores': -2.2817912101745605, 'policy/approxkl_avg': 0.008460365235805511, 'policy/clipfrac_avg': 0.05424528196454048, 'loss/policy_avg': -0.023235760629177094, 'loss/value_avg': 0.4665846824645996, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6651031374931335, 'val/ratio': 0.9924068450927734, 'val/ratio_var': 3.937545625376515e-05, 'val/num_eos_tokens': 0, 'lr': 1.4968152866242039e-05, 'episode': 5724, 'epoch': 0.7}



                                                                         
 70%|███████   | 1432/2041 [2:04:15<56:59,  5.62s/it]

{'eps': 0, 'objective/kl': 106.1640396118164, 'objective/entropy': 35.21969223022461, 'objective/non_score_reward': -5.308201789855957, 'objective/rlhf_reward': -6.900878429412842, 'objective/scores': -1.5926767587661743, 'policy/approxkl_avg': 0.007241922430694103, 'policy/clipfrac_avg': 0.056603774428367615, 'loss/policy_avg': -0.025577502325177193, 'loss/value_avg': 0.42657437920570374, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6810866594314575, 'val/ratio': 0.9830442667007446, 'val/ratio_var': 0.0002434135094517842, 'val/num_eos_tokens': 0, 'lr': 1.4943655071043609e-05, 'episode': 5728, 'epoch': 0.7}



                                                                         
 70%|███████   | 1433/2041 [2:04:20<55:31,  5.48s/it]

{'eps': 0, 'objective/kl': 87.02984619140625, 'objective/entropy': 20.816875457763672, 'objective/non_score_reward': -4.351491928100586, 'objective/rlhf_reward': -6.494871139526367, 'objective/scores': -2.143378973007202, 'policy/approxkl_avg': 0.006341388449072838, 'policy/clipfrac_avg': 0.0554245263338089, 'loss/policy_avg': -0.018970748409628868, 'loss/value_avg': 0.40933263301849365, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.4758882224559784, 'val/ratio': 0.9977409839630127, 'val/ratio_var': 2.449173507557134e-06, 'val/num_eos_tokens': 0, 'lr': 1.4919157275845175e-05, 'episode': 5732, 'epoch': 0.7}



                                                                         
 70%|███████   | 1434/2041 [2:04:25<54:38,  5.40s/it]

{'eps': 0, 'objective/kl': 98.00814056396484, 'objective/entropy': 28.07415008544922, 'objective/non_score_reward': -4.900407314300537, 'objective/rlhf_reward': -7.13302755355835, 'objective/scores': -2.2326202392578125, 'policy/approxkl_avg': 0.007470675744116306, 'policy/clipfrac_avg': 0.061320751905441284, 'loss/policy_avg': -0.02523610182106495, 'loss/value_avg': 0.511478066444397, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6301579475402832, 'val/ratio': 0.9928734302520752, 'val/ratio_var': 4.238863766659051e-05, 'val/num_eos_tokens': 0, 'lr': 1.4894659480646741e-05, 'episode': 5736, 'epoch': 0.7}



                                                                         
 70%|███████   | 1435/2041 [2:04:31<53:41,  5.32s/it]

{'eps': 0, 'objective/kl': 100.66947937011719, 'objective/entropy': 29.05499267578125, 'objective/non_score_reward': -5.033473968505859, 'objective/rlhf_reward': -6.974571704864502, 'objective/scores': -1.9410977363586426, 'policy/approxkl_avg': 0.008830397389829159, 'policy/clipfrac_avg': 0.0625, 'loss/policy_avg': -0.026645749807357788, 'loss/value_avg': 0.5347423553466797, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6489564180374146, 'val/ratio': 0.9826885461807251, 'val/ratio_var': 0.00020384765230119228, 'val/num_eos_tokens': 0, 'lr': 1.4870161685448311e-05, 'episode': 5740, 'epoch': 0.7}



                                                                         
 70%|███████   | 1436/2041 [2:04:36<53:05,  5.26s/it]

{'eps': 0, 'objective/kl': 105.56246185302734, 'objective/entropy': 32.08462142944336, 'objective/non_score_reward': -5.278123378753662, 'objective/rlhf_reward': -6.94550895690918, 'objective/scores': -1.6673853397369385, 'policy/approxkl_avg': 0.02399885654449463, 'policy/clipfrac_avg': 0.08962263911962509, 'loss/policy_avg': -0.029472772032022476, 'loss/value_avg': 0.5081143379211426, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5671440362930298, 'val/ratio': 0.9853394031524658, 'val/ratio_var': 0.0001286118640564382, 'val/num_eos_tokens': 0, 'lr': 1.4845663890249877e-05, 'episode': 5744, 'epoch': 0.7}



                                                                         
 70%|███████   | 1437/2041 [2:04:41<52:41,  5.23s/it]

{'eps': 0, 'objective/kl': 106.90774536132812, 'objective/entropy': 44.553314208984375, 'objective/non_score_reward': -5.3453874588012695, 'objective/rlhf_reward': -6.928233623504639, 'objective/scores': -1.5828461647033691, 'policy/approxkl_avg': 0.014865880832076073, 'policy/clipfrac_avg': 0.07547169923782349, 'loss/policy_avg': -0.027855122461915016, 'loss/value_avg': 0.5732367038726807, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.778616189956665, 'val/ratio': 0.9770424365997314, 'val/ratio_var': 0.0004028554249089211, 'val/num_eos_tokens': 0, 'lr': 1.4821166095051445e-05, 'episode': 5748, 'epoch': 0.7}



                                                                         
 70%|███████   | 1438/2041 [2:04:46<52:10,  5.19s/it]

{'eps': 0, 'objective/kl': 98.2598876953125, 'objective/entropy': 41.56904602050781, 'objective/non_score_reward': -4.912994861602783, 'objective/rlhf_reward': -6.661558151245117, 'objective/scores': -1.7485631704330444, 'policy/approxkl_avg': 0.011048061773180962, 'policy/clipfrac_avg': 0.07547169923782349, 'loss/policy_avg': -0.02842017263174057, 'loss/value_avg': 0.6079801321029663, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.791674017906189, 'val/ratio': 0.9820542335510254, 'val/ratio_var': 0.0002382568345637992, 'val/num_eos_tokens': 0, 'lr': 1.4796668299853013e-05, 'episode': 5752, 'epoch': 0.7}



                                                                         
 71%|███████   | 1439/2041 [2:04:51<51:51,  5.17s/it]

{'eps': 0, 'objective/kl': 91.00147247314453, 'objective/entropy': 37.14533996582031, 'objective/non_score_reward': -4.550073623657227, 'objective/rlhf_reward': -6.244544982910156, 'objective/scores': -1.6944714784622192, 'policy/approxkl_avg': 0.010368390940129757, 'policy/clipfrac_avg': 0.0766509473323822, 'loss/policy_avg': -0.02991134487092495, 'loss/value_avg': 0.6062309741973877, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7698118090629578, 'val/ratio': 0.9832019209861755, 'val/ratio_var': 0.00023266568314284086, 'val/num_eos_tokens': 0, 'lr': 1.4772170504654581e-05, 'episode': 5756, 'epoch': 0.71}



                                                                         
 71%|███████   | 1440/2041 [2:04:56<52:03,  5.20s/it]

{'eps': 0, 'objective/kl': 92.53439331054688, 'objective/entropy': 39.425636291503906, 'objective/non_score_reward': -4.6267194747924805, 'objective/rlhf_reward': -5.992071151733398, 'objective/scores': -1.365351915359497, 'policy/approxkl_avg': 0.00921687763184309, 'policy/clipfrac_avg': 0.06721697747707367, 'loss/policy_avg': -0.026190120726823807, 'loss/value_avg': 0.36102867126464844, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7950511574745178, 'val/ratio': 0.9894134998321533, 'val/ratio_var': 8.426405838690698e-05, 'val/num_eos_tokens': 0, 'lr': 1.4747672709456151e-05, 'episode': 5760, 'epoch': 0.71}



                                                                         
 71%|███████   | 1441/2041 [2:05:01<51:41,  5.17s/it]

{'eps': 0, 'objective/kl': 86.61778259277344, 'objective/entropy': 54.95591735839844, 'objective/non_score_reward': -4.3308892250061035, 'objective/rlhf_reward': -6.160981178283691, 'objective/scores': -1.830092191696167, 'policy/approxkl_avg': 0.018419817090034485, 'policy/clipfrac_avg': 0.08726415783166885, 'loss/policy_avg': -0.03124994970858097, 'loss/value_avg': 0.5335387587547302, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8863554000854492, 'val/ratio': 0.9742111563682556, 'val/ratio_var': 0.0004536832857411355, 'val/num_eos_tokens': 0, 'lr': 1.4723174914257718e-05, 'episode': 5764, 'epoch': 0.71}



                                                                         
 71%|███████   | 1442/2041 [2:05:07<51:33,  5.16s/it]

{'eps': 0, 'objective/kl': 102.52268981933594, 'objective/entropy': 41.27435302734375, 'objective/non_score_reward': -5.126134872436523, 'objective/rlhf_reward': -7.264927864074707, 'objective/scores': -2.1387929916381836, 'policy/approxkl_avg': 0.012170820496976376, 'policy/clipfrac_avg': 0.08254717290401459, 'loss/policy_avg': -0.03634867072105408, 'loss/value_avg': 0.5145419836044312, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8929307460784912, 'val/ratio': 0.9964842200279236, 'val/ratio_var': 8.428002729488071e-06, 'val/num_eos_tokens': 0, 'lr': 1.4698677119059284e-05, 'episode': 5768, 'epoch': 0.71}



                                                                         
 71%|███████   | 1443/2041 [2:05:12<51:26,  5.16s/it]

{'eps': 0, 'objective/kl': 106.78096771240234, 'objective/entropy': 56.95713424682617, 'objective/non_score_reward': -5.339048385620117, 'objective/rlhf_reward': -7.007759094238281, 'objective/scores': -1.6687108278274536, 'policy/approxkl_avg': 0.007627634797245264, 'policy/clipfrac_avg': 0.06485848873853683, 'loss/policy_avg': -0.030587993562221527, 'loss/value_avg': 0.7658128142356873, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0287179946899414, 'val/ratio': 0.992659330368042, 'val/ratio_var': 3.6142471799394116e-05, 'val/num_eos_tokens': 0, 'lr': 1.4674179323860854e-05, 'episode': 5772, 'epoch': 0.71}



                                                                         
 71%|███████   | 1444/2041 [2:05:17<51:14,  5.15s/it]

{'eps': 0, 'objective/kl': 94.22402954101562, 'objective/entropy': 55.366416931152344, 'objective/non_score_reward': -4.7112016677856445, 'objective/rlhf_reward': -6.853672981262207, 'objective/scores': -2.1424713134765625, 'policy/approxkl_avg': 0.012527143582701683, 'policy/clipfrac_avg': 0.09433962404727936, 'loss/policy_avg': -0.03383718803524971, 'loss/value_avg': 0.5174983739852905, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8819327354431152, 'val/ratio': 1.0166008472442627, 'val/ratio_var': 0.00026240048464387655, 'val/num_eos_tokens': 0, 'lr': 1.464968152866242e-05, 'episode': 5776, 'epoch': 0.71}



                                                                         
 71%|███████   | 1445/2041 [2:05:22<50:57,  5.13s/it]

{'eps': 0, 'objective/kl': 101.34676361083984, 'objective/entropy': 51.244468688964844, 'objective/non_score_reward': -5.067337989807129, 'objective/rlhf_reward': -7.212762355804443, 'objective/scores': -2.1454243659973145, 'policy/approxkl_avg': 0.011952065862715244, 'policy/clipfrac_avg': 0.08962263911962509, 'loss/policy_avg': -0.03634486347436905, 'loss/value_avg': 0.5178616046905518, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9592466354370117, 'val/ratio': 1.000728726387024, 'val/ratio_var': 3.5627433589979773e-06, 'val/num_eos_tokens': 0, 'lr': 1.462518373346399e-05, 'episode': 5780, 'epoch': 0.71}



                                                                         
 71%|███████   | 1446/2041 [2:05:27<50:47,  5.12s/it]

{'eps': 0, 'objective/kl': 91.25495147705078, 'objective/entropy': 50.00184631347656, 'objective/non_score_reward': -4.562747955322266, 'objective/rlhf_reward': -7.060436248779297, 'objective/scores': -2.4976882934570312, 'policy/approxkl_avg': 0.008779325522482395, 'policy/clipfrac_avg': 0.06485848873853683, 'loss/policy_avg': -0.02913706935942173, 'loss/value_avg': 0.37293508648872375, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9907208681106567, 'val/ratio': 0.9866939783096313, 'val/ratio_var': 0.00012052689999109134, 'val/num_eos_tokens': 0, 'lr': 1.4600685938265558e-05, 'episode': 5784, 'epoch': 0.71}



                                                                         
 71%|███████   | 1447/2041 [2:05:32<50:43,  5.12s/it]

{'eps': 0, 'objective/kl': 89.25933837890625, 'objective/entropy': 56.83781051635742, 'objective/non_score_reward': -4.4629669189453125, 'objective/rlhf_reward': -6.046846389770508, 'objective/scores': -1.5838797092437744, 'policy/approxkl_avg': 0.01294020377099514, 'policy/clipfrac_avg': 0.09080188721418381, 'loss/policy_avg': -0.038735199719667435, 'loss/value_avg': 0.47438734769821167, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0215164422988892, 'val/ratio': 0.9932807683944702, 'val/ratio_var': 2.5922521672328003e-05, 'val/num_eos_tokens': 0, 'lr': 1.4576188143067124e-05, 'episode': 5788, 'epoch': 0.71}



                                                                         
 71%|███████   | 1448/2041 [2:05:37<50:39,  5.13s/it]

{'eps': 0, 'objective/kl': 98.087890625, 'objective/entropy': 50.70321273803711, 'objective/non_score_reward': -4.904394626617432, 'objective/rlhf_reward': -6.8624467849731445, 'objective/scores': -1.9580522775650024, 'policy/approxkl_avg': 0.010572351515293121, 'policy/clipfrac_avg': 0.08844339847564697, 'loss/policy_avg': -0.03518311679363251, 'loss/value_avg': 0.52560955286026, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.952434778213501, 'val/ratio': 0.9891065955162048, 'val/ratio_var': 0.00011308168905088678, 'val/num_eos_tokens': 0, 'lr': 1.4551690347868694e-05, 'episode': 5792, 'epoch': 0.71}



                                                                         
 71%|███████   | 1449/2041 [2:05:42<50:39,  5.13s/it]

{'eps': 0, 'objective/kl': 88.86280059814453, 'objective/entropy': 33.80999755859375, 'objective/non_score_reward': -4.443140029907227, 'objective/rlhf_reward': -6.13300895690918, 'objective/scores': -1.6898690462112427, 'policy/approxkl_avg': 0.027649980038404465, 'policy/clipfrac_avg': 0.07311320304870605, 'loss/policy_avg': -0.027948955073952675, 'loss/value_avg': 0.3701499402523041, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7144448161125183, 'val/ratio': 1.0640469789505005, 'val/ratio_var': 0.0022508176043629646, 'val/num_eos_tokens': 0, 'lr': 1.452719255267026e-05, 'episode': 5796, 'epoch': 0.71}



                                                                         
 71%|███████   | 1450/2041 [2:05:48<50:38,  5.14s/it]

{'eps': 0, 'objective/kl': 94.99089813232422, 'objective/entropy': 46.457130432128906, 'objective/non_score_reward': -4.749545097351074, 'objective/rlhf_reward': -6.1576995849609375, 'objective/scores': -1.4081542491912842, 'policy/approxkl_avg': 0.014153939671814442, 'policy/clipfrac_avg': 0.08254717290401459, 'loss/policy_avg': -0.03153764829039574, 'loss/value_avg': 0.44059526920318604, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8551437854766846, 'val/ratio': 0.9876309037208557, 'val/ratio_var': 9.077533468371257e-05, 'val/num_eos_tokens': 0, 'lr': 1.4502694757471827e-05, 'episode': 5800, 'epoch': 0.71}



                                                                         
 71%|███████   | 1451/2041 [2:05:53<50:34,  5.14s/it]

{'eps': 0, 'objective/kl': 103.69883728027344, 'objective/entropy': 58.383602142333984, 'objective/non_score_reward': -5.18494176864624, 'objective/rlhf_reward': -7.298463821411133, 'objective/scores': -2.1135220527648926, 'policy/approxkl_avg': 0.013034430332481861, 'policy/clipfrac_avg': 0.09198113530874252, 'loss/policy_avg': -0.035014014691114426, 'loss/value_avg': 0.7649235129356384, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0558758974075317, 'val/ratio': 1.0058956146240234, 'val/ratio_var': 3.7430349038913846e-05, 'val/num_eos_tokens': 0, 'lr': 1.4478196962273396e-05, 'episode': 5804, 'epoch': 0.71}



                                                                         
 71%|███████   | 1452/2041 [2:05:58<50:35,  5.15s/it]

{'eps': 0, 'objective/kl': 89.16189575195312, 'objective/entropy': 47.352256774902344, 'objective/non_score_reward': -4.458094596862793, 'objective/rlhf_reward': -6.061026573181152, 'objective/scores': -1.6029322147369385, 'policy/approxkl_avg': 0.021180106326937675, 'policy/clipfrac_avg': 0.0837264135479927, 'loss/policy_avg': -0.030111949890851974, 'loss/value_avg': 0.2895256280899048, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.866193413734436, 'val/ratio': 0.9844306707382202, 'val/ratio_var': 0.00018672562146093696, 'val/num_eos_tokens': 0, 'lr': 1.4453699167074963e-05, 'episode': 5808, 'epoch': 0.71}



                                                                         
 71%|███████   | 1453/2041 [2:06:03<50:38,  5.17s/it]

{'eps': 0, 'objective/kl': 91.118896484375, 'objective/entropy': 41.870758056640625, 'objective/non_score_reward': -4.555944919586182, 'objective/rlhf_reward': -6.484763145446777, 'objective/scores': -1.9288179874420166, 'policy/approxkl_avg': 0.030641887336969376, 'policy/clipfrac_avg': 0.09080187976360321, 'loss/policy_avg': -0.03132122382521629, 'loss/value_avg': 0.5125657916069031, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7944713830947876, 'val/ratio': 1.0175292491912842, 'val/ratio_var': 0.00024477342958562076, 'val/num_eos_tokens': 0, 'lr': 1.4429201371876532e-05, 'episode': 5812, 'epoch': 0.71}



                                                                         
 71%|███████   | 1454/2041 [2:06:08<50:15,  5.14s/it]

{'eps': 0, 'objective/kl': 106.9085464477539, 'objective/entropy': 58.612022399902344, 'objective/non_score_reward': -5.345427513122559, 'objective/rlhf_reward': -6.535968780517578, 'objective/scores': -1.190541386604309, 'policy/approxkl_avg': 0.01436996553093195, 'policy/clipfrac_avg': 0.09080187976360321, 'loss/policy_avg': -0.034819286316633224, 'loss/value_avg': 0.600191593170166, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9704576134681702, 'val/ratio': 0.972628116607666, 'val/ratio_var': 0.0005496221128851175, 'val/num_eos_tokens': 0, 'lr': 1.44047035766781e-05, 'episode': 5816, 'epoch': 0.71}



                                                                         
 71%|███████▏  | 1455/2041 [2:06:13<49:53,  5.11s/it]

{'eps': 0, 'objective/kl': 87.29252624511719, 'objective/entropy': 37.81439971923828, 'objective/non_score_reward': -4.364626407623291, 'objective/rlhf_reward': -5.8244242668151855, 'objective/scores': -1.4597978591918945, 'policy/approxkl_avg': 0.00543517991900444, 'policy/clipfrac_avg': 0.051886796951293945, 'loss/policy_avg': -0.0232381634414196, 'loss/value_avg': 0.40318042039871216, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7241880297660828, 'val/ratio': 1.0020142793655396, 'val/ratio_var': 2.6385423552710563e-06, 'val/num_eos_tokens': 0, 'lr': 1.4380205781479667e-05, 'episode': 5820, 'epoch': 0.71}



                                                                         
 71%|███████▏  | 1456/2041 [2:06:18<49:49,  5.11s/it]

{'eps': 0, 'objective/kl': 101.32246398925781, 'objective/entropy': 45.041114807128906, 'objective/non_score_reward': -5.0661234855651855, 'objective/rlhf_reward': -7.118044853210449, 'objective/scores': -2.0519213676452637, 'policy/approxkl_avg': 0.005390619393438101, 'policy/clipfrac_avg': 0.0554245263338089, 'loss/policy_avg': -0.027326537296175957, 'loss/value_avg': 0.5224159955978394, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9182395935058594, 'val/ratio': 0.9916249513626099, 'val/ratio_var': 6.0589529311982915e-05, 'val/num_eos_tokens': 0, 'lr': 1.4355707986281237e-05, 'episode': 5824, 'epoch': 0.71}



                                                                         
 71%|███████▏  | 1457/2041 [2:06:24<49:59,  5.14s/it]

{'eps': 0, 'objective/kl': 101.52906799316406, 'objective/entropy': 56.78984069824219, 'objective/non_score_reward': -5.07645320892334, 'objective/rlhf_reward': -6.766077041625977, 'objective/scores': -1.6896239519119263, 'policy/approxkl_avg': 0.00670778751373291, 'policy/clipfrac_avg': 0.06367924064397812, 'loss/policy_avg': -0.028580376878380775, 'loss/value_avg': 0.42900168895721436, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.02987539768219, 'val/ratio': 0.9871026873588562, 'val/ratio_var': 0.00013153506733942777, 'val/num_eos_tokens': 0, 'lr': 1.4331210191082803e-05, 'episode': 5828, 'epoch': 0.71}



                                                                         
 71%|███████▏  | 1458/2041 [2:06:29<49:52,  5.13s/it]

{'eps': 0, 'objective/kl': 101.55397033691406, 'objective/entropy': 70.4472427368164, 'objective/non_score_reward': -5.077698230743408, 'objective/rlhf_reward': -6.811431884765625, 'objective/scores': -1.7337334156036377, 'policy/approxkl_avg': 0.013877053745090961, 'policy/clipfrac_avg': 0.07547169923782349, 'loss/policy_avg': -0.03144211322069168, 'loss/value_avg': 0.5921316146850586, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2593398094177246, 'val/ratio': 0.9811751842498779, 'val/ratio_var': 0.00023940247774589807, 'val/num_eos_tokens': 0, 'lr': 1.4306712395884373e-05, 'episode': 5832, 'epoch': 0.71}



                                                                         
 71%|███████▏  | 1459/2041 [2:06:34<49:45,  5.13s/it]

{'eps': 0, 'objective/kl': 94.69843292236328, 'objective/entropy': 59.377235412597656, 'objective/non_score_reward': -4.734921455383301, 'objective/rlhf_reward': -5.810786247253418, 'objective/scores': -1.0758649110794067, 'policy/approxkl_avg': 0.0051638660952448845, 'policy/clipfrac_avg': 0.04952830448746681, 'loss/policy_avg': -0.026200799271464348, 'loss/value_avg': 0.36606669425964355, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.081095814704895, 'val/ratio': 0.9956670999526978, 'val/ratio_var': 1.9756587789743207e-05, 'val/num_eos_tokens': 0, 'lr': 1.4282214600685939e-05, 'episode': 5836, 'epoch': 0.71}



                                                                         
 72%|███████▏  | 1460/2041 [2:06:39<49:28,  5.11s/it]

{'eps': 0, 'objective/kl': 98.38129425048828, 'objective/entropy': 47.59471893310547, 'objective/non_score_reward': -4.919065475463867, 'objective/rlhf_reward': -6.746950149536133, 'objective/scores': -1.8278844356536865, 'policy/approxkl_avg': 0.008119439706206322, 'policy/clipfrac_avg': 0.06721697747707367, 'loss/policy_avg': -0.027413271367549896, 'loss/value_avg': 0.5193413496017456, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.952542245388031, 'val/ratio': 0.9846645593643188, 'val/ratio_var': 0.0001979415537789464, 'val/num_eos_tokens': 0, 'lr': 1.4257716805487505e-05, 'episode': 5840, 'epoch': 0.72}



                                                                         
 72%|███████▏  | 1461/2041 [2:06:44<49:37,  5.13s/it]

{'eps': 0, 'objective/kl': 102.6602554321289, 'objective/entropy': 64.1618881225586, 'objective/non_score_reward': -5.1330132484436035, 'objective/rlhf_reward': -6.758780479431152, 'objective/scores': -1.6257671117782593, 'policy/approxkl_avg': 0.007760718930512667, 'policy/clipfrac_avg': 0.0766509398818016, 'loss/policy_avg': -0.03099876083433628, 'loss/value_avg': 0.43298232555389404, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.047989845275879, 'val/ratio': 0.9900094270706177, 'val/ratio_var': 6.786506128264591e-05, 'val/num_eos_tokens': 0, 'lr': 1.4233219010289075e-05, 'episode': 5844, 'epoch': 0.72}



                                                                         
 72%|███████▏  | 1462/2041 [2:06:49<49:31,  5.13s/it]

{'eps': 0, 'objective/kl': 107.13992309570312, 'objective/entropy': 58.078773498535156, 'objective/non_score_reward': -5.356996536254883, 'objective/rlhf_reward': -7.206501007080078, 'objective/scores': -1.8495044708251953, 'policy/approxkl_avg': 0.009851628914475441, 'policy/clipfrac_avg': 0.08490566164255142, 'loss/policy_avg': -0.036684561520814896, 'loss/value_avg': 0.5513486862182617, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0388073921203613, 'val/ratio': 0.9991878271102905, 'val/ratio_var': 5.603062049885921e-07, 'val/num_eos_tokens': 0, 'lr': 1.4208721215090643e-05, 'episode': 5848, 'epoch': 0.72}



                                                                         
 72%|███████▏  | 1463/2041 [2:06:54<49:39,  5.16s/it]

{'eps': 0, 'objective/kl': 82.66229248046875, 'objective/entropy': 41.74180603027344, 'objective/non_score_reward': -4.133114814758301, 'objective/rlhf_reward': -5.840973854064941, 'objective/scores': -1.7078588008880615, 'policy/approxkl_avg': 0.008366812020540237, 'policy/clipfrac_avg': 0.0625, 'loss/policy_avg': -0.02773563750088215, 'loss/value_avg': 0.4400661885738373, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7856264710426331, 'val/ratio': 0.9840857982635498, 'val/ratio_var': 0.0002140283613698557, 'val/num_eos_tokens': 0, 'lr': 1.418422341989221e-05, 'episode': 5852, 'epoch': 0.72}



                                                                         
 72%|███████▏  | 1464/2041 [2:07:00<49:35,  5.16s/it]

{'eps': 0, 'objective/kl': 89.26094055175781, 'objective/entropy': 49.77869415283203, 'objective/non_score_reward': -4.463047027587891, 'objective/rlhf_reward': -5.9229278564453125, 'objective/scores': -1.4598805904388428, 'policy/approxkl_avg': 0.006508702877908945, 'policy/clipfrac_avg': 0.06839622557163239, 'loss/policy_avg': -0.03037269413471222, 'loss/value_avg': 0.36864712834358215, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0086572170257568, 'val/ratio': 0.9876582026481628, 'val/ratio_var': 0.00012850544590037316, 'val/num_eos_tokens': 0, 'lr': 1.415972562469378e-05, 'episode': 5856, 'epoch': 0.72}



                                                                         
 72%|███████▏  | 1465/2041 [2:07:05<49:28,  5.15s/it]

{'eps': 0, 'objective/kl': 100.38993072509766, 'objective/entropy': 56.11506271362305, 'objective/non_score_reward': -5.019496917724609, 'objective/rlhf_reward': -6.731021404266357, 'objective/scores': -1.711524486541748, 'policy/approxkl_avg': 0.007192707620561123, 'policy/clipfrac_avg': 0.07547169923782349, 'loss/policy_avg': -0.030919188633561134, 'loss/value_avg': 0.5175796747207642, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9719008207321167, 'val/ratio': 0.9816473126411438, 'val/ratio_var': 0.0002726372331380844, 'val/num_eos_tokens': 0, 'lr': 1.4135227829495346e-05, 'episode': 5860, 'epoch': 0.72}



                                                                         
 72%|███████▏  | 1466/2041 [2:07:10<49:32,  5.17s/it]

{'eps': 0, 'objective/kl': 97.82078552246094, 'objective/entropy': 59.980384826660156, 'objective/non_score_reward': -4.891039848327637, 'objective/rlhf_reward': -7.145751953125, 'objective/scores': -2.2547121047973633, 'policy/approxkl_avg': 0.011073133908212185, 'policy/clipfrac_avg': 0.0837264209985733, 'loss/policy_avg': -0.034040920436382294, 'loss/value_avg': 0.5367600917816162, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9855281114578247, 'val/ratio': 0.9852623343467712, 'val/ratio_var': 0.00016227761807385832, 'val/num_eos_tokens': 0, 'lr': 1.4110730034296915e-05, 'episode': 5864, 'epoch': 0.72}



                                                                         
 72%|███████▏  | 1467/2041 [2:07:15<49:06,  5.13s/it]

{'eps': 0, 'objective/kl': 104.25164794921875, 'objective/entropy': 46.52467346191406, 'objective/non_score_reward': -5.212582111358643, 'objective/rlhf_reward': -7.092785835266113, 'objective/scores': -1.8802036046981812, 'policy/approxkl_avg': 0.006326794158667326, 'policy/clipfrac_avg': 0.06132075563073158, 'loss/policy_avg': -0.022585652768611908, 'loss/value_avg': 0.530820906162262, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8468180894851685, 'val/ratio': 0.9962831735610962, 'val/ratio_var': 9.59051521931542e-06, 'val/num_eos_tokens': 0, 'lr': 1.4086232239098482e-05, 'episode': 5868, 'epoch': 0.72}



                                                                         
 72%|███████▏  | 1468/2041 [2:07:20<49:08,  5.15s/it]

{'eps': 0, 'objective/kl': 79.64701843261719, 'objective/entropy': 28.58983612060547, 'objective/non_score_reward': -3.982351303100586, 'objective/rlhf_reward': -6.582219123840332, 'objective/scores': -2.599867820739746, 'policy/approxkl_avg': 0.004390479531139135, 'policy/clipfrac_avg': 0.04245283082127571, 'loss/policy_avg': -0.021798433735966682, 'loss/value_avg': 0.5868872404098511, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6521344780921936, 'val/ratio': 0.9989410638809204, 'val/ratio_var': 6.398328764589678e-07, 'val/num_eos_tokens': 0, 'lr': 1.4061734443900048e-05, 'episode': 5872, 'epoch': 0.72}



                                                                         
 72%|███████▏  | 1469/2041 [2:07:25<49:17,  5.17s/it]

{'eps': 0, 'objective/kl': 92.24699401855469, 'objective/entropy': 48.11475372314453, 'objective/non_score_reward': -4.612349510192871, 'objective/rlhf_reward': -6.92948579788208, 'objective/scores': -2.317136287689209, 'policy/approxkl_avg': 0.009269044734537601, 'policy/clipfrac_avg': 0.08136792480945587, 'loss/policy_avg': -0.03205358237028122, 'loss/value_avg': 0.537127673625946, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9112796783447266, 'val/ratio': 1.004791021347046, 'val/ratio_var': 3.3076616091420874e-05, 'val/num_eos_tokens': 0, 'lr': 1.4037236648701618e-05, 'episode': 5876, 'epoch': 0.72}



                                                                         
 72%|███████▏  | 1470/2041 [2:07:31<49:26,  5.19s/it]

{'eps': 0, 'objective/kl': 106.50138092041016, 'objective/entropy': 45.89312744140625, 'objective/non_score_reward': -5.325068950653076, 'objective/rlhf_reward': -7.499748706817627, 'objective/scores': -2.174679756164551, 'policy/approxkl_avg': 0.017998095601797104, 'policy/clipfrac_avg': 0.07429245859384537, 'loss/policy_avg': -0.031569208949804306, 'loss/value_avg': 0.7172600030899048, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8651098012924194, 'val/ratio': 0.9783385992050171, 'val/ratio_var': 0.00037061251350678504, 'val/num_eos_tokens': 0, 'lr': 1.4012738853503186e-05, 'episode': 5880, 'epoch': 0.72}



                                                                         
 72%|███████▏  | 1471/2041 [2:07:36<49:14,  5.18s/it]

{'eps': 0, 'objective/kl': 89.22767639160156, 'objective/entropy': 38.160858154296875, 'objective/non_score_reward': -4.461383819580078, 'objective/rlhf_reward': -5.464412212371826, 'objective/scores': -1.0030285120010376, 'policy/approxkl_avg': 0.008117985911667347, 'policy/clipfrac_avg': 0.07311320304870605, 'loss/policy_avg': -0.025946369394659996, 'loss/value_avg': 0.5751218795776367, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7255799174308777, 'val/ratio': 0.979769229888916, 'val/ratio_var': 0.0003380371490493417, 'val/num_eos_tokens': 0, 'lr': 1.3988241058304754e-05, 'episode': 5884, 'epoch': 0.72}



                                                                         
 72%|███████▏  | 1472/2041 [2:07:41<48:55,  5.16s/it]

{'eps': 0, 'objective/kl': 87.6885986328125, 'objective/entropy': 53.10704803466797, 'objective/non_score_reward': -4.384429931640625, 'objective/rlhf_reward': -5.600970268249512, 'objective/scores': -1.2165400981903076, 'policy/approxkl_avg': 0.006894347257912159, 'policy/clipfrac_avg': 0.061320751905441284, 'loss/policy_avg': -0.02783133275806904, 'loss/value_avg': 0.41466206312179565, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8353109955787659, 'val/ratio': 0.9839958548545837, 'val/ratio_var': 0.00020524229330476373, 'val/num_eos_tokens': 0, 'lr': 1.3963743263106322e-05, 'episode': 5888, 'epoch': 0.72}



                                                                         
 72%|███████▏  | 1473/2041 [2:07:46<48:42,  5.14s/it]

{'eps': 0, 'objective/kl': 77.55708312988281, 'objective/entropy': 36.477272033691406, 'objective/non_score_reward': -3.8778538703918457, 'objective/rlhf_reward': -5.6795454025268555, 'objective/scores': -1.8016916513442993, 'policy/approxkl_avg': 0.008967717178165913, 'policy/clipfrac_avg': 0.06839622557163239, 'loss/policy_avg': -0.025609329342842102, 'loss/value_avg': 0.46433451771736145, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.666756272315979, 'val/ratio': 0.9828444123268127, 'val/ratio_var': 0.00021447842300403863, 'val/num_eos_tokens': 0, 'lr': 1.3939245467907888e-05, 'episode': 5892, 'epoch': 0.72}



                                                                         
 72%|███████▏  | 1474/2041 [2:07:51<48:34,  5.14s/it]

{'eps': 0, 'objective/kl': 95.84107208251953, 'objective/entropy': 48.18832015991211, 'objective/non_score_reward': -4.79205322265625, 'objective/rlhf_reward': -6.616701126098633, 'objective/scores': -1.8246479034423828, 'policy/approxkl_avg': 0.004809168167412281, 'policy/clipfrac_avg': 0.04716981202363968, 'loss/policy_avg': -0.02643943391740322, 'loss/value_avg': 0.4870131015777588, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8640744686126709, 'val/ratio': 0.9863635301589966, 'val/ratio_var': 0.0001661724381847307, 'val/num_eos_tokens': 0, 'lr': 1.3914747672709458e-05, 'episode': 5896, 'epoch': 0.72}



                                                                         
 72%|███████▏  | 1475/2041 [2:07:56<48:25,  5.13s/it]

{'eps': 0, 'objective/kl': 82.1905517578125, 'objective/entropy': 47.85688018798828, 'objective/non_score_reward': -4.109527587890625, 'objective/rlhf_reward': -5.673648357391357, 'objective/scores': -1.5641207695007324, 'policy/approxkl_avg': 0.011947055347263813, 'policy/clipfrac_avg': 0.07900943607091904, 'loss/policy_avg': -0.027719371020793915, 'loss/value_avg': 0.522305428981781, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7154353857040405, 'val/ratio': 0.993337869644165, 'val/ratio_var': 3.5770855902228504e-05, 'val/num_eos_tokens': 0, 'lr': 1.3890249877511024e-05, 'episode': 5900, 'epoch': 0.72}



                                                                         
 72%|███████▏  | 1476/2041 [2:08:01<48:21,  5.13s/it]

{'eps': 0, 'objective/kl': 105.14295959472656, 'objective/entropy': 37.173004150390625, 'objective/non_score_reward': -5.257147789001465, 'objective/rlhf_reward': -7.3248090744018555, 'objective/scores': -2.0676610469818115, 'policy/approxkl_avg': 0.0059553105384111404, 'policy/clipfrac_avg': 0.05896226316690445, 'loss/policy_avg': -0.026591859757900238, 'loss/value_avg': 0.6189606189727783, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7877047657966614, 'val/ratio': 0.9780977368354797, 'val/ratio_var': 0.0004002659989055246, 'val/num_eos_tokens': 0, 'lr': 1.386575208231259e-05, 'episode': 5904, 'epoch': 0.72}



                                                                         
 72%|███████▏  | 1477/2041 [2:08:07<48:27,  5.16s/it]

{'eps': 0, 'objective/kl': 105.57952880859375, 'objective/entropy': 51.40394592285156, 'objective/non_score_reward': -5.2789764404296875, 'objective/rlhf_reward': -6.642530918121338, 'objective/scores': -1.3635544776916504, 'policy/approxkl_avg': 0.013523319736123085, 'policy/clipfrac_avg': 0.09080188721418381, 'loss/policy_avg': -0.036101557314395905, 'loss/value_avg': 0.5633895397186279, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8137308955192566, 'val/ratio': 0.9868103861808777, 'val/ratio_var': 0.0001060046415659599, 'val/num_eos_tokens': 0, 'lr': 1.384125428711416e-05, 'episode': 5908, 'epoch': 0.72}



                                                                         
 72%|███████▏  | 1478/2041 [2:08:12<48:36,  5.18s/it]

{'eps': 0, 'objective/kl': 91.43504333496094, 'objective/entropy': 32.48375701904297, 'objective/non_score_reward': -4.571752071380615, 'objective/rlhf_reward': -6.690307140350342, 'objective/scores': -2.1185550689697266, 'policy/approxkl_avg': 0.005956889595836401, 'policy/clipfrac_avg': 0.044811323285102844, 'loss/policy_avg': -0.023988734930753708, 'loss/value_avg': 0.33942362666130066, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6657983660697937, 'val/ratio': 0.9914278984069824, 'val/ratio_var': 6.32016672170721e-05, 'val/num_eos_tokens': 0, 'lr': 1.3816756491915728e-05, 'episode': 5912, 'epoch': 0.72}



                                                                         
 72%|███████▏  | 1479/2041 [2:08:17<48:16,  5.15s/it]

{'eps': 0, 'objective/kl': 103.68869018554688, 'objective/entropy': 43.892181396484375, 'objective/non_score_reward': -5.18443489074707, 'objective/rlhf_reward': -7.56925630569458, 'objective/scores': -2.3848214149475098, 'policy/approxkl_avg': 0.008732328191399574, 'policy/clipfrac_avg': 0.08490565419197083, 'loss/policy_avg': -0.03414345532655716, 'loss/value_avg': 0.5949584245681763, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8185445070266724, 'val/ratio': 0.9835033416748047, 'val/ratio_var': 0.00020046376448590308, 'val/num_eos_tokens': 0, 'lr': 1.3792258696717298e-05, 'episode': 5916, 'epoch': 0.72}



                                                                         
 73%|███████▎  | 1480/2041 [2:08:22<48:24,  5.18s/it]

{'eps': 0, 'objective/kl': 94.04722595214844, 'objective/entropy': 41.11039733886719, 'objective/non_score_reward': -4.702361583709717, 'objective/rlhf_reward': -6.902702331542969, 'objective/scores': -2.200340986251831, 'policy/approxkl_avg': 0.009040161967277527, 'policy/clipfrac_avg': 0.06839622557163239, 'loss/policy_avg': -0.026882685720920563, 'loss/value_avg': 0.45104801654815674, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7897786498069763, 'val/ratio': 0.9884867072105408, 'val/ratio_var': 9.91166234598495e-05, 'val/num_eos_tokens': 0, 'lr': 1.3767760901518865e-05, 'episode': 5920, 'epoch': 0.73}



                                                                         
 73%|███████▎  | 1481/2041 [2:08:27<48:35,  5.21s/it]

{'eps': 0, 'objective/kl': 89.54535675048828, 'objective/entropy': 37.09296798706055, 'objective/non_score_reward': -4.477267742156982, 'objective/rlhf_reward': -6.040752410888672, 'objective/scores': -1.5634846687316895, 'policy/approxkl_avg': 0.009680007584393024, 'policy/clipfrac_avg': 0.06603773683309555, 'loss/policy_avg': -0.027537979185581207, 'loss/value_avg': 0.4922664761543274, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.763562798500061, 'val/ratio': 0.986099362373352, 'val/ratio_var': 0.00014975610247347504, 'val/num_eos_tokens': 0, 'lr': 1.3743263106320431e-05, 'episode': 5924, 'epoch': 0.73}



                                                                         
 73%|███████▎  | 1482/2041 [2:08:33<48:42,  5.23s/it]

{'eps': 0, 'objective/kl': 85.2162857055664, 'objective/entropy': 41.857704162597656, 'objective/non_score_reward': -4.260814666748047, 'objective/rlhf_reward': -6.224311828613281, 'objective/scores': -1.9634971618652344, 'policy/approxkl_avg': 0.007164954207837582, 'policy/clipfrac_avg': 0.053066037595272064, 'loss/policy_avg': -0.02805359847843647, 'loss/value_avg': 0.4640503525733948, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8417907357215881, 'val/ratio': 0.9957208037376404, 'val/ratio_var': 1.0714051313698292e-05, 'val/num_eos_tokens': 0, 'lr': 1.3718765311122e-05, 'episode': 5928, 'epoch': 0.73}



                                                                         
 73%|███████▎  | 1483/2041 [2:08:38<48:34,  5.22s/it]

{'eps': 0, 'objective/kl': 114.7538070678711, 'objective/entropy': 52.42469787597656, 'objective/non_score_reward': -5.737689971923828, 'objective/rlhf_reward': -6.564825057983398, 'objective/scores': -0.8271350860595703, 'policy/approxkl_avg': 0.009421228431165218, 'policy/clipfrac_avg': 0.07311321049928665, 'loss/policy_avg': -0.03201545402407646, 'loss/value_avg': 0.5556476712226868, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8752643465995789, 'val/ratio': 0.9761918783187866, 'val/ratio_var': 0.00047126170829869807, 'val/num_eos_tokens': 0, 'lr': 1.3694267515923567e-05, 'episode': 5932, 'epoch': 0.73}



                                                                         
 73%|███████▎  | 1484/2041 [2:08:43<48:16,  5.20s/it]

{'eps': 0, 'objective/kl': 93.54568481445312, 'objective/entropy': 52.385841369628906, 'objective/non_score_reward': -4.6772847175598145, 'objective/rlhf_reward': -6.366788864135742, 'objective/scores': -1.6895040273666382, 'policy/approxkl_avg': 0.011271726340055466, 'policy/clipfrac_avg': 0.07075472176074982, 'loss/policy_avg': -0.029548000544309616, 'loss/value_avg': 0.4513399302959442, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9908871054649353, 'val/ratio': 0.9798084497451782, 'val/ratio_var': 0.00030631941626779735, 'val/num_eos_tokens': 0, 'lr': 1.3669769720725133e-05, 'episode': 5936, 'epoch': 0.73}



                                                                         
 73%|███████▎  | 1485/2041 [2:08:48<48:02,  5.18s/it]

{'eps': 0, 'objective/kl': 82.67611694335938, 'objective/entropy': 41.56876754760742, 'objective/non_score_reward': -4.133806228637695, 'objective/rlhf_reward': -6.341218948364258, 'objective/scores': -2.2074127197265625, 'policy/approxkl_avg': 0.005798152182251215, 'policy/clipfrac_avg': 0.049528300762176514, 'loss/policy_avg': -0.02298315055668354, 'loss/value_avg': 0.3737248182296753, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7732474207878113, 'val/ratio': 0.981940507888794, 'val/ratio_var': 0.00028413505060598254, 'val/num_eos_tokens': 0, 'lr': 1.3645271925526703e-05, 'episode': 5940, 'epoch': 0.73}



                                                                         
 73%|███████▎  | 1486/2041 [2:08:53<47:53,  5.18s/it]

{'eps': 0, 'objective/kl': 82.80747985839844, 'objective/entropy': 52.852718353271484, 'objective/non_score_reward': -4.140374183654785, 'objective/rlhf_reward': -6.5152997970581055, 'objective/scores': -2.374925374984741, 'policy/approxkl_avg': 0.007371873594820499, 'policy/clipfrac_avg': 0.06839622557163239, 'loss/policy_avg': -0.031743839383125305, 'loss/value_avg': 0.3893619775772095, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9182256460189819, 'val/ratio': 0.9881583452224731, 'val/ratio_var': 0.00010605928400764242, 'val/num_eos_tokens': 0, 'lr': 1.3620774130328271e-05, 'episode': 5944, 'epoch': 0.73}



                                                                         
 73%|███████▎  | 1487/2041 [2:08:58<47:45,  5.17s/it]

{'eps': 0, 'objective/kl': 75.80618286132812, 'objective/entropy': 30.660202026367188, 'objective/non_score_reward': -3.790308952331543, 'objective/rlhf_reward': -5.899306297302246, 'objective/scores': -2.108997106552124, 'policy/approxkl_avg': 0.006407410837709904, 'policy/clipfrac_avg': 0.049528300762176514, 'loss/policy_avg': -0.024096710607409477, 'loss/value_avg': 0.25515472888946533, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7325041890144348, 'val/ratio': 0.9841970205307007, 'val/ratio_var': 0.00021933276730123907, 'val/num_eos_tokens': 0, 'lr': 1.359627633512984e-05, 'episode': 5948, 'epoch': 0.73}



                                                                         
 73%|███████▎  | 1488/2041 [2:09:04<47:48,  5.19s/it]

{'eps': 0, 'objective/kl': 98.3410873413086, 'objective/entropy': 73.81102752685547, 'objective/non_score_reward': -4.917054653167725, 'objective/rlhf_reward': -6.3055877685546875, 'objective/scores': -1.3885328769683838, 'policy/approxkl_avg': 0.010350381955504417, 'policy/clipfrac_avg': 0.08136792480945587, 'loss/policy_avg': -0.03719502314925194, 'loss/value_avg': 0.4459872543811798, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2568304538726807, 'val/ratio': 0.99442058801651, 'val/ratio_var': 2.4794329874566756e-05, 'val/num_eos_tokens': 0, 'lr': 1.3571778539931407e-05, 'episode': 5952, 'epoch': 0.73}



                                                                         
 73%|███████▎  | 1489/2041 [2:09:09<47:45,  5.19s/it]

{'eps': 0, 'objective/kl': 97.96943664550781, 'objective/entropy': 54.772789001464844, 'objective/non_score_reward': -4.898472309112549, 'objective/rlhf_reward': -6.2919111251831055, 'objective/scores': -1.3934385776519775, 'policy/approxkl_avg': 0.013981025665998459, 'policy/clipfrac_avg': 0.08136792480945587, 'loss/policy_avg': -0.03453996777534485, 'loss/value_avg': 0.47561126947402954, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0020461082458496, 'val/ratio': 0.9798711538314819, 'val/ratio_var': 0.0002814891922753304, 'val/num_eos_tokens': 0, 'lr': 1.3547280744732974e-05, 'episode': 5956, 'epoch': 0.73}



                                                                         
 73%|███████▎  | 1490/2041 [2:09:14<47:31,  5.18s/it]

{'eps': 0, 'objective/kl': 89.55486297607422, 'objective/entropy': 38.96006774902344, 'objective/non_score_reward': -4.477743148803711, 'objective/rlhf_reward': -5.606908798217773, 'objective/scores': -1.129165530204773, 'policy/approxkl_avg': 0.008438223972916603, 'policy/clipfrac_avg': 0.06839622557163239, 'loss/policy_avg': -0.026884574443101883, 'loss/value_avg': 0.4591209888458252, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7475529909133911, 'val/ratio': 0.9907746315002441, 'val/ratio_var': 6.105079228291288e-05, 'val/num_eos_tokens': 0, 'lr': 1.3522782949534543e-05, 'episode': 5960, 'epoch': 0.73}



                                                                         
 73%|███████▎  | 1491/2041 [2:09:19<47:14,  5.15s/it]

{'eps': 0, 'objective/kl': 96.46644592285156, 'objective/entropy': 53.08652877807617, 'objective/non_score_reward': -4.823322296142578, 'objective/rlhf_reward': -6.173783302307129, 'objective/scores': -1.3504607677459717, 'policy/approxkl_avg': 0.009639101102948189, 'policy/clipfrac_avg': 0.056603770703077316, 'loss/policy_avg': -0.026751937344670296, 'loss/value_avg': 0.4646369516849518, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0798799991607666, 'val/ratio': 0.9752941131591797, 'val/ratio_var': 0.00048216161667369306, 'val/num_eos_tokens': 0, 'lr': 1.349828515433611e-05, 'episode': 5964, 'epoch': 0.73}



                                                                         
 73%|███████▎  | 1492/2041 [2:09:24<47:05,  5.15s/it]

{'eps': 0, 'objective/kl': 73.28314208984375, 'objective/entropy': 37.81350326538086, 'objective/non_score_reward': -3.664156913757324, 'objective/rlhf_reward': -5.2706475257873535, 'objective/scores': -1.6064906120300293, 'policy/approxkl_avg': 0.005881319288164377, 'policy/clipfrac_avg': 0.0625, 'loss/policy_avg': -0.025268064811825752, 'loss/value_avg': 0.24728471040725708, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8058007955551147, 'val/ratio': 0.9945403337478638, 'val/ratio_var': 1.7019028746290132e-05, 'val/num_eos_tokens': 0, 'lr': 1.347378735913768e-05, 'episode': 5968, 'epoch': 0.73}



                                                                         
 73%|███████▎  | 1493/2041 [2:09:29<46:43,  5.12s/it]

{'eps': 0, 'objective/kl': 91.311279296875, 'objective/entropy': 42.879173278808594, 'objective/non_score_reward': -4.565564155578613, 'objective/rlhf_reward': -6.131601333618164, 'objective/scores': -1.5660371780395508, 'policy/approxkl_avg': 0.005727799143642187, 'policy/clipfrac_avg': 0.060141511261463165, 'loss/policy_avg': -0.02385684847831726, 'loss/value_avg': 0.4217495024204254, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.802790641784668, 'val/ratio': 0.9881576299667358, 'val/ratio_var': 0.00012086174683645368, 'val/num_eos_tokens': 0, 'lr': 1.3449289563939246e-05, 'episode': 5972, 'epoch': 0.73}



                                                                         
 73%|███████▎  | 1494/2041 [2:09:34<46:43,  5.12s/it]

{'eps': 0, 'objective/kl': 90.39320373535156, 'objective/entropy': 46.621116638183594, 'objective/non_score_reward': -4.519659996032715, 'objective/rlhf_reward': -6.301466464996338, 'objective/scores': -1.781806468963623, 'policy/approxkl_avg': 0.010098683647811413, 'policy/clipfrac_avg': 0.08962264657020569, 'loss/policy_avg': -0.033787913620471954, 'loss/value_avg': 0.38423675298690796, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.899702250957489, 'val/ratio': 0.9837323427200317, 'val/ratio_var': 0.00021066004410386086, 'val/num_eos_tokens': 0, 'lr': 1.3424791768740814e-05, 'episode': 5976, 'epoch': 0.73}



                                                                         
 73%|███████▎  | 1495/2041 [2:09:40<46:37,  5.12s/it]

{'eps': 0, 'objective/kl': 94.85919952392578, 'objective/entropy': 43.24465560913086, 'objective/non_score_reward': -4.742959976196289, 'objective/rlhf_reward': -7.125298500061035, 'objective/scores': -2.382338762283325, 'policy/approxkl_avg': 0.009523038752377033, 'policy/clipfrac_avg': 0.0695754736661911, 'loss/policy_avg': -0.028527308255434036, 'loss/value_avg': 0.6088840365409851, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0157619714736938, 'val/ratio': 0.9825371503829956, 'val/ratio_var': 0.00021868616749998182, 'val/num_eos_tokens': 0, 'lr': 1.3400293973542383e-05, 'episode': 5980, 'epoch': 0.73}



                                                                         
 73%|███████▎  | 1496/2041 [2:09:45<46:28,  5.12s/it]

{'eps': 0, 'objective/kl': 83.53181457519531, 'objective/entropy': 59.9533805847168, 'objective/non_score_reward': -4.176590442657471, 'objective/rlhf_reward': -6.231479167938232, 'objective/scores': -2.0548887252807617, 'policy/approxkl_avg': 0.008020291104912758, 'policy/clipfrac_avg': 0.06721697747707367, 'loss/policy_avg': -0.028636571019887924, 'loss/value_avg': 0.26077336072921753, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.112208366394043, 'val/ratio': 0.9972337484359741, 'val/ratio_var': 4.063411324750632e-06, 'val/num_eos_tokens': 0, 'lr': 1.337579617834395e-05, 'episode': 5984, 'epoch': 0.73}



                                                                         
 73%|███████▎  | 1497/2041 [2:09:50<46:27,  5.12s/it]

{'eps': 0, 'objective/kl': 78.67000579833984, 'objective/entropy': 54.390960693359375, 'objective/non_score_reward': -3.933500289916992, 'objective/rlhf_reward': -5.725838661193848, 'objective/scores': -1.7923386096954346, 'policy/approxkl_avg': 0.009387899190187454, 'policy/clipfrac_avg': 0.07429245114326477, 'loss/policy_avg': -0.02678735740482807, 'loss/value_avg': 0.40186989307403564, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9947248101234436, 'val/ratio': 0.9780617952346802, 'val/ratio_var': 0.00039946858305484056, 'val/num_eos_tokens': 0, 'lr': 1.3351298383145516e-05, 'episode': 5988, 'epoch': 0.73}



                                                                         
 73%|███████▎  | 1498/2041 [2:09:55<46:42,  5.16s/it]

{'eps': 0, 'objective/kl': 77.17866516113281, 'objective/entropy': 45.931800842285156, 'objective/non_score_reward': -3.858933448791504, 'objective/rlhf_reward': -5.483844757080078, 'objective/scores': -1.6249114274978638, 'policy/approxkl_avg': 0.008353246375918388, 'policy/clipfrac_avg': 0.06839622557163239, 'loss/policy_avg': -0.026248591020703316, 'loss/value_avg': 0.3749092221260071, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8821423053741455, 'val/ratio': 0.9894495606422424, 'val/ratio_var': 8.69421346578747e-05, 'val/num_eos_tokens': 0, 'lr': 1.3326800587947086e-05, 'episode': 5992, 'epoch': 0.73}



                                                                         
 73%|███████▎  | 1499/2041 [2:10:00<46:46,  5.18s/it]

{'eps': 0, 'objective/kl': 95.76766967773438, 'objective/entropy': 60.063655853271484, 'objective/non_score_reward': -4.788383483886719, 'objective/rlhf_reward': -6.812257766723633, 'objective/scores': -2.023874521255493, 'policy/approxkl_avg': 0.010765766724944115, 'policy/clipfrac_avg': 0.07783018797636032, 'loss/policy_avg': -0.03155943378806114, 'loss/value_avg': 0.4306682348251343, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0997180938720703, 'val/ratio': 0.9758822917938232, 'val/ratio_var': 0.00045153440441936255, 'val/num_eos_tokens': 0, 'lr': 1.3302302792748652e-05, 'episode': 5996, 'epoch': 0.73}



                                                                         
 73%|███████▎  | 1500/2041 [2:10:06<46:47,  5.19s/it]

{'eps': 0, 'objective/kl': 82.10670471191406, 'objective/entropy': 37.045650482177734, 'objective/non_score_reward': -4.105335235595703, 'objective/rlhf_reward': -6.2662200927734375, 'objective/scores': -2.1608846187591553, 'policy/approxkl_avg': 0.00790939386934042, 'policy/clipfrac_avg': 0.049528300762176514, 'loss/policy_avg': -0.02199019305408001, 'loss/value_avg': 0.36390459537506104, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7791751623153687, 'val/ratio': 0.9847870469093323, 'val/ratio_var': 0.00018898215785156935, 'val/num_eos_tokens': 0, 'lr': 1.3277804997550222e-05, 'episode': 6000, 'epoch': 0.74}



                                                                         
 74%|███████▎  | 1501/2041 [2:10:29<1:36:19, 10.70s/it]

{'eps': 0, 'objective/kl': 78.25118255615234, 'objective/entropy': 38.844085693359375, 'objective/non_score_reward': -3.9125592708587646, 'objective/rlhf_reward': -5.91628360748291, 'objective/scores': -2.0037245750427246, 'policy/approxkl_avg': 0.0067556267604231834, 'policy/clipfrac_avg': 0.056603774428367615, 'loss/policy_avg': -0.022964324802160263, 'loss/value_avg': 0.30955949425697327, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6836011409759521, 'val/ratio': 0.9922711849212646, 'val/ratio_var': 4.1176946979248896e-05, 'val/num_eos_tokens': 0, 'lr': 1.3253307202351788e-05, 'episode': 6004, 'epoch': 0.74}



                                                                         
 74%|███████▎  | 1502/2041 [2:10:34<1:21:21,  9.06s/it]

{'eps': 0, 'objective/kl': 95.25326538085938, 'objective/entropy': 44.690521240234375, 'objective/non_score_reward': -4.7626633644104, 'objective/rlhf_reward': -6.3589019775390625, 'objective/scores': -1.5962388515472412, 'policy/approxkl_avg': 0.008196039125323296, 'policy/clipfrac_avg': 0.056603774428367615, 'loss/policy_avg': -0.024299297481775284, 'loss/value_avg': 0.34498128294944763, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8122233152389526, 'val/ratio': 0.9891027212142944, 'val/ratio_var': 8.758457261137664e-05, 'val/num_eos_tokens': 0, 'lr': 1.3228809407153356e-05, 'episode': 6008, 'epoch': 0.74}



                                                                         
 74%|███████▎  | 1503/2041 [2:10:39<1:10:42,  7.89s/it]

{'eps': 0, 'objective/kl': 86.13522338867188, 'objective/entropy': 46.42509460449219, 'objective/non_score_reward': -4.306760787963867, 'objective/rlhf_reward': -6.269908905029297, 'objective/scores': -1.9631482362747192, 'policy/approxkl_avg': 0.012389185838401318, 'policy/clipfrac_avg': 0.0625, 'loss/policy_avg': -0.02800677716732025, 'loss/value_avg': 0.3401554822921753, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8902029991149902, 'val/ratio': 0.9807758927345276, 'val/ratio_var': 0.0002612092939671129, 'val/num_eos_tokens': 0, 'lr': 1.3204311611954926e-05, 'episode': 6012, 'epoch': 0.74}



                                                                         
 74%|███████▎  | 1504/2041 [2:10:45<1:03:11,  7.06s/it]

{'eps': 0, 'objective/kl': 94.37362670898438, 'objective/entropy': 59.63570022583008, 'objective/non_score_reward': -4.718681335449219, 'objective/rlhf_reward': -6.5240936279296875, 'objective/scores': -1.8054125308990479, 'policy/approxkl_avg': 0.015425832010805607, 'policy/clipfrac_avg': 0.10023584961891174, 'loss/policy_avg': -0.03625066578388214, 'loss/value_avg': 0.35428452491760254, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0641138553619385, 'val/ratio': 0.9998123049736023, 'val/ratio_var': 6.651885996689089e-06, 'val/num_eos_tokens': 0, 'lr': 1.3179813816756492e-05, 'episode': 6016, 'epoch': 0.74}



                                                                         
 74%|███████▎  | 1505/2041 [2:10:50<57:54,  6.48s/it]

{'eps': 0, 'objective/kl': 85.88162231445312, 'objective/entropy': 64.64789581298828, 'objective/non_score_reward': -4.294081687927246, 'objective/rlhf_reward': -6.43072509765625, 'objective/scores': -2.136643409729004, 'policy/approxkl_avg': 0.014486860483884811, 'policy/clipfrac_avg': 0.09787736088037491, 'loss/policy_avg': -0.037314273416996, 'loss/value_avg': 0.48426371812820435, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1313788890838623, 'val/ratio': 0.9919912815093994, 'val/ratio_var': 3.960622780141421e-05, 'val/num_eos_tokens': 0, 'lr': 1.3155316021558062e-05, 'episode': 6020, 'epoch': 0.74}



                                                                         
 74%|███████▍  | 1506/2041 [2:10:55<54:05,  6.07s/it]

{'eps': 0, 'objective/kl': 93.67991638183594, 'objective/entropy': 52.607765197753906, 'objective/non_score_reward': -4.683995246887207, 'objective/rlhf_reward': -6.141712188720703, 'objective/scores': -1.457716703414917, 'policy/approxkl_avg': 0.007579459343105555, 'policy/clipfrac_avg': 0.07311320304870605, 'loss/policy_avg': -0.026324395090341568, 'loss/value_avg': 0.3349802494049072, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8675442934036255, 'val/ratio': 0.981667160987854, 'val/ratio_var': 0.000256591010838747, 'val/num_eos_tokens': 0, 'lr': 1.3130818226359629e-05, 'episode': 6024, 'epoch': 0.74}



                                                                         
 74%|███████▍  | 1507/2041 [2:11:00<51:42,  5.81s/it]

{'eps': 0, 'objective/kl': 83.97138977050781, 'objective/entropy': 45.96445083618164, 'objective/non_score_reward': -4.198569297790527, 'objective/rlhf_reward': -6.5814008712768555, 'objective/scores': -2.382831573486328, 'policy/approxkl_avg': 0.005603234749287367, 'policy/clipfrac_avg': 0.057783015072345734, 'loss/policy_avg': -0.026964761316776276, 'loss/value_avg': 0.414156973361969, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9183642268180847, 'val/ratio': 0.9927082061767578, 'val/ratio_var': 3.3741907827788964e-05, 'val/num_eos_tokens': 0, 'lr': 1.3106320431161195e-05, 'episode': 6028, 'epoch': 0.74}



                                                                         
 74%|███████▍  | 1508/2041 [2:11:05<49:56,  5.62s/it]

{'eps': 0, 'objective/kl': 94.63798522949219, 'objective/entropy': 41.14537811279297, 'objective/non_score_reward': -4.731899261474609, 'objective/rlhf_reward': -6.362320899963379, 'objective/scores': -1.6304214000701904, 'policy/approxkl_avg': 0.004152117762714624, 'policy/clipfrac_avg': 0.03655660152435303, 'loss/policy_avg': -0.016438379883766174, 'loss/value_avg': 0.33730176091194153, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.759920060634613, 'val/ratio': 0.9922125339508057, 'val/ratio_var': 5.565362152992748e-05, 'val/num_eos_tokens': 0, 'lr': 1.3081822635962765e-05, 'episode': 6032, 'epoch': 0.74}



                                                                         
 74%|███████▍  | 1509/2041 [2:11:10<48:39,  5.49s/it]

{'eps': 0, 'objective/kl': 88.67728424072266, 'objective/entropy': 43.69785690307617, 'objective/non_score_reward': -4.433864593505859, 'objective/rlhf_reward': -6.395730018615723, 'objective/scores': -1.9618654251098633, 'policy/approxkl_avg': 0.004968148190528154, 'policy/clipfrac_avg': 0.048349056392908096, 'loss/policy_avg': -0.024085547775030136, 'loss/value_avg': 0.3107708692550659, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8184792995452881, 'val/ratio': 0.983237624168396, 'val/ratio_var': 0.00022062030620872974, 'val/num_eos_tokens': 0, 'lr': 1.3057324840764331e-05, 'episode': 6036, 'epoch': 0.74}



                                                                         
 74%|███████▍  | 1510/2041 [2:11:15<47:33,  5.37s/it]

{'eps': 0, 'objective/kl': 86.56915283203125, 'objective/entropy': 49.08511734008789, 'objective/non_score_reward': -4.328457832336426, 'objective/rlhf_reward': -6.442974090576172, 'objective/scores': -2.114516258239746, 'policy/approxkl_avg': 0.007205228786915541, 'policy/clipfrac_avg': 0.0624999962747097, 'loss/policy_avg': -0.02940881997346878, 'loss/value_avg': 0.40761038661003113, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8826977014541626, 'val/ratio': 0.982065737247467, 'val/ratio_var': 0.00025406168424524367, 'val/num_eos_tokens': 0, 'lr': 1.3032827045565899e-05, 'episode': 6040, 'epoch': 0.74}



                                                                         
 74%|███████▍  | 1511/2041 [2:11:21<46:37,  5.28s/it]

{'eps': 0, 'objective/kl': 84.65074920654297, 'objective/entropy': 38.97317886352539, 'objective/non_score_reward': -4.232537269592285, 'objective/rlhf_reward': -5.846685409545898, 'objective/scores': -1.6141479015350342, 'policy/approxkl_avg': 0.004920210223644972, 'policy/clipfrac_avg': 0.056603774428367615, 'loss/policy_avg': -0.02392091602087021, 'loss/value_avg': 0.3813546895980835, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7082811594009399, 'val/ratio': 0.9910727143287659, 'val/ratio_var': 7.125815318431705e-05, 'val/num_eos_tokens': 0, 'lr': 1.3008329250367469e-05, 'episode': 6044, 'epoch': 0.74}



                                                                         
 74%|███████▍  | 1512/2041 [2:11:26<46:16,  5.25s/it]

{'eps': 0, 'objective/kl': 82.14805603027344, 'objective/entropy': 45.30799865722656, 'objective/non_score_reward': -4.107402801513672, 'objective/rlhf_reward': -6.170698165893555, 'objective/scores': -2.063295364379883, 'policy/approxkl_avg': 0.010488529689610004, 'policy/clipfrac_avg': 0.07075472176074982, 'loss/policy_avg': -0.031031008809804916, 'loss/value_avg': 0.31514278054237366, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8371235728263855, 'val/ratio': 0.9886660575866699, 'val/ratio_var': 7.973417814355344e-05, 'val/num_eos_tokens': 0, 'lr': 1.2983831455169035e-05, 'episode': 6048, 'epoch': 0.74}



                                                                         
 74%|███████▍  | 1513/2041 [2:11:31<46:04,  5.24s/it]

{'eps': 0, 'objective/kl': 86.83287048339844, 'objective/entropy': 40.721439361572266, 'objective/non_score_reward': -4.341643333435059, 'objective/rlhf_reward': -6.0090718269348145, 'objective/scores': -1.6674283742904663, 'policy/approxkl_avg': 0.03599634766578674, 'policy/clipfrac_avg': 0.05896226316690445, 'loss/policy_avg': -0.023583728820085526, 'loss/value_avg': 0.34091126918792725, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8223462700843811, 'val/ratio': 0.9909811019897461, 'val/ratio_var': 4.9128899263450876e-05, 'val/num_eos_tokens': 0, 'lr': 1.2959333659970605e-05, 'episode': 6052, 'epoch': 0.74}



                                                                         
 74%|███████▍  | 1514/2041 [2:11:36<45:49,  5.22s/it]

{'eps': 0, 'objective/kl': 97.7388916015625, 'objective/entropy': 45.557865142822266, 'objective/non_score_reward': -4.886944770812988, 'objective/rlhf_reward': -6.265136241912842, 'objective/scores': -1.378191590309143, 'policy/approxkl_avg': 0.010068103671073914, 'policy/clipfrac_avg': 0.05424527823925018, 'loss/policy_avg': -0.028223812580108643, 'loss/value_avg': 0.4606250822544098, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8425257205963135, 'val/ratio': 0.9948279857635498, 'val/ratio_var': 1.5661453289794736e-05, 'val/num_eos_tokens': 0, 'lr': 1.2934835864772171e-05, 'episode': 6056, 'epoch': 0.74}



                                                                         
 74%|███████▍  | 1515/2041 [2:11:41<45:26,  5.18s/it]

{'eps': 0, 'objective/kl': 88.8411865234375, 'objective/entropy': 39.00799560546875, 'objective/non_score_reward': -4.442059516906738, 'objective/rlhf_reward': -6.214182376861572, 'objective/scores': -1.772122859954834, 'policy/approxkl_avg': 0.005545094609260559, 'policy/clipfrac_avg': 0.05070754885673523, 'loss/policy_avg': -0.021635811775922775, 'loss/value_avg': 0.42601916193962097, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7804749608039856, 'val/ratio': 0.9939519762992859, 'val/ratio_var': 2.2823616745881736e-05, 'val/num_eos_tokens': 0, 'lr': 1.2910338069573738e-05, 'episode': 6060, 'epoch': 0.74}



                                                                         
 74%|███████▍  | 1516/2041 [2:11:46<45:20,  5.18s/it]

{'eps': 0, 'objective/kl': 89.05328369140625, 'objective/entropy': 58.78426742553711, 'objective/non_score_reward': -4.452664375305176, 'objective/rlhf_reward': -6.311212539672852, 'objective/scores': -1.8585479259490967, 'policy/approxkl_avg': 0.006931445095688105, 'policy/clipfrac_avg': 0.0695754736661911, 'loss/policy_avg': -0.028691507875919342, 'loss/value_avg': 0.39604225754737854, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.008302927017212, 'val/ratio': 0.9894583225250244, 'val/ratio_var': 8.747776882955804e-05, 'val/num_eos_tokens': 0, 'lr': 1.2885840274375307e-05, 'episode': 6064, 'epoch': 0.74}



                                                                         
 74%|███████▍  | 1517/2041 [2:11:52<45:24,  5.20s/it]

{'eps': 0, 'objective/kl': 91.66698455810547, 'objective/entropy': 43.09302520751953, 'objective/non_score_reward': -4.583349227905273, 'objective/rlhf_reward': -6.043871879577637, 'objective/scores': -1.4605224132537842, 'policy/approxkl_avg': 0.007057417184114456, 'policy/clipfrac_avg': 0.056603774428367615, 'loss/policy_avg': -0.025260696187615395, 'loss/value_avg': 0.2315005511045456, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8358794450759888, 'val/ratio': 0.9952593445777893, 'val/ratio_var': 1.865644117060583e-05, 'val/num_eos_tokens': 0, 'lr': 1.2861342479176874e-05, 'episode': 6068, 'epoch': 0.74}



                                                                         
 74%|███████▍  | 1518/2041 [2:11:57<45:27,  5.22s/it]

{'eps': 0, 'objective/kl': 76.93280029296875, 'objective/entropy': 43.95159912109375, 'objective/non_score_reward': -3.846640110015869, 'objective/rlhf_reward': -6.354814052581787, 'objective/scores': -2.508173942565918, 'policy/approxkl_avg': 0.00799432210624218, 'policy/clipfrac_avg': 0.061320751905441284, 'loss/policy_avg': -0.023723460733890533, 'loss/value_avg': 0.3825998306274414, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.873512327671051, 'val/ratio': 0.9967572093009949, 'val/ratio_var': 6.028963980497792e-06, 'val/num_eos_tokens': 0, 'lr': 1.2836844683978442e-05, 'episode': 6072, 'epoch': 0.74}



                                                                         
 74%|███████▍  | 1519/2041 [2:12:02<45:16,  5.20s/it]

{'eps': 0, 'objective/kl': 96.30244445800781, 'objective/entropy': 40.20985794067383, 'objective/non_score_reward': -4.815122604370117, 'objective/rlhf_reward': -6.812511444091797, 'objective/scores': -1.9973889589309692, 'policy/approxkl_avg': 0.00813720840960741, 'policy/clipfrac_avg': 0.048349056392908096, 'loss/policy_avg': -0.027947254478931427, 'loss/value_avg': 0.41134488582611084, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6935899257659912, 'val/ratio': 0.97916579246521, 'val/ratio_var': 0.0003478862054180354, 'val/num_eos_tokens': 0, 'lr': 1.2812346888780011e-05, 'episode': 6076, 'epoch': 0.74}



                                                                         
 74%|███████▍  | 1520/2041 [2:12:07<45:22,  5.23s/it]

{'eps': 0, 'objective/kl': 101.64705657958984, 'objective/entropy': 42.15827178955078, 'objective/non_score_reward': -5.082352638244629, 'objective/rlhf_reward': -6.479528903961182, 'objective/scores': -1.3971761465072632, 'policy/approxkl_avg': 0.0076894983649253845, 'policy/clipfrac_avg': 0.05896226689219475, 'loss/policy_avg': -0.02447747066617012, 'loss/value_avg': 0.35111719369888306, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8956244587898254, 'val/ratio': 0.9981397390365601, 'val/ratio_var': 1.6570824072914547e-06, 'val/num_eos_tokens': 0, 'lr': 1.2787849093581578e-05, 'episode': 6080, 'epoch': 0.74}



                                                                         
 75%|███████▍  | 1521/2041 [2:12:13<45:09,  5.21s/it]

{'eps': 0, 'objective/kl': 92.27665710449219, 'objective/entropy': 56.12299346923828, 'objective/non_score_reward': -4.613832950592041, 'objective/rlhf_reward': -6.8794264793396, 'objective/scores': -2.2655935287475586, 'policy/approxkl_avg': 0.008225667290389538, 'policy/clipfrac_avg': 0.0624999962747097, 'loss/policy_avg': -0.026616891846060753, 'loss/value_avg': 0.4327602982521057, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9811879396438599, 'val/ratio': 1.002558946609497, 'val/ratio_var': 7.22317827239749e-06, 'val/num_eos_tokens': 0, 'lr': 1.2763351298383148e-05, 'episode': 6084, 'epoch': 0.75}



                                                                         
 75%|███████▍  | 1522/2041 [2:12:18<44:53,  5.19s/it]

{'eps': 0, 'objective/kl': 105.19013214111328, 'objective/entropy': 63.233314514160156, 'objective/non_score_reward': -5.259507179260254, 'objective/rlhf_reward': -6.48992919921875, 'objective/scores': -1.230421781539917, 'policy/approxkl_avg': 0.08095352351665497, 'policy/clipfrac_avg': 0.09198113530874252, 'loss/policy_avg': -0.03213973343372345, 'loss/value_avg': 0.5515218377113342, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0507757663726807, 'val/ratio': 0.9665063619613647, 'val/ratio_var': 0.00072452676249668, 'val/num_eos_tokens': 0, 'lr': 1.2738853503184714e-05, 'episode': 6088, 'epoch': 0.75}



                                                                         
 75%|███████▍  | 1523/2041 [2:12:23<45:00,  5.21s/it]

{'eps': 0, 'objective/kl': 97.06358337402344, 'objective/entropy': 57.84217834472656, 'objective/non_score_reward': -4.853178977966309, 'objective/rlhf_reward': -6.781147003173828, 'objective/scores': -1.927968144416809, 'policy/approxkl_avg': 0.016412634402513504, 'policy/clipfrac_avg': 0.08844339847564697, 'loss/policy_avg': -0.03438596799969673, 'loss/value_avg': 0.3140419125556946, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.072288990020752, 'val/ratio': 1.0164549350738525, 'val/ratio_var': 0.0003725361020769924, 'val/num_eos_tokens': 0, 'lr': 1.271435570798628e-05, 'episode': 6092, 'epoch': 0.75}



                                                                         
 75%|███████▍  | 1524/2041 [2:12:28<44:46,  5.20s/it]

{'eps': 0, 'objective/kl': 101.59038543701172, 'objective/entropy': 50.723655700683594, 'objective/non_score_reward': -5.079519271850586, 'objective/rlhf_reward': -6.3885650634765625, 'objective/scores': -1.3090455532073975, 'policy/approxkl_avg': 0.007906455546617508, 'policy/clipfrac_avg': 0.0695754662156105, 'loss/policy_avg': -0.032460324466228485, 'loss/value_avg': 0.3388081192970276, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9661864042282104, 'val/ratio': 0.9890953302383423, 'val/ratio_var': 8.908885502023622e-05, 'val/num_eos_tokens': 0, 'lr': 1.268985791278785e-05, 'episode': 6096, 'epoch': 0.75}



                                                                         
 75%|███████▍  | 1525/2041 [2:12:33<44:31,  5.18s/it]

{'eps': 0, 'objective/kl': 82.75576782226562, 'objective/entropy': 47.72607421875, 'objective/non_score_reward': -4.137788772583008, 'objective/rlhf_reward': -6.067248344421387, 'objective/scores': -1.929459810256958, 'policy/approxkl_avg': 0.015717394649982452, 'policy/clipfrac_avg': 0.07193396240472794, 'loss/policy_avg': -0.028089597821235657, 'loss/value_avg': 0.3819183111190796, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9009044170379639, 'val/ratio': 0.9897968769073486, 'val/ratio_var': 6.269665755098686e-05, 'val/num_eos_tokens': 0, 'lr': 1.2665360117589416e-05, 'episode': 6100, 'epoch': 0.75}



                                                                         
 75%|███████▍  | 1526/2041 [2:12:38<44:33,  5.19s/it]

{'eps': 0, 'objective/kl': 100.70779418945312, 'objective/entropy': 64.64022827148438, 'objective/non_score_reward': -5.0353899002075195, 'objective/rlhf_reward': -6.732090950012207, 'objective/scores': -1.6967010498046875, 'policy/approxkl_avg': 0.01512716244906187, 'policy/clipfrac_avg': 0.09787736088037491, 'loss/policy_avg': -0.03335746377706528, 'loss/value_avg': 0.4150531589984894, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0813096761703491, 'val/ratio': 0.9828224182128906, 'val/ratio_var': 0.00019538051856216043, 'val/num_eos_tokens': 0, 'lr': 1.2640862322390986e-05, 'episode': 6104, 'epoch': 0.75}



                                                                         
 75%|███████▍  | 1527/2041 [2:12:44<44:22,  5.18s/it]

{'eps': 0, 'objective/kl': 95.64274597167969, 'objective/entropy': 59.90799331665039, 'objective/non_score_reward': -4.782137393951416, 'objective/rlhf_reward': -5.8405961990356445, 'objective/scores': -1.058458924293518, 'policy/approxkl_avg': 0.012273432686924934, 'policy/clipfrac_avg': 0.07311321049928665, 'loss/policy_avg': -0.031105363741517067, 'loss/value_avg': 0.2575587034225464, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9492465853691101, 'val/ratio': 0.9941208958625793, 'val/ratio_var': 2.837718238879461e-05, 'val/num_eos_tokens': 0, 'lr': 1.2616364527192554e-05, 'episode': 6108, 'epoch': 0.75}



                                                                         
 75%|███████▍  | 1528/2041 [2:12:49<44:13,  5.17s/it]

{'eps': 0, 'objective/kl': 87.60884094238281, 'objective/entropy': 51.92795944213867, 'objective/non_score_reward': -4.380442142486572, 'objective/rlhf_reward': -6.117474555969238, 'objective/scores': -1.7370326519012451, 'policy/approxkl_avg': 0.011049837805330753, 'policy/clipfrac_avg': 0.06603772938251495, 'loss/policy_avg': -0.02783537656068802, 'loss/value_avg': 0.3087480068206787, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9420858025550842, 'val/ratio': 0.9916985034942627, 'val/ratio_var': 4.431720299180597e-05, 'val/num_eos_tokens': 0, 'lr': 1.259186673199412e-05, 'episode': 6112, 'epoch': 0.75}



                                                                         
 75%|███████▍  | 1529/2041 [2:12:54<44:06,  5.17s/it]

{'eps': 0, 'objective/kl': 97.176025390625, 'objective/entropy': 59.26736831665039, 'objective/non_score_reward': -4.85880184173584, 'objective/rlhf_reward': -6.569382667541504, 'objective/scores': -1.710580825805664, 'policy/approxkl_avg': 0.009969356469810009, 'policy/clipfrac_avg': 0.07311321049928665, 'loss/policy_avg': -0.029734982177615166, 'loss/value_avg': 0.3854793906211853, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0556557178497314, 'val/ratio': 0.9849090576171875, 'val/ratio_var': 0.00016914716979954392, 'val/num_eos_tokens': 0, 'lr': 1.256736893679569e-05, 'episode': 6116, 'epoch': 0.75}



                                                                         
 75%|███████▍  | 1530/2041 [2:12:59<44:04,  5.17s/it]

{'eps': 0, 'objective/kl': 95.12728118896484, 'objective/entropy': 62.26658630371094, 'objective/non_score_reward': -4.756364345550537, 'objective/rlhf_reward': -6.474687576293945, 'objective/scores': -1.7183234691619873, 'policy/approxkl_avg': 0.024944834411144257, 'policy/clipfrac_avg': 0.07311320304870605, 'loss/policy_avg': -0.034616194665431976, 'loss/value_avg': 0.3003165125846863, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.125978708267212, 'val/ratio': 0.993365466594696, 'val/ratio_var': 2.0478271835600026e-05, 'val/num_eos_tokens': 0, 'lr': 1.2542871141597257e-05, 'episode': 6120, 'epoch': 0.75}



                                                                         
 75%|███████▌  | 1531/2041 [2:13:04<43:59,  5.18s/it]

{'eps': 0, 'objective/kl': 94.66178894042969, 'objective/entropy': 57.77359390258789, 'objective/non_score_reward': -4.733089923858643, 'objective/rlhf_reward': -6.310385704040527, 'objective/scores': -1.5772955417633057, 'policy/approxkl_avg': 0.014526975341141224, 'policy/clipfrac_avg': 0.0625, 'loss/policy_avg': -0.029208093881607056, 'loss/value_avg': 0.3148443400859833, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.001821517944336, 'val/ratio': 0.9931254386901855, 'val/ratio_var': 2.7858841349370778e-05, 'val/num_eos_tokens': 0, 'lr': 1.2518373346398823e-05, 'episode': 6124, 'epoch': 0.75}



                                                                         
 75%|███████▌  | 1532/2041 [2:13:09<43:50,  5.17s/it]

{'eps': 0, 'objective/kl': 90.19941711425781, 'objective/entropy': 55.983123779296875, 'objective/non_score_reward': -4.509970664978027, 'objective/rlhf_reward': -6.32392692565918, 'objective/scores': -1.8139560222625732, 'policy/approxkl_avg': 0.015207202173769474, 'policy/clipfrac_avg': 0.07429245114326477, 'loss/policy_avg': -0.029964204877614975, 'loss/value_avg': 0.28724396228790283, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9211587309837341, 'val/ratio': 0.9820784330368042, 'val/ratio_var': 0.0002038150414591655, 'val/num_eos_tokens': 0, 'lr': 1.2493875551200393e-05, 'episode': 6128, 'epoch': 0.75}



                                                                         
 75%|███████▌  | 1533/2041 [2:13:15<43:45,  5.17s/it]

{'eps': 0, 'objective/kl': 91.70541381835938, 'objective/entropy': 65.73426818847656, 'objective/non_score_reward': -4.585270881652832, 'objective/rlhf_reward': -6.425287246704102, 'objective/scores': -1.8400161266326904, 'policy/approxkl_avg': 0.013680476695299149, 'policy/clipfrac_avg': 0.0837264209985733, 'loss/policy_avg': -0.03575681895017624, 'loss/value_avg': 0.3868409991264343, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.977587103843689, 'val/ratio': 0.9847224354743958, 'val/ratio_var': 0.0001644060102989897, 'val/num_eos_tokens': 0, 'lr': 1.2469377756001959e-05, 'episode': 6132, 'epoch': 0.75}



                                                                         
 75%|███████▌  | 1534/2041 [2:13:20<43:33,  5.16s/it]

{'eps': 0, 'objective/kl': 85.7279281616211, 'objective/entropy': 57.948089599609375, 'objective/non_score_reward': -4.286396503448486, 'objective/rlhf_reward': -6.015649795532227, 'objective/scores': -1.7292535305023193, 'policy/approxkl_avg': 0.010232887230813503, 'policy/clipfrac_avg': 0.06721697747707367, 'loss/policy_avg': -0.026252057403326035, 'loss/value_avg': 0.3907109498977661, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9560608863830566, 'val/ratio': 0.9953784942626953, 'val/ratio_var': 1.0744174687715713e-05, 'val/num_eos_tokens': 0, 'lr': 1.2444879960803529e-05, 'episode': 6136, 'epoch': 0.75}



                                                                         
 75%|███████▌  | 1535/2041 [2:13:25<43:49,  5.20s/it]

{'eps': 0, 'objective/kl': 96.06764221191406, 'objective/entropy': 49.99972152709961, 'objective/non_score_reward': -4.80338191986084, 'objective/rlhf_reward': -6.94162654876709, 'objective/scores': -2.138244390487671, 'policy/approxkl_avg': 0.01889628916978836, 'policy/clipfrac_avg': 0.06132075563073158, 'loss/policy_avg': -0.02850940078496933, 'loss/value_avg': 0.3737722635269165, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9573286175727844, 'val/ratio': 0.9833918809890747, 'val/ratio_var': 0.0001902938965940848, 'val/num_eos_tokens': 0, 'lr': 1.2420382165605097e-05, 'episode': 6140, 'epoch': 0.75}



                                                                         
 75%|███████▌  | 1536/2041 [2:13:30<43:43,  5.19s/it]

{'eps': 0, 'objective/kl': 83.98348999023438, 'objective/entropy': 44.28691864013672, 'objective/non_score_reward': -4.199174404144287, 'objective/rlhf_reward': -5.999763488769531, 'objective/scores': -1.8005893230438232, 'policy/approxkl_avg': 0.004499385133385658, 'policy/clipfrac_avg': 0.053066037595272064, 'loss/policy_avg': -0.025119177997112274, 'loss/value_avg': 0.35727065801620483, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0259915590286255, 'val/ratio': 0.9964995384216309, 'val/ratio_var': 9.015948307933286e-06, 'val/num_eos_tokens': 0, 'lr': 1.2395884370406665e-05, 'episode': 6144, 'epoch': 0.75}



                                                                         
 75%|███████▌  | 1537/2041 [2:13:35<43:17,  5.15s/it]

{'eps': 0, 'objective/kl': 91.77050018310547, 'objective/entropy': 45.42988586425781, 'objective/non_score_reward': -4.58852481842041, 'objective/rlhf_reward': -6.343168258666992, 'objective/scores': -1.754643440246582, 'policy/approxkl_avg': 0.012605860829353333, 'policy/clipfrac_avg': 0.06721697747707367, 'loss/policy_avg': -0.028404587879776955, 'loss/value_avg': 0.38602811098098755, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8206308484077454, 'val/ratio': 0.9794962406158447, 'val/ratio_var': 0.0003134310245513916, 'val/num_eos_tokens': 0, 'lr': 1.2371386575208233e-05, 'episode': 6148, 'epoch': 0.75}



                                                                         
 75%|███████▌  | 1538/2041 [2:13:40<43:06,  5.14s/it]

{'eps': 0, 'objective/kl': 98.12310791015625, 'objective/entropy': 46.85420227050781, 'objective/non_score_reward': -4.906155586242676, 'objective/rlhf_reward': -7.030460357666016, 'objective/scores': -2.1243045330047607, 'policy/approxkl_avg': 0.013728775084018707, 'policy/clipfrac_avg': 0.0766509473323822, 'loss/policy_avg': -0.030715711414813995, 'loss/value_avg': 0.4925970733165741, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9896279573440552, 'val/ratio': 0.9776501655578613, 'val/ratio_var': 0.0003546015650499612, 'val/num_eos_tokens': 0, 'lr': 1.23468887800098e-05, 'episode': 6152, 'epoch': 0.75}



                                                                         
 75%|███████▌  | 1539/2041 [2:13:46<43:19,  5.18s/it]

{'eps': 0, 'objective/kl': 101.74809265136719, 'objective/entropy': 47.85166549682617, 'objective/non_score_reward': -5.087404251098633, 'objective/rlhf_reward': -7.002092361450195, 'objective/scores': -1.9146883487701416, 'policy/approxkl_avg': 0.008104118518531322, 'policy/clipfrac_avg': 0.07075471431016922, 'loss/policy_avg': -0.029713016003370285, 'loss/value_avg': 0.5024428963661194, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8395659923553467, 'val/ratio': 0.9847389459609985, 'val/ratio_var': 0.00017537524399813265, 'val/num_eos_tokens': 0, 'lr': 1.2322390984811367e-05, 'episode': 6156, 'epoch': 0.75}



                                                                         
 75%|███████▌  | 1540/2041 [2:13:51<43:12,  5.18s/it]

{'eps': 0, 'objective/kl': 105.93561553955078, 'objective/entropy': 64.15023803710938, 'objective/non_score_reward': -5.296780586242676, 'objective/rlhf_reward': -7.433130264282227, 'objective/scores': -2.136349678039551, 'policy/approxkl_avg': 0.008557615801692009, 'policy/clipfrac_avg': 0.06367924809455872, 'loss/policy_avg': -0.029718603938817978, 'loss/value_avg': 0.5481228232383728, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0104937553405762, 'val/ratio': 0.9843521118164062, 'val/ratio_var': 0.00019320560386404395, 'val/num_eos_tokens': 0, 'lr': 1.2297893189612935e-05, 'episode': 6160, 'epoch': 0.75}



                                                                         
 76%|███████▌  | 1541/2041 [2:13:56<43:01,  5.16s/it]

{'eps': 0, 'objective/kl': 79.64453125, 'objective/entropy': 40.84341049194336, 'objective/non_score_reward': -3.982227087020874, 'objective/rlhf_reward': -6.270514488220215, 'objective/scores': -2.288287401199341, 'policy/approxkl_avg': 0.00802298542112112, 'policy/clipfrac_avg': 0.05896226689219475, 'loss/policy_avg': -0.024271586909890175, 'loss/value_avg': 0.31065797805786133, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8042268753051758, 'val/ratio': 0.998508095741272, 'val/ratio_var': 1.1212549679839867e-06, 'val/num_eos_tokens': 0, 'lr': 1.2273395394414503e-05, 'episode': 6164, 'epoch': 0.76}



                                                                         
 76%|███████▌  | 1542/2041 [2:14:01<42:49,  5.15s/it]

{'eps': 0, 'objective/kl': 86.00865173339844, 'objective/entropy': 44.96644973754883, 'objective/non_score_reward': -4.3004326820373535, 'objective/rlhf_reward': -6.361085414886475, 'objective/scores': -2.060652732849121, 'policy/approxkl_avg': 0.00952487625181675, 'policy/clipfrac_avg': 0.060141511261463165, 'loss/policy_avg': -0.025882922112941742, 'loss/value_avg': 0.3438469469547272, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7811890244483948, 'val/ratio': 0.9806541800498962, 'val/ratio_var': 0.0003080676542595029, 'val/num_eos_tokens': 0, 'lr': 1.2248897599216071e-05, 'episode': 6168, 'epoch': 0.76}



                                                                         
 76%|███████▌  | 1543/2041 [2:14:06<42:59,  5.18s/it]

{'eps': 0, 'objective/kl': 79.1943359375, 'objective/entropy': 54.136444091796875, 'objective/non_score_reward': -3.959716796875, 'objective/rlhf_reward': -6.52455997467041, 'objective/scores': -2.564842939376831, 'policy/approxkl_avg': 0.011552653275430202, 'policy/clipfrac_avg': 0.08962263911962509, 'loss/policy_avg': -0.0377422496676445, 'loss/value_avg': 0.4248560667037964, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.005110502243042, 'val/ratio': 0.993541955947876, 'val/ratio_var': 2.1994937924318947e-05, 'val/num_eos_tokens': 0, 'lr': 1.222439980401764e-05, 'episode': 6172, 'epoch': 0.76}



                                                                         
 76%|███████▌  | 1544/2041 [2:14:12<42:59,  5.19s/it]

{'eps': 0, 'objective/kl': 97.34925842285156, 'objective/entropy': 47.15973663330078, 'objective/non_score_reward': -4.867463111877441, 'objective/rlhf_reward': -6.556454658508301, 'objective/scores': -1.6889914274215698, 'policy/approxkl_avg': 0.008198323659598827, 'policy/clipfrac_avg': 0.05070754885673523, 'loss/policy_avg': -0.02576381526887417, 'loss/value_avg': 0.28096288442611694, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8750447034835815, 'val/ratio': 0.9955847263336182, 'val/ratio_var': 1.3962419870949816e-05, 'val/num_eos_tokens': 0, 'lr': 1.2199902008819207e-05, 'episode': 6176, 'epoch': 0.76}



                                                                         
 76%|███████▌  | 1545/2041 [2:14:17<43:05,  5.21s/it]

{'eps': 0, 'objective/kl': 92.2086181640625, 'objective/entropy': 42.04558181762695, 'objective/non_score_reward': -4.610430717468262, 'objective/rlhf_reward': -6.77699089050293, 'objective/scores': -2.166560173034668, 'policy/approxkl_avg': 0.006656062789261341, 'policy/clipfrac_avg': 0.056603774428367615, 'loss/policy_avg': -0.02535221539437771, 'loss/value_avg': 0.3428768515586853, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8861579895019531, 'val/ratio': 0.9900529980659485, 'val/ratio_var': 7.646298035979271e-05, 'val/num_eos_tokens': 0, 'lr': 1.2175404213620775e-05, 'episode': 6180, 'epoch': 0.76}



                                                                         
 76%|███████▌  | 1546/2041 [2:14:22<42:49,  5.19s/it]

{'eps': 0, 'objective/kl': 86.28578186035156, 'objective/entropy': 65.92903137207031, 'objective/non_score_reward': -4.314289093017578, 'objective/rlhf_reward': -6.456875324249268, 'objective/scores': -2.1425862312316895, 'policy/approxkl_avg': 0.012651483528316021, 'policy/clipfrac_avg': 0.08254717290401459, 'loss/policy_avg': -0.03580695390701294, 'loss/value_avg': 0.3110842704772949, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0679969787597656, 'val/ratio': 0.999769389629364, 'val/ratio_var': 6.8092194851487875e-06, 'val/num_eos_tokens': 0, 'lr': 1.2150906418422342e-05, 'episode': 6184, 'epoch': 0.76}



                                                                         
 76%|███████▌  | 1547/2041 [2:14:27<42:33,  5.17s/it]

{'eps': 0, 'objective/kl': 100.10350036621094, 'objective/entropy': 56.70228576660156, 'objective/non_score_reward': -5.00517463684082, 'objective/rlhf_reward': -6.2476348876953125, 'objective/scores': -1.2424601316452026, 'policy/approxkl_avg': 0.007554417010396719, 'policy/clipfrac_avg': 0.04363207519054413, 'loss/policy_avg': -0.02248322032392025, 'loss/value_avg': 0.4405975043773651, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0896185636520386, 'val/ratio': 0.9799606800079346, 'val/ratio_var': 0.000326242734445259, 'val/num_eos_tokens': 0, 'lr': 1.212640862322391e-05, 'episode': 6188, 'epoch': 0.76}



                                                                         
 76%|███████▌  | 1548/2041 [2:14:32<42:23,  5.16s/it]

{'eps': 0, 'objective/kl': 93.50080871582031, 'objective/entropy': 52.77660369873047, 'objective/non_score_reward': -4.675040245056152, 'objective/rlhf_reward': -7.240594863891602, 'objective/scores': -2.565554618835449, 'policy/approxkl_avg': 0.007612423971295357, 'policy/clipfrac_avg': 0.06132075563073158, 'loss/policy_avg': -0.027823742479085922, 'loss/value_avg': 0.37743228673934937, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8387302160263062, 'val/ratio': 0.9864011406898499, 'val/ratio_var': 0.0001345009804936126, 'val/num_eos_tokens': 0, 'lr': 1.2101910828025478e-05, 'episode': 6192, 'epoch': 0.76}



                                                                         
 76%|███████▌  | 1549/2041 [2:14:37<42:20,  5.16s/it]

{'eps': 0, 'objective/kl': 93.35848999023438, 'objective/entropy': 58.515113830566406, 'objective/non_score_reward': -4.667924880981445, 'objective/rlhf_reward': -6.577663898468018, 'objective/scores': -1.9097391366958618, 'policy/approxkl_avg': 0.00892502348870039, 'policy/clipfrac_avg': 0.06485849618911743, 'loss/policy_avg': -0.03324412927031517, 'loss/value_avg': 0.2766641080379486, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0550371408462524, 'val/ratio': 0.9844232201576233, 'val/ratio_var': 0.00017874019977170974, 'val/num_eos_tokens': 0, 'lr': 1.2077413032827046e-05, 'episode': 6196, 'epoch': 0.76}



                                                                         
 76%|███████▌  | 1550/2041 [2:14:42<42:04,  5.14s/it]

{'eps': 0, 'objective/kl': 90.59049987792969, 'objective/entropy': 53.57243347167969, 'objective/non_score_reward': -4.529524803161621, 'objective/rlhf_reward': -6.850851058959961, 'objective/scores': -2.32132625579834, 'policy/approxkl_avg': 0.0108264135196805, 'policy/clipfrac_avg': 0.07311320304870605, 'loss/policy_avg': -0.03162321075797081, 'loss/value_avg': 0.4252316653728485, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0376780033111572, 'val/ratio': 0.9878363013267517, 'val/ratio_var': 9.388036414748058e-05, 'val/num_eos_tokens': 0, 'lr': 1.2052915237628614e-05, 'episode': 6200, 'epoch': 0.76}



                                                                         
 76%|███████▌  | 1551/2041 [2:14:48<41:56,  5.14s/it]

{'eps': 0, 'objective/kl': 87.92510986328125, 'objective/entropy': 51.00679397583008, 'objective/non_score_reward': -4.396255970001221, 'objective/rlhf_reward': -6.216962814331055, 'objective/scores': -1.8207067251205444, 'policy/approxkl_avg': 0.013502088375389576, 'policy/clipfrac_avg': 0.07075472176074982, 'loss/policy_avg': -0.024422019720077515, 'loss/value_avg': 0.38232025504112244, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8232161998748779, 'val/ratio': 0.9890607595443726, 'val/ratio_var': 8.715433796169236e-05, 'val/num_eos_tokens': 0, 'lr': 1.2028417442430182e-05, 'episode': 6204, 'epoch': 0.76}



                                                                         
 76%|███████▌  | 1552/2041 [2:14:53<41:58,  5.15s/it]

{'eps': 0, 'objective/kl': 96.82601928710938, 'objective/entropy': 41.44815444946289, 'objective/non_score_reward': -4.841301441192627, 'objective/rlhf_reward': -6.7640252113342285, 'objective/scores': -1.9227237701416016, 'policy/approxkl_avg': 0.007150549441576004, 'policy/clipfrac_avg': 0.05188678950071335, 'loss/policy_avg': -0.023460187017917633, 'loss/value_avg': 0.4335356056690216, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8114961385726929, 'val/ratio': 0.9853461980819702, 'val/ratio_var': 0.00017026298155542463, 'val/num_eos_tokens': 0, 'lr': 1.200391964723175e-05, 'episode': 6208, 'epoch': 0.76}



                                                                         
 76%|███████▌  | 1553/2041 [2:14:58<41:48,  5.14s/it]

{'eps': 0, 'objective/kl': 103.87055206298828, 'objective/entropy': 51.06540298461914, 'objective/non_score_reward': -5.193527698516846, 'objective/rlhf_reward': -6.926438331604004, 'objective/scores': -1.7329105138778687, 'policy/approxkl_avg': 0.00916070956736803, 'policy/clipfrac_avg': 0.0625, 'loss/policy_avg': -0.02919572964310646, 'loss/value_avg': 0.4257361590862274, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8860008716583252, 'val/ratio': 0.985997200012207, 'val/ratio_var': 0.00013334256072994322, 'val/num_eos_tokens': 0, 'lr': 1.1979421852033318e-05, 'episode': 6212, 'epoch': 0.76}



                                                                         
 76%|███████▌  | 1554/2041 [2:15:03<41:54,  5.16s/it]

{'eps': 0, 'objective/kl': 88.28968811035156, 'objective/entropy': 43.085025787353516, 'objective/non_score_reward': -4.414484024047852, 'objective/rlhf_reward': -6.832400321960449, 'objective/scores': -2.4179160594940186, 'policy/approxkl_avg': 0.007709749508649111, 'policy/clipfrac_avg': 0.048349056392908096, 'loss/policy_avg': -0.020015602931380272, 'loss/value_avg': 0.4458661675453186, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8611563444137573, 'val/ratio': 0.9924598336219788, 'val/ratio_var': 3.8167581806192175e-05, 'val/num_eos_tokens': 0, 'lr': 1.1954924056834886e-05, 'episode': 6216, 'epoch': 0.76}



                                                                         
 76%|███████▌  | 1555/2041 [2:15:08<41:55,  5.18s/it]

{'eps': 0, 'objective/kl': 74.05493927001953, 'objective/entropy': 46.201358795166016, 'objective/non_score_reward': -3.702746868133545, 'objective/rlhf_reward': -5.699187278747559, 'objective/scores': -1.9964401721954346, 'policy/approxkl_avg': 0.005186061374843121, 'policy/clipfrac_avg': 0.04716981202363968, 'loss/policy_avg': -0.020518630743026733, 'loss/value_avg': 0.303654283285141, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8071313500404358, 'val/ratio': 0.9932416677474976, 'val/ratio_var': 3.148945688735694e-05, 'val/num_eos_tokens': 0, 'lr': 1.1930426261636453e-05, 'episode': 6220, 'epoch': 0.76}



                                                                         
 76%|███████▌  | 1556/2041 [2:15:14<41:56,  5.19s/it]

{'eps': 0, 'objective/kl': 90.69575500488281, 'objective/entropy': 66.47990417480469, 'objective/non_score_reward': -4.534788131713867, 'objective/rlhf_reward': -7.3061323165893555, 'objective/scores': -2.7713441848754883, 'policy/approxkl_avg': 0.01943640597164631, 'policy/clipfrac_avg': 0.08608490973711014, 'loss/policy_avg': -0.03799965977668762, 'loss/value_avg': 0.6519347429275513, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1535269021987915, 'val/ratio': 0.9769113063812256, 'val/ratio_var': 0.00035510887391865253, 'val/num_eos_tokens': 0, 'lr': 1.190592846643802e-05, 'episode': 6224, 'epoch': 0.76}



                                                                         
 76%|███████▋  | 1557/2041 [2:15:19<41:54,  5.20s/it]

{'eps': 0, 'objective/kl': 78.58726501464844, 'objective/entropy': 38.810829162597656, 'objective/non_score_reward': -3.929363250732422, 'objective/rlhf_reward': -5.803315162658691, 'objective/scores': -1.873952031135559, 'policy/approxkl_avg': 0.004459768068045378, 'policy/clipfrac_avg': 0.05424528568983078, 'loss/policy_avg': -0.026168981567025185, 'loss/value_avg': 0.27000394463539124, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7207348346710205, 'val/ratio': 0.983900249004364, 'val/ratio_var': 0.00024203489010687917, 'val/num_eos_tokens': 0, 'lr': 1.1881430671239589e-05, 'episode': 6228, 'epoch': 0.76}



                                                                         
 76%|███████▋  | 1558/2041 [2:15:24<41:45,  5.19s/it]

{'eps': 0, 'objective/kl': 95.01615905761719, 'objective/entropy': 53.461570739746094, 'objective/non_score_reward': -4.750807762145996, 'objective/rlhf_reward': -6.6178741455078125, 'objective/scores': -1.8670661449432373, 'policy/approxkl_avg': 0.00803634338080883, 'policy/clipfrac_avg': 0.060141511261463165, 'loss/policy_avg': -0.024541905149817467, 'loss/value_avg': 0.34105709195137024, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.910049319267273, 'val/ratio': 0.9822511076927185, 'val/ratio_var': 0.00026320540928281844, 'val/num_eos_tokens': 0, 'lr': 1.1856932876041157e-05, 'episode': 6232, 'epoch': 0.76}



                                                                         
 76%|███████▋  | 1559/2041 [2:15:29<41:41,  5.19s/it]

{'eps': 0, 'objective/kl': 90.2029037475586, 'objective/entropy': 46.799190521240234, 'objective/non_score_reward': -4.51014518737793, 'objective/rlhf_reward': -5.848194122314453, 'objective/scores': -1.3380486965179443, 'policy/approxkl_avg': 0.0034375335089862347, 'policy/clipfrac_avg': 0.029481131583452225, 'loss/policy_avg': -0.017569823190569878, 'loss/value_avg': 0.2520386278629303, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8283171057701111, 'val/ratio': 0.9868261814117432, 'val/ratio_var': 0.00015915198309812695, 'val/num_eos_tokens': 0, 'lr': 1.1832435080842725e-05, 'episode': 6236, 'epoch': 0.76}



                                                                         
 76%|███████▋  | 1560/2041 [2:15:34<41:30,  5.18s/it]

{'eps': 0, 'objective/kl': 101.11009979248047, 'objective/entropy': 38.33900833129883, 'objective/non_score_reward': -5.05550479888916, 'objective/rlhf_reward': -6.319307327270508, 'objective/scores': -1.2638025283813477, 'policy/approxkl_avg': 0.017036989331245422, 'policy/clipfrac_avg': 0.0695754662156105, 'loss/policy_avg': -0.028852911666035652, 'loss/value_avg': 0.2666636109352112, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.81461501121521, 'val/ratio': 0.9821919202804565, 'val/ratio_var': 0.00024154142010957003, 'val/num_eos_tokens': 0, 'lr': 1.1807937285644293e-05, 'episode': 6240, 'epoch': 0.76}



                                                                         
 76%|███████▋  | 1561/2041 [2:15:39<41:07,  5.14s/it]

{'eps': 0, 'objective/kl': 95.52430725097656, 'objective/entropy': 49.289947509765625, 'objective/non_score_reward': -4.776215076446533, 'objective/rlhf_reward': -6.823306083679199, 'objective/scores': -2.047090768814087, 'policy/approxkl_avg': 0.006781540811061859, 'policy/clipfrac_avg': 0.0624999962747097, 'loss/policy_avg': -0.02818152867257595, 'loss/value_avg': 0.3944811522960663, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.870538592338562, 'val/ratio': 0.9900615215301514, 'val/ratio_var': 7.95613814261742e-05, 'val/num_eos_tokens': 0, 'lr': 1.178343949044586e-05, 'episode': 6244, 'epoch': 0.76}



                                                                         
 77%|███████▋  | 1562/2041 [2:15:45<41:20,  5.18s/it]

{'eps': 0, 'objective/kl': 84.18919372558594, 'objective/entropy': 42.71063232421875, 'objective/non_score_reward': -4.209460258483887, 'objective/rlhf_reward': -6.697194576263428, 'objective/scores': -2.487734317779541, 'policy/approxkl_avg': 0.00678712222725153, 'policy/clipfrac_avg': 0.061320751905441284, 'loss/policy_avg': -0.028850749135017395, 'loss/value_avg': 0.35232287645339966, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9345993995666504, 'val/ratio': 0.9788670539855957, 'val/ratio_var': 0.00035055214539170265, 'val/num_eos_tokens': 0, 'lr': 1.1758941695247429e-05, 'episode': 6248, 'epoch': 0.77}



                                                                         
 77%|███████▋  | 1563/2041 [2:15:50<41:17,  5.18s/it]

{'eps': 0, 'objective/kl': 85.07412719726562, 'objective/entropy': 36.73308563232422, 'objective/non_score_reward': -4.253706455230713, 'objective/rlhf_reward': -5.933367729187012, 'objective/scores': -1.6796610355377197, 'policy/approxkl_avg': 0.004621540661901236, 'policy/clipfrac_avg': 0.04245282709598541, 'loss/policy_avg': -0.021461617201566696, 'loss/value_avg': 0.2852725088596344, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.85099858045578, 'val/ratio': 0.9994845390319824, 'val/ratio_var': 5.68996199490357e-07, 'val/num_eos_tokens': 0, 'lr': 1.1734443900048995e-05, 'episode': 6252, 'epoch': 0.77}



                                                                         
 77%|███████▋  | 1564/2041 [2:15:55<41:20,  5.20s/it]

{'eps': 0, 'objective/kl': 108.1916732788086, 'objective/entropy': 42.81632995605469, 'objective/non_score_reward': -5.409583568572998, 'objective/rlhf_reward': -7.085023880004883, 'objective/scores': -1.6754403114318848, 'policy/approxkl_avg': 0.00539902551099658, 'policy/clipfrac_avg': 0.053066033869981766, 'loss/policy_avg': -0.024214189499616623, 'loss/value_avg': 0.4921116232872009, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8157957196235657, 'val/ratio': 0.9867919087409973, 'val/ratio_var': 0.00013838395534548908, 'val/num_eos_tokens': 0, 'lr': 1.1709946104850563e-05, 'episode': 6256, 'epoch': 0.77}



                                                                         
 77%|███████▋  | 1565/2041 [2:16:00<41:04,  5.18s/it]

{'eps': 0, 'objective/kl': 98.92237091064453, 'objective/entropy': 42.05418395996094, 'objective/non_score_reward': -4.946118354797363, 'objective/rlhf_reward': -7.042364120483398, 'objective/scores': -2.096245765686035, 'policy/approxkl_avg': 0.00404705572873354, 'policy/clipfrac_avg': 0.036556605249643326, 'loss/policy_avg': -0.017652658745646477, 'loss/value_avg': 0.5036566853523254, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7865657210350037, 'val/ratio': 0.9902801513671875, 'val/ratio_var': 8.163823076756671e-05, 'val/num_eos_tokens': 0, 'lr': 1.1685448309652131e-05, 'episode': 6260, 'epoch': 0.77}



                                                                         
 77%|███████▋  | 1566/2041 [2:16:05<40:55,  5.17s/it]

{'eps': 0, 'objective/kl': 91.10955047607422, 'objective/entropy': 35.83177947998047, 'objective/non_score_reward': -4.555477619171143, 'objective/rlhf_reward': -6.707437515258789, 'objective/scores': -2.1519601345062256, 'policy/approxkl_avg': 0.0029835677705705166, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.019689196720719337, 'loss/value_avg': 0.33966419100761414, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7582385540008545, 'val/ratio': 0.9858810901641846, 'val/ratio_var': 0.00018312888278160244, 'val/num_eos_tokens': 0, 'lr': 1.16609505144537e-05, 'episode': 6264, 'epoch': 0.77}



                                                                         
 77%|███████▋  | 1567/2041 [2:16:10<40:49,  5.17s/it]

{'eps': 0, 'objective/kl': 106.88552856445312, 'objective/entropy': 37.945716857910156, 'objective/non_score_reward': -5.344276428222656, 'objective/rlhf_reward': -6.9107255935668945, 'objective/scores': -1.5664494037628174, 'policy/approxkl_avg': 0.0021144081838428974, 'policy/clipfrac_avg': 0.029481131583452225, 'loss/policy_avg': -0.020786745473742485, 'loss/value_avg': 0.3774665594100952, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7706126570701599, 'val/ratio': 0.9889870882034302, 'val/ratio_var': 0.000113736889034044, 'val/num_eos_tokens': 0, 'lr': 1.1636452719255267e-05, 'episode': 6268, 'epoch': 0.77}



                                                                         
 77%|███████▋  | 1568/2041 [2:16:16<40:47,  5.17s/it]

{'eps': 0, 'objective/kl': 95.24554443359375, 'objective/entropy': 34.75111389160156, 'objective/non_score_reward': -4.762277126312256, 'objective/rlhf_reward': -6.724102020263672, 'objective/scores': -1.961824893951416, 'policy/approxkl_avg': 0.0035750982351601124, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.018560927361249924, 'loss/value_avg': 0.3975357413291931, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7526071667671204, 'val/ratio': 0.9982949495315552, 'val/ratio_var': 1.4878654610583908e-06, 'val/num_eos_tokens': 0, 'lr': 1.1611954924056835e-05, 'episode': 6272, 'epoch': 0.77}



                                                                         
 77%|███████▋  | 1569/2041 [2:16:21<40:36,  5.16s/it]

{'eps': 0, 'objective/kl': 104.62503051757812, 'objective/entropy': 36.92922592163086, 'objective/non_score_reward': -5.2312517166137695, 'objective/rlhf_reward': -7.019692897796631, 'objective/scores': -1.7884411811828613, 'policy/approxkl_avg': 0.004410756751894951, 'policy/clipfrac_avg': 0.03773584961891174, 'loss/policy_avg': -0.02222432568669319, 'loss/value_avg': 0.4395275115966797, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7641212344169617, 'val/ratio': 0.9903269410133362, 'val/ratio_var': 7.596683281008154e-05, 'val/num_eos_tokens': 0, 'lr': 1.1587457128858403e-05, 'episode': 6276, 'epoch': 0.77}



                                                                         
 77%|███████▋  | 1570/2041 [2:16:26<40:28,  5.16s/it]

{'eps': 0, 'objective/kl': 97.53767395019531, 'objective/entropy': 27.97789764404297, 'objective/non_score_reward': -4.8768839836120605, 'objective/rlhf_reward': -6.626749038696289, 'objective/scores': -1.7498650550842285, 'policy/approxkl_avg': 0.002151834312826395, 'policy/clipfrac_avg': 0.024764150381088257, 'loss/policy_avg': -0.017545606940984726, 'loss/value_avg': 0.44541722536087036, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.655668318271637, 'val/ratio': 0.9899436235427856, 'val/ratio_var': 0.00010130950977327302, 'val/num_eos_tokens': 0, 'lr': 1.1562959333659971e-05, 'episode': 6280, 'epoch': 0.77}



                                                                         
 77%|███████▋  | 1571/2041 [2:16:31<40:18,  5.15s/it]

{'eps': 0, 'objective/kl': 90.8221435546875, 'objective/entropy': 49.92934799194336, 'objective/non_score_reward': -4.541107177734375, 'objective/rlhf_reward': -6.6521453857421875, 'objective/scores': -2.1110384464263916, 'policy/approxkl_avg': 0.005022088065743446, 'policy/clipfrac_avg': 0.04716981202363968, 'loss/policy_avg': -0.022301778197288513, 'loss/value_avg': 0.37403547763824463, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8450506329536438, 'val/ratio': 0.9916939735412598, 'val/ratio_var': 4.3986441596644e-05, 'val/num_eos_tokens': 0, 'lr': 1.153846153846154e-05, 'episode': 6284, 'epoch': 0.77}



                                                                         
 77%|███████▋  | 1572/2041 [2:16:36<40:08,  5.14s/it]

{'eps': 0, 'objective/kl': 95.30630493164062, 'objective/entropy': 39.33379364013672, 'objective/non_score_reward': -4.765315055847168, 'objective/rlhf_reward': -6.2603654861450195, 'objective/scores': -1.4950504302978516, 'policy/approxkl_avg': 0.006212863605469465, 'policy/clipfrac_avg': 0.04599056392908096, 'loss/policy_avg': -0.02335137315094471, 'loss/value_avg': 0.3128427267074585, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.917588472366333, 'val/ratio': 0.9855054020881653, 'val/ratio_var': 0.00016920450434554368, 'val/num_eos_tokens': 0, 'lr': 1.1513963743263106e-05, 'episode': 6288, 'epoch': 0.77}



                                                                         
 77%|███████▋  | 1573/2041 [2:16:41<40:03,  5.14s/it]

{'eps': 0, 'objective/kl': 101.85226440429688, 'objective/entropy': 44.80628967285156, 'objective/non_score_reward': -5.092613220214844, 'objective/rlhf_reward': -6.90855598449707, 'objective/scores': -1.8159427642822266, 'policy/approxkl_avg': 0.012566053308546543, 'policy/clipfrac_avg': 0.05070754885673523, 'loss/policy_avg': -0.023433707654476166, 'loss/value_avg': 0.422051340341568, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7603262662887573, 'val/ratio': 0.9870889186859131, 'val/ratio_var': 0.00011936947703361511, 'val/num_eos_tokens': 0, 'lr': 1.1489465948064674e-05, 'episode': 6292, 'epoch': 0.77}



                                                                         
 77%|███████▋  | 1574/2041 [2:16:46<39:58,  5.14s/it]

{'eps': 0, 'objective/kl': 94.39387512207031, 'objective/entropy': 35.5130615234375, 'objective/non_score_reward': -4.719693660736084, 'objective/rlhf_reward': -6.79918909072876, 'objective/scores': -2.079495429992676, 'policy/approxkl_avg': 0.005275895819067955, 'policy/clipfrac_avg': 0.05070754513144493, 'loss/policy_avg': -0.022397970780730247, 'loss/value_avg': 0.39770618081092834, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6811263561248779, 'val/ratio': 0.9925200343132019, 'val/ratio_var': 5.7436322094872594e-05, 'val/num_eos_tokens': 0, 'lr': 1.1464968152866242e-05, 'episode': 6296, 'epoch': 0.77}



                                                                         
 77%|███████▋  | 1575/2041 [2:16:52<39:55,  5.14s/it]

{'eps': 0, 'objective/kl': 96.19271850585938, 'objective/entropy': 34.700103759765625, 'objective/non_score_reward': -4.809636116027832, 'objective/rlhf_reward': -6.776798248291016, 'objective/scores': -1.9671621322631836, 'policy/approxkl_avg': 0.003532509785145521, 'policy/clipfrac_avg': 0.028301887214183807, 'loss/policy_avg': -0.019127819687128067, 'loss/value_avg': 0.4035646915435791, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7027597427368164, 'val/ratio': 0.9943531155586243, 'val/ratio_var': 3.4675416827667505e-05, 'val/num_eos_tokens': 0, 'lr': 1.1440470357667812e-05, 'episode': 6300, 'epoch': 0.77}



                                                                         
 77%|███████▋  | 1576/2041 [2:16:57<39:50,  5.14s/it]

{'eps': 0, 'objective/kl': 83.35289001464844, 'objective/entropy': 43.732852935791016, 'objective/non_score_reward': -4.167644500732422, 'objective/rlhf_reward': -6.506402969360352, 'objective/scores': -2.3387584686279297, 'policy/approxkl_avg': 0.02658267505466938, 'policy/clipfrac_avg': 0.06603773683309555, 'loss/policy_avg': -0.02753269299864769, 'loss/value_avg': 0.2625240683555603, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7383516430854797, 'val/ratio': 0.9886207580566406, 'val/ratio_var': 8.913603232940659e-05, 'val/num_eos_tokens': 0, 'lr': 1.1415972562469378e-05, 'episode': 6304, 'epoch': 0.77}



                                                                         
 77%|███████▋  | 1577/2041 [2:17:02<39:51,  5.15s/it]

{'eps': 0, 'objective/kl': 91.95332336425781, 'objective/entropy': 33.384273529052734, 'objective/non_score_reward': -4.597666263580322, 'objective/rlhf_reward': -6.0501580238342285, 'objective/scores': -1.4524916410446167, 'policy/approxkl_avg': 0.013933630660176277, 'policy/clipfrac_avg': 0.036556605249643326, 'loss/policy_avg': -0.02180936001241207, 'loss/value_avg': 0.32515496015548706, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7719911336898804, 'val/ratio': 0.9847897291183472, 'val/ratio_var': 0.00019037001766264439, 'val/num_eos_tokens': 0, 'lr': 1.1391474767270946e-05, 'episode': 6308, 'epoch': 0.77}



                                                                         
 77%|███████▋  | 1578/2041 [2:17:07<39:59,  5.18s/it]

{'eps': 0, 'objective/kl': 80.62741088867188, 'objective/entropy': 42.067710876464844, 'objective/non_score_reward': -4.031371116638184, 'objective/rlhf_reward': -7.169643402099609, 'objective/scores': -3.138272523880005, 'policy/approxkl_avg': 0.008277571760118008, 'policy/clipfrac_avg': 0.040094342082738876, 'loss/policy_avg': -0.02464645728468895, 'loss/value_avg': 0.3773745596408844, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7817498445510864, 'val/ratio': 0.9950723648071289, 'val/ratio_var': 1.3140412193024531e-05, 'val/num_eos_tokens': 0, 'lr': 1.1366976972072514e-05, 'episode': 6312, 'epoch': 0.77}



                                                                         
 77%|███████▋  | 1579/2041 [2:17:12<39:47,  5.17s/it]

{'eps': 0, 'objective/kl': 85.1314926147461, 'objective/entropy': 38.954925537109375, 'objective/non_score_reward': -4.256574630737305, 'objective/rlhf_reward': -6.29105281829834, 'objective/scores': -2.0344784259796143, 'policy/approxkl_avg': 0.007501253858208656, 'policy/clipfrac_avg': 0.041273586452007294, 'loss/policy_avg': -0.02280361019074917, 'loss/value_avg': 0.30613648891448975, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7604662179946899, 'val/ratio': 0.9852075576782227, 'val/ratio_var': 0.000178407397470437, 'val/num_eos_tokens': 0, 'lr': 1.1342479176874082e-05, 'episode': 6316, 'epoch': 0.77}



                                                                         
 77%|███████▋  | 1580/2041 [2:17:17<39:47,  5.18s/it]

{'eps': 0, 'objective/kl': 97.73937225341797, 'objective/entropy': 65.0094223022461, 'objective/non_score_reward': -4.886968612670898, 'objective/rlhf_reward': -7.655242919921875, 'objective/scores': -2.7682743072509766, 'policy/approxkl_avg': 0.007656426634639502, 'policy/clipfrac_avg': 0.06014150753617287, 'loss/policy_avg': -0.0314314030110836, 'loss/value_avg': 0.5029760599136353, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1878297328948975, 'val/ratio': 0.9831243753433228, 'val/ratio_var': 0.00024021218996495008, 'val/num_eos_tokens': 0, 'lr': 1.1317981381675649e-05, 'episode': 6320, 'epoch': 0.77}



                                                                         
 77%|███████▋  | 1581/2041 [2:17:23<39:47,  5.19s/it]

{'eps': 0, 'objective/kl': 94.79995727539062, 'objective/entropy': 37.334373474121094, 'objective/non_score_reward': -4.739997386932373, 'objective/rlhf_reward': -6.522243499755859, 'objective/scores': -1.7822459936141968, 'policy/approxkl_avg': 0.003746102564036846, 'policy/clipfrac_avg': 0.04009433835744858, 'loss/policy_avg': -0.020429065451025963, 'loss/value_avg': 0.3281420171260834, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7644275426864624, 'val/ratio': 0.9899532794952393, 'val/ratio_var': 7.948675192892551e-05, 'val/num_eos_tokens': 0, 'lr': 1.1293483586477217e-05, 'episode': 6324, 'epoch': 0.77}



                                                                         
 78%|███████▊  | 1582/2041 [2:17:28<39:37,  5.18s/it]

{'eps': 0, 'objective/kl': 87.05641174316406, 'objective/entropy': 42.7182731628418, 'objective/non_score_reward': -4.352821350097656, 'objective/rlhf_reward': -6.32419490814209, 'objective/scores': -1.971373438835144, 'policy/approxkl_avg': 0.039756279438734055, 'policy/clipfrac_avg': 0.06721698492765427, 'loss/policy_avg': -0.02947884425520897, 'loss/value_avg': 0.3212197422981262, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9120686054229736, 'val/ratio': 0.9858853220939636, 'val/ratio_var': 0.00013402351760305464, 'val/num_eos_tokens': 0, 'lr': 1.1268985791278786e-05, 'episode': 6328, 'epoch': 0.78}



                                                                         
 78%|███████▊  | 1583/2041 [2:17:33<39:23,  5.16s/it]

{'eps': 0, 'objective/kl': 96.05084991455078, 'objective/entropy': 37.11017990112305, 'objective/non_score_reward': -4.802542209625244, 'objective/rlhf_reward': -7.020930767059326, 'objective/scores': -2.218388557434082, 'policy/approxkl_avg': 0.0118637690320611, 'policy/clipfrac_avg': 0.03773584961891174, 'loss/policy_avg': -0.022912530228495598, 'loss/value_avg': 0.36376798152923584, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.722637414932251, 'val/ratio': 0.9854202270507812, 'val/ratio_var': 0.0001644221629248932, 'val/num_eos_tokens': 0, 'lr': 1.1244487996080354e-05, 'episode': 6332, 'epoch': 0.78}



                                                                         
 78%|███████▊  | 1584/2041 [2:17:38<39:13,  5.15s/it]

{'eps': 0, 'objective/kl': 95.62400817871094, 'objective/entropy': 44.20208740234375, 'objective/non_score_reward': -4.781200408935547, 'objective/rlhf_reward': -7.180566787719727, 'objective/scores': -2.3993661403656006, 'policy/approxkl_avg': 0.006942890118807554, 'policy/clipfrac_avg': 0.05070754885673523, 'loss/policy_avg': -0.02515486627817154, 'loss/value_avg': 0.46722060441970825, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8104777336120605, 'val/ratio': 0.9887480735778809, 'val/ratio_var': 0.00010631203622324392, 'val/num_eos_tokens': 0, 'lr': 1.121999020088192e-05, 'episode': 6336, 'epoch': 0.78}



                                                                         
 78%|███████▊  | 1585/2041 [2:17:43<39:12,  5.16s/it]

{'eps': 0, 'objective/kl': 104.79148864746094, 'objective/entropy': 47.44059371948242, 'objective/non_score_reward': -5.239574432373047, 'objective/rlhf_reward': -7.631872177124023, 'objective/scores': -2.3922977447509766, 'policy/approxkl_avg': 0.0036480207927525043, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.01876647397875786, 'loss/value_avg': 0.5745213031768799, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9196996688842773, 'val/ratio': 0.9927934408187866, 'val/ratio_var': 4.859932596446015e-05, 'val/num_eos_tokens': 0, 'lr': 1.1195492405683489e-05, 'episode': 6340, 'epoch': 0.78}



                                                                         
 78%|███████▊  | 1586/2041 [2:17:48<39:02,  5.15s/it]

{'eps': 0, 'objective/kl': 83.26397705078125, 'objective/entropy': 29.728912353515625, 'objective/non_score_reward': -4.163198471069336, 'objective/rlhf_reward': -6.3102521896362305, 'objective/scores': -2.1470534801483154, 'policy/approxkl_avg': 0.004241916351020336, 'policy/clipfrac_avg': 0.03183962404727936, 'loss/policy_avg': -0.018592743203043938, 'loss/value_avg': 0.2934214770793915, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5227476358413696, 'val/ratio': 0.9878954887390137, 'val/ratio_var': 0.0001243871811311692, 'val/num_eos_tokens': 0, 'lr': 1.1170994610485057e-05, 'episode': 6344, 'epoch': 0.78}



                                                                         
 78%|███████▊  | 1587/2041 [2:17:54<38:55,  5.14s/it]

{'eps': 0, 'objective/kl': 107.43779754638672, 'objective/entropy': 48.04630661010742, 'objective/non_score_reward': -5.371890068054199, 'objective/rlhf_reward': -7.4345855712890625, 'objective/scores': -2.0626957416534424, 'policy/approxkl_avg': 0.010505946353077888, 'policy/clipfrac_avg': 0.06367924809455872, 'loss/policy_avg': -0.030647367238998413, 'loss/value_avg': 0.4955427646636963, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9278036952018738, 'val/ratio': 1.008334994316101, 'val/ratio_var': 5.403606701293029e-05, 'val/num_eos_tokens': 0, 'lr': 1.1146496815286625e-05, 'episode': 6348, 'epoch': 0.78}



                                                                         
 78%|███████▊  | 1588/2041 [2:17:59<38:50,  5.14s/it]

{'eps': 0, 'objective/kl': 91.08470916748047, 'objective/entropy': 48.30026626586914, 'objective/non_score_reward': -4.554235458374023, 'objective/rlhf_reward': -6.700831413269043, 'objective/scores': -2.1465959548950195, 'policy/approxkl_avg': 0.007391311693936586, 'policy/clipfrac_avg': 0.041273586452007294, 'loss/policy_avg': -0.021396145224571228, 'loss/value_avg': 0.2998693585395813, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8786216378211975, 'val/ratio': 0.9923110008239746, 'val/ratio_var': 4.861770139541477e-05, 'val/num_eos_tokens': 0, 'lr': 1.1121999020088193e-05, 'episode': 6352, 'epoch': 0.78}



                                                                         
 78%|███████▊  | 1589/2041 [2:18:04<38:50,  5.16s/it]

{'eps': 0, 'objective/kl': 77.09893798828125, 'objective/entropy': 39.519596099853516, 'objective/non_score_reward': -3.854947090148926, 'objective/rlhf_reward': -6.314826965332031, 'objective/scores': -2.4598796367645264, 'policy/approxkl_avg': 0.00439054612070322, 'policy/clipfrac_avg': 0.04009433835744858, 'loss/policy_avg': -0.02089950256049633, 'loss/value_avg': 0.37875688076019287, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8326995968818665, 'val/ratio': 0.9962789416313171, 'val/ratio_var': 1.196956782223424e-05, 'val/num_eos_tokens': 0, 'lr': 1.109750122488976e-05, 'episode': 6356, 'epoch': 0.78}



                                                                         
 78%|███████▊  | 1590/2041 [2:18:09<38:43,  5.15s/it]

{'eps': 0, 'objective/kl': 90.6966552734375, 'objective/entropy': 59.33885955810547, 'objective/non_score_reward': -4.534832954406738, 'objective/rlhf_reward': -6.297115802764893, 'objective/scores': -1.7622828483581543, 'policy/approxkl_avg': 0.007470135111361742, 'policy/clipfrac_avg': 0.06839622557163239, 'loss/policy_avg': -0.0327182337641716, 'loss/value_avg': 0.2775002717971802, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.977149486541748, 'val/ratio': 0.9808063507080078, 'val/ratio_var': 0.0002700086042750627, 'val/num_eos_tokens': 0, 'lr': 1.1073003429691329e-05, 'episode': 6360, 'epoch': 0.78}



                                                                         
 78%|███████▊  | 1591/2041 [2:18:14<38:39,  5.15s/it]

{'eps': 0, 'objective/kl': 99.80792999267578, 'objective/entropy': 44.267982482910156, 'objective/non_score_reward': -4.990396499633789, 'objective/rlhf_reward': -7.170113563537598, 'objective/scores': -2.1797173023223877, 'policy/approxkl_avg': 0.003945223521441221, 'policy/clipfrac_avg': 0.03537736088037491, 'loss/policy_avg': -0.021566884592175484, 'loss/value_avg': 0.36384057998657227, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9000776410102844, 'val/ratio': 0.9945428967475891, 'val/ratio_var': 3.638868292910047e-05, 'val/num_eos_tokens': 0, 'lr': 1.1048505634492897e-05, 'episode': 6364, 'epoch': 0.78}



                                                                         
 78%|███████▊  | 1592/2041 [2:18:19<38:38,  5.16s/it]

{'eps': 0, 'objective/kl': 100.11785888671875, 'objective/entropy': 51.094966888427734, 'objective/non_score_reward': -5.005893230438232, 'objective/rlhf_reward': -7.471200466156006, 'objective/scores': -2.4653072357177734, 'policy/approxkl_avg': 0.004803301766514778, 'policy/clipfrac_avg': 0.053066033869981766, 'loss/policy_avg': -0.027205629274249077, 'loss/value_avg': 0.39529040455818176, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.90830397605896, 'val/ratio': 0.9932422637939453, 'val/ratio_var': 3.493190524750389e-05, 'val/num_eos_tokens': 0, 'lr': 1.1024007839294465e-05, 'episode': 6368, 'epoch': 0.78}



                                                                         
 78%|███████▊  | 1593/2041 [2:18:24<38:29,  5.16s/it]

{'eps': 0, 'objective/kl': 97.57553100585938, 'objective/entropy': 45.11829376220703, 'objective/non_score_reward': -4.878776550292969, 'objective/rlhf_reward': -7.075226306915283, 'objective/scores': -2.1964497566223145, 'policy/approxkl_avg': 0.011270033195614815, 'policy/clipfrac_avg': 0.04599056765437126, 'loss/policy_avg': -0.025777781382203102, 'loss/value_avg': 0.42859339714050293, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7436398267745972, 'val/ratio': 1.0029923915863037, 'val/ratio_var': 3.670592559501529e-05, 'val/num_eos_tokens': 0, 'lr': 1.0999510044096031e-05, 'episode': 6372, 'epoch': 0.78}



                                                                         
 78%|███████▊  | 1594/2041 [2:18:30<38:12,  5.13s/it]

{'eps': 0, 'objective/kl': 95.95450592041016, 'objective/entropy': 46.46821594238281, 'objective/non_score_reward': -4.797725677490234, 'objective/rlhf_reward': -6.728102684020996, 'objective/scores': -1.9303768873214722, 'policy/approxkl_avg': 0.003076458815485239, 'policy/clipfrac_avg': 0.03773584961891174, 'loss/policy_avg': -0.02171752229332924, 'loss/value_avg': 0.32455092668533325, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9102108478546143, 'val/ratio': 0.9931896924972534, 'val/ratio_var': 4.8010508180595934e-05, 'val/num_eos_tokens': 0, 'lr': 1.09750122488976e-05, 'episode': 6376, 'epoch': 0.78}



                                                                         
 78%|███████▊  | 1595/2041 [2:18:35<38:00,  5.11s/it]

{'eps': 0, 'objective/kl': 113.24671936035156, 'objective/entropy': 35.428741455078125, 'objective/non_score_reward': -5.662335395812988, 'objective/rlhf_reward': -7.500728130340576, 'objective/scores': -1.838392734527588, 'policy/approxkl_avg': 0.002566657727584243, 'policy/clipfrac_avg': 0.025943396613001823, 'loss/policy_avg': -0.017693569883704185, 'loss/value_avg': 0.5594648122787476, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8648480176925659, 'val/ratio': 0.9912756681442261, 'val/ratio_var': 7.262390136020258e-05, 'val/num_eos_tokens': 0, 'lr': 1.0950514453699168e-05, 'episode': 6380, 'epoch': 0.78}



                                                                         
 78%|███████▊  | 1596/2041 [2:18:40<37:46,  5.09s/it]

{'eps': 0, 'objective/kl': 88.77874755859375, 'objective/entropy': 46.27500534057617, 'objective/non_score_reward': -4.438937664031982, 'objective/rlhf_reward': -6.471676349639893, 'objective/scores': -2.03273868560791, 'policy/approxkl_avg': 0.005411106627434492, 'policy/clipfrac_avg': 0.049528300762176514, 'loss/policy_avg': -0.026408245787024498, 'loss/value_avg': 0.22139054536819458, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7981424331665039, 'val/ratio': 0.9868800640106201, 'val/ratio_var': 0.000149246581713669, 'val/num_eos_tokens': 0, 'lr': 1.0926016658500736e-05, 'episode': 6384, 'epoch': 0.78}



                                                                         
 78%|███████▊  | 1597/2041 [2:18:45<37:45,  5.10s/it]

{'eps': 0, 'objective/kl': 79.39063262939453, 'objective/entropy': 46.574859619140625, 'objective/non_score_reward': -3.969532012939453, 'objective/rlhf_reward': -6.571924209594727, 'objective/scores': -2.6023921966552734, 'policy/approxkl_avg': 0.004220940172672272, 'policy/clipfrac_avg': 0.037735845893621445, 'loss/policy_avg': -0.024524852633476257, 'loss/value_avg': 0.2940085530281067, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8961823582649231, 'val/ratio': 0.9834639430046082, 'val/ratio_var': 0.00023318208695854992, 'val/num_eos_tokens': 0, 'lr': 1.0901518863302302e-05, 'episode': 6388, 'epoch': 0.78}



                                                                         
 78%|███████▊  | 1598/2041 [2:18:50<37:46,  5.12s/it]

{'eps': 0, 'objective/kl': 92.02120208740234, 'objective/entropy': 46.60264587402344, 'objective/non_score_reward': -4.601059913635254, 'objective/rlhf_reward': -7.1624298095703125, 'objective/scores': -2.5613696575164795, 'policy/approxkl_avg': 0.00386862107552588, 'policy/clipfrac_avg': 0.04363207519054413, 'loss/policy_avg': -0.022226249799132347, 'loss/value_avg': 0.34330832958221436, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8354418277740479, 'val/ratio': 0.9914358854293823, 'val/ratio_var': 6.822533759986982e-05, 'val/num_eos_tokens': 0, 'lr': 1.0877021068103872e-05, 'episode': 6392, 'epoch': 0.78}



                                                                         
 78%|███████▊  | 1599/2041 [2:18:55<37:53,  5.14s/it]

{'eps': 0, 'objective/kl': 87.21576690673828, 'objective/entropy': 42.34186553955078, 'objective/non_score_reward': -4.360788345336914, 'objective/rlhf_reward': -6.856751441955566, 'objective/scores': -2.4959630966186523, 'policy/approxkl_avg': 0.0046418155543506145, 'policy/clipfrac_avg': 0.03891509771347046, 'loss/policy_avg': -0.022050730884075165, 'loss/value_avg': 0.33334994316101074, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.836178719997406, 'val/ratio': 0.9998916387557983, 'val/ratio_var': 4.2609460138010036e-07, 'val/num_eos_tokens': 0, 'lr': 1.085252327290544e-05, 'episode': 6396, 'epoch': 0.78}



                                                                         
 78%|███████▊  | 1600/2041 [2:19:00<37:49,  5.15s/it]

{'eps': 0, 'objective/kl': 86.40196228027344, 'objective/entropy': 47.74852752685547, 'objective/non_score_reward': -4.320098400115967, 'objective/rlhf_reward': -6.841716766357422, 'objective/scores': -2.521618366241455, 'policy/approxkl_avg': 0.024205513298511505, 'policy/clipfrac_avg': 0.048349060118198395, 'loss/policy_avg': -0.026325609534978867, 'loss/value_avg': 0.3591892123222351, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8686374425888062, 'val/ratio': 0.9861181974411011, 'val/ratio_var': 0.00013249574112705886, 'val/num_eos_tokens': 0, 'lr': 1.0828025477707008e-05, 'episode': 6400, 'epoch': 0.78}



                                                                         
 78%|███████▊  | 1601/2041 [2:19:05<37:36,  5.13s/it]

{'eps': 0, 'objective/kl': 86.40653228759766, 'objective/entropy': 49.34764862060547, 'objective/non_score_reward': -4.320326328277588, 'objective/rlhf_reward': -6.453681945800781, 'objective/scores': -2.1333553791046143, 'policy/approxkl_avg': 0.005859304219484329, 'policy/clipfrac_avg': 0.0554245263338089, 'loss/policy_avg': -0.025957681238651276, 'loss/value_avg': 0.3124919831752777, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8427499532699585, 'val/ratio': 0.9913961887359619, 'val/ratio_var': 5.756008249591105e-05, 'val/num_eos_tokens': 0, 'lr': 1.0803527682508574e-05, 'episode': 6404, 'epoch': 0.78}



                                                                         
 78%|███████▊  | 1602/2041 [2:19:10<37:27,  5.12s/it]

{'eps': 0, 'objective/kl': 89.22013092041016, 'objective/entropy': 41.700218200683594, 'objective/non_score_reward': -4.461006164550781, 'objective/rlhf_reward': -7.048707008361816, 'objective/scores': -2.587700843811035, 'policy/approxkl_avg': 0.023033972829580307, 'policy/clipfrac_avg': 0.05896226689219475, 'loss/policy_avg': -0.026721768081188202, 'loss/value_avg': 0.33551454544067383, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7768877148628235, 'val/ratio': 1.066063404083252, 'val/ratio_var': 0.005673369858413935, 'val/num_eos_tokens': 0, 'lr': 1.0779029887310142e-05, 'episode': 6408, 'epoch': 0.79}



                                                                         
 79%|███████▊  | 1603/2041 [2:19:16<37:14,  5.10s/it]

{'eps': 0, 'objective/kl': 95.43138885498047, 'objective/entropy': 50.09431076049805, 'objective/non_score_reward': -4.77156925201416, 'objective/rlhf_reward': -5.786983966827393, 'objective/scores': -1.0154147148132324, 'policy/approxkl_avg': 0.005355847999453545, 'policy/clipfrac_avg': 0.0554245263338089, 'loss/policy_avg': -0.026041695848107338, 'loss/value_avg': 0.29747819900512695, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7683156728744507, 'val/ratio': 0.9826638698577881, 'val/ratio_var': 0.00024134489649441093, 'val/num_eos_tokens': 0, 'lr': 1.075453209211171e-05, 'episode': 6412, 'epoch': 0.79}



                                                                         
 79%|███████▊  | 1604/2041 [2:19:21<37:14,  5.11s/it]

{'eps': 0, 'objective/kl': 95.30192565917969, 'objective/entropy': 34.482887268066406, 'objective/non_score_reward': -4.765096187591553, 'objective/rlhf_reward': -6.895757675170898, 'objective/scores': -2.1306612491607666, 'policy/approxkl_avg': 0.004594793543219566, 'policy/clipfrac_avg': 0.03655660152435303, 'loss/policy_avg': -0.021840950474143028, 'loss/value_avg': 0.37520045042037964, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7350928783416748, 'val/ratio': 0.9839835166931152, 'val/ratio_var': 0.00022647122386842966, 'val/num_eos_tokens': 0, 'lr': 1.0730034296913278e-05, 'episode': 6416, 'epoch': 0.79}



                                                                         
 79%|███████▊  | 1605/2041 [2:19:26<37:08,  5.11s/it]

{'eps': 0, 'objective/kl': 95.55103302001953, 'objective/entropy': 40.8900260925293, 'objective/non_score_reward': -4.777551651000977, 'objective/rlhf_reward': -6.843773365020752, 'objective/scores': -2.0662217140197754, 'policy/approxkl_avg': 0.005479097366333008, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.022482672706246376, 'loss/value_avg': 0.26228469610214233, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8324151039123535, 'val/ratio': 0.9959891438484192, 'val/ratio_var': 1.1014875781256706e-05, 'val/num_eos_tokens': 0, 'lr': 1.0705536501714846e-05, 'episode': 6420, 'epoch': 0.79}



                                                                         
 79%|███████▊  | 1606/2041 [2:19:31<37:13,  5.13s/it]

{'eps': 0, 'objective/kl': 91.10606384277344, 'objective/entropy': 38.150203704833984, 'objective/non_score_reward': -4.55530309677124, 'objective/rlhf_reward': -6.635978698730469, 'objective/scores': -2.0806758403778076, 'policy/approxkl_avg': 0.013891441747546196, 'policy/clipfrac_avg': 0.06132075935602188, 'loss/policy_avg': -0.02731746807694435, 'loss/value_avg': 0.30347737669944763, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7538639307022095, 'val/ratio': 0.9826321601867676, 'val/ratio_var': 0.00020362909708637744, 'val/num_eos_tokens': 0, 'lr': 1.0681038706516414e-05, 'episode': 6424, 'epoch': 0.79}



                                                                         
 79%|███████▊  | 1607/2041 [2:19:36<37:14,  5.15s/it]

{'eps': 0, 'objective/kl': 97.21562194824219, 'objective/entropy': 33.73915100097656, 'objective/non_score_reward': -4.860781669616699, 'objective/rlhf_reward': -6.873224258422852, 'objective/scores': -2.0124425888061523, 'policy/approxkl_avg': 0.0028316620737314224, 'policy/clipfrac_avg': 0.02712264284491539, 'loss/policy_avg': -0.014355077408254147, 'loss/value_avg': 0.29392945766448975, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6718178391456604, 'val/ratio': 0.9875024557113647, 'val/ratio_var': 0.00014055032806936651, 'val/num_eos_tokens': 0, 'lr': 1.0656540911317982e-05, 'episode': 6428, 'epoch': 0.79}



                                                                         
 79%|███████▉  | 1608/2041 [2:19:41<37:17,  5.17s/it]

{'eps': 0, 'objective/kl': 90.26820373535156, 'objective/entropy': 39.776126861572266, 'objective/non_score_reward': -4.513410568237305, 'objective/rlhf_reward': -7.1309709548950195, 'objective/scores': -2.6175601482391357, 'policy/approxkl_avg': 0.0053796665742993355, 'policy/clipfrac_avg': 0.049528300762176514, 'loss/policy_avg': -0.025461087003350258, 'loss/value_avg': 0.3510269522666931, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8215295076370239, 'val/ratio': 0.9919072985649109, 'val/ratio_var': 5.035743015469052e-05, 'val/num_eos_tokens': 0, 'lr': 1.063204311611955e-05, 'episode': 6432, 'epoch': 0.79}



                                                                         
 79%|███████▉  | 1609/2041 [2:19:46<37:00,  5.14s/it]

{'eps': 0, 'objective/kl': 98.12969970703125, 'objective/entropy': 41.836570739746094, 'objective/non_score_reward': -4.906484603881836, 'objective/rlhf_reward': -6.25369930267334, 'objective/scores': -1.347214698791504, 'policy/approxkl_avg': 0.007185089867562056, 'policy/clipfrac_avg': 0.056603770703077316, 'loss/policy_avg': -0.02248449996113777, 'loss/value_avg': 0.4403943121433258, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7980598211288452, 'val/ratio': 0.9952996373176575, 'val/ratio_var': 1.8098919099429622e-05, 'val/num_eos_tokens': 0, 'lr': 1.0607545320921118e-05, 'episode': 6436, 'epoch': 0.79}



                                                                         
 79%|███████▉  | 1610/2041 [2:19:52<37:01,  5.16s/it]

{'eps': 0, 'objective/kl': 93.83872985839844, 'objective/entropy': 40.89452362060547, 'objective/non_score_reward': -4.691936492919922, 'objective/rlhf_reward': -6.438226222991943, 'objective/scores': -1.746289849281311, 'policy/approxkl_avg': 0.011621212586760521, 'policy/clipfrac_avg': 0.036556605249643326, 'loss/policy_avg': -0.018682776018977165, 'loss/value_avg': 0.5160543918609619, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6923353672027588, 'val/ratio': 0.9852519035339355, 'val/ratio_var': 0.0001489800779381767, 'val/num_eos_tokens': 0, 'lr': 1.0583047525722685e-05, 'episode': 6440, 'epoch': 0.79}



                                                                         
 79%|███████▉  | 1611/2041 [2:19:57<36:52,  5.14s/it]

{'eps': 0, 'objective/kl': 103.14771270751953, 'objective/entropy': 51.20302200317383, 'objective/non_score_reward': -5.15738582611084, 'objective/rlhf_reward': -7.42670202255249, 'objective/scores': -2.2693161964416504, 'policy/approxkl_avg': 0.012692101299762726, 'policy/clipfrac_avg': 0.06603773683309555, 'loss/policy_avg': -0.029974544420838356, 'loss/value_avg': 0.5097476243972778, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8895857334136963, 'val/ratio': 0.9830560088157654, 'val/ratio_var': 0.00021278818894643337, 'val/num_eos_tokens': 0, 'lr': 1.0558549730524253e-05, 'episode': 6444, 'epoch': 0.79}



                                                                         
 79%|███████▉  | 1612/2041 [2:20:02<36:55,  5.16s/it]

{'eps': 0, 'objective/kl': 101.293212890625, 'objective/entropy': 44.61924743652344, 'objective/non_score_reward': -5.064661026000977, 'objective/rlhf_reward': -6.9690775871276855, 'objective/scores': -1.904416561126709, 'policy/approxkl_avg': 0.003708575153723359, 'policy/clipfrac_avg': 0.03655660152435303, 'loss/policy_avg': -0.020967254415154457, 'loss/value_avg': 0.3897496461868286, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8117151856422424, 'val/ratio': 0.9872510433197021, 'val/ratio_var': 0.00013042063801549375, 'val/num_eos_tokens': 0, 'lr': 1.0534051935325821e-05, 'episode': 6448, 'epoch': 0.79}



                                                                         
 79%|███████▉  | 1613/2041 [2:20:07<36:55,  5.18s/it]

{'eps': 0, 'objective/kl': 94.7459945678711, 'objective/entropy': 41.622535705566406, 'objective/non_score_reward': -4.737299919128418, 'objective/rlhf_reward': -6.7818169593811035, 'objective/scores': -2.0445170402526855, 'policy/approxkl_avg': 0.012624997645616531, 'policy/clipfrac_avg': 0.06485848873853683, 'loss/policy_avg': -0.03036000207066536, 'loss/value_avg': 0.31248265504837036, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7576956748962402, 'val/ratio': 0.9728494882583618, 'val/ratio_var': 0.0005863905535079539, 'val/num_eos_tokens': 0, 'lr': 1.0509554140127389e-05, 'episode': 6452, 'epoch': 0.79}



                                                                         
 79%|███████▉  | 1614/2041 [2:20:12<36:37,  5.15s/it]

{'eps': 0, 'objective/kl': 94.7372055053711, 'objective/entropy': 44.586708068847656, 'objective/non_score_reward': -4.736860275268555, 'objective/rlhf_reward': -6.957919120788574, 'objective/scores': -2.2210590839385986, 'policy/approxkl_avg': 0.00603079330176115, 'policy/clipfrac_avg': 0.049528300762176514, 'loss/policy_avg': -0.022238366305828094, 'loss/value_avg': 0.341400682926178, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7800964117050171, 'val/ratio': 0.997481107711792, 'val/ratio_var': 3.0568880902137607e-06, 'val/num_eos_tokens': 0, 'lr': 1.0485056344928957e-05, 'episode': 6456, 'epoch': 0.79}



                                                                         
 79%|███████▉  | 1615/2041 [2:20:17<36:26,  5.13s/it]

{'eps': 0, 'objective/kl': 96.16313171386719, 'objective/entropy': 64.92337799072266, 'objective/non_score_reward': -4.808156490325928, 'objective/rlhf_reward': -6.929135322570801, 'objective/scores': -2.120978832244873, 'policy/approxkl_avg': 0.039169684052467346, 'policy/clipfrac_avg': 0.0837264135479927, 'loss/policy_avg': -0.035828497260808945, 'loss/value_avg': 0.4722737967967987, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2406564950942993, 'val/ratio': 0.9934180378913879, 'val/ratio_var': 2.0278528609196655e-05, 'val/num_eos_tokens': 0, 'lr': 1.0460558549730525e-05, 'episode': 6460, 'epoch': 0.79}



                                                                         
 79%|███████▉  | 1616/2041 [2:20:22<36:23,  5.14s/it]

{'eps': 0, 'objective/kl': 98.76008605957031, 'objective/entropy': 39.58154296875, 'objective/non_score_reward': -4.938004016876221, 'objective/rlhf_reward': -7.043997764587402, 'objective/scores': -2.1059939861297607, 'policy/approxkl_avg': 0.007294925861060619, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.018366575241088867, 'loss/value_avg': 0.5466122627258301, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7559709548950195, 'val/ratio': 1.0086698532104492, 'val/ratio_var': 6.90218003001064e-05, 'val/num_eos_tokens': 0, 'lr': 1.0436060754532093e-05, 'episode': 6464, 'epoch': 0.79}



                                                                         
 79%|███████▉  | 1617/2041 [2:20:28<36:15,  5.13s/it]

{'eps': 0, 'objective/kl': 101.207763671875, 'objective/entropy': 41.71404266357422, 'objective/non_score_reward': -5.060388088226318, 'objective/rlhf_reward': -7.2077155113220215, 'objective/scores': -2.147327423095703, 'policy/approxkl_avg': 0.003125745803117752, 'policy/clipfrac_avg': 0.04363207519054413, 'loss/policy_avg': -0.022190380841493607, 'loss/value_avg': 0.4039560556411743, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8459399938583374, 'val/ratio': 0.9929484128952026, 'val/ratio_var': 4.369648013380356e-05, 'val/num_eos_tokens': 0, 'lr': 1.0411562959333661e-05, 'episode': 6468, 'epoch': 0.79}



                                                                         
 79%|███████▉  | 1618/2041 [2:20:33<36:08,  5.13s/it]

{'eps': 0, 'objective/kl': 84.60260009765625, 'objective/entropy': 49.35551452636719, 'objective/non_score_reward': -4.230130195617676, 'objective/rlhf_reward': -6.245879173278809, 'objective/scores': -2.0157487392425537, 'policy/approxkl_avg': 0.006507735699415207, 'policy/clipfrac_avg': 0.05070754885673523, 'loss/policy_avg': -0.027302086353302002, 'loss/value_avg': 0.2517485022544861, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.856479287147522, 'val/ratio': 0.9854421615600586, 'val/ratio_var': 0.00016126509581226856, 'val/num_eos_tokens': 0, 'lr': 1.0387065164135229e-05, 'episode': 6472, 'epoch': 0.79}



                                                                         
 79%|███████▉  | 1619/2041 [2:20:38<36:12,  5.15s/it]

{'eps': 0, 'objective/kl': 99.50762939453125, 'objective/entropy': 47.71846389770508, 'objective/non_score_reward': -4.975381851196289, 'objective/rlhf_reward': -6.920823097229004, 'objective/scores': -1.9454410076141357, 'policy/approxkl_avg': 0.006830700673162937, 'policy/clipfrac_avg': 0.044811323285102844, 'loss/policy_avg': -0.021366722881793976, 'loss/value_avg': 0.4216136932373047, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8230078220367432, 'val/ratio': 0.9896613955497742, 'val/ratio_var': 7.500822539441288e-05, 'val/num_eos_tokens': 0, 'lr': 1.0362567368936795e-05, 'episode': 6476, 'epoch': 0.79}



                                                                         
 79%|███████▉  | 1620/2041 [2:20:43<36:12,  5.16s/it]

{'eps': 0, 'objective/kl': 76.87952423095703, 'objective/entropy': 35.916404724121094, 'objective/non_score_reward': -3.8439762592315674, 'objective/rlhf_reward': -6.4575276374816895, 'objective/scores': -2.613551378250122, 'policy/approxkl_avg': 0.0036911703646183014, 'policy/clipfrac_avg': 0.028301887214183807, 'loss/policy_avg': -0.02001500502228737, 'loss/value_avg': 0.30923205614089966, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.678065299987793, 'val/ratio': 0.9965509176254272, 'val/ratio_var': 1.201159739139257e-05, 'val/num_eos_tokens': 0, 'lr': 1.0338069573738364e-05, 'episode': 6480, 'epoch': 0.79}



                                                                         
 79%|███████▉  | 1621/2041 [2:20:48<36:04,  5.15s/it]

{'eps': 0, 'objective/kl': 94.23571014404297, 'objective/entropy': 37.06231689453125, 'objective/non_score_reward': -4.711785793304443, 'objective/rlhf_reward': -7.567210674285889, 'objective/scores': -2.8554248809814453, 'policy/approxkl_avg': 0.00556785287335515, 'policy/clipfrac_avg': 0.04245283082127571, 'loss/policy_avg': -0.021839501336216927, 'loss/value_avg': 0.4038841724395752, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7112800478935242, 'val/ratio': 0.9994710087776184, 'val/ratio_var': 1.2118439371988643e-06, 'val/num_eos_tokens': 0, 'lr': 1.0313571778539932e-05, 'episode': 6484, 'epoch': 0.79}



                                                                         
 79%|███████▉  | 1622/2041 [2:20:53<35:55,  5.15s/it]

{'eps': 0, 'objective/kl': 98.18464660644531, 'objective/entropy': 45.735679626464844, 'objective/non_score_reward': -4.9092326164245605, 'objective/rlhf_reward': -7.446708679199219, 'objective/scores': -2.537475824356079, 'policy/approxkl_avg': 0.0066010779701173306, 'policy/clipfrac_avg': 0.048349056392908096, 'loss/policy_avg': -0.025411332026124, 'loss/value_avg': 0.3818510174751282, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7987737059593201, 'val/ratio': 0.9920322895050049, 'val/ratio_var': 3.973561615566723e-05, 'val/num_eos_tokens': 0, 'lr': 1.02890739833415e-05, 'episode': 6488, 'epoch': 0.79}



                                                                         
 80%|███████▉  | 1623/2041 [2:20:58<35:47,  5.14s/it]

{'eps': 0, 'objective/kl': 89.57823181152344, 'objective/entropy': 43.95458984375, 'objective/non_score_reward': -4.478911399841309, 'objective/rlhf_reward': -7.49162483215332, 'objective/scores': -3.012713670730591, 'policy/approxkl_avg': 0.0055741071701049805, 'policy/clipfrac_avg': 0.05424527823925018, 'loss/policy_avg': -0.024098508059978485, 'loss/value_avg': 0.32943087816238403, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9207425713539124, 'val/ratio': 0.9879659414291382, 'val/ratio_var': 0.0001113057296606712, 'val/num_eos_tokens': 0, 'lr': 1.0264576188143068e-05, 'episode': 6492, 'epoch': 0.8}



                                                                         
 80%|███████▉  | 1624/2041 [2:21:04<35:44,  5.14s/it]

{'eps': 0, 'objective/kl': 85.6937255859375, 'objective/entropy': 35.929359436035156, 'objective/non_score_reward': -4.284686088562012, 'objective/rlhf_reward': -6.5476579666137695, 'objective/scores': -2.262971878051758, 'policy/approxkl_avg': 0.004901314619928598, 'policy/clipfrac_avg': 0.04716981202363968, 'loss/policy_avg': -0.0261714905500412, 'loss/value_avg': 0.27624577283859253, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.715601921081543, 'val/ratio': 0.9883577227592468, 'val/ratio_var': 0.00011416035704314709, 'val/num_eos_tokens': 0, 'lr': 1.0240078392944636e-05, 'episode': 6496, 'epoch': 0.8}



                                                                         
 80%|███████▉  | 1625/2041 [2:21:09<35:49,  5.17s/it]

{'eps': 0, 'objective/kl': 98.10054016113281, 'objective/entropy': 50.6895751953125, 'objective/non_score_reward': -4.905027389526367, 'objective/rlhf_reward': -7.265017986297607, 'objective/scores': -2.3599905967712402, 'policy/approxkl_avg': 0.005892401095479727, 'policy/clipfrac_avg': 0.04716981202363968, 'loss/policy_avg': -0.023003116250038147, 'loss/value_avg': 0.49363380670547485, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8980041146278381, 'val/ratio': 0.9996739029884338, 'val/ratio_var': 2.5963086613955966e-07, 'val/num_eos_tokens': 0, 'lr': 1.0215580597746204e-05, 'episode': 6500, 'epoch': 0.8}



                                                                         
 80%|███████▉  | 1626/2041 [2:21:14<35:44,  5.17s/it]

{'eps': 0, 'objective/kl': 86.68543243408203, 'objective/entropy': 44.996337890625, 'objective/non_score_reward': -4.3342719078063965, 'objective/rlhf_reward': -6.681990623474121, 'objective/scores': -2.3477187156677246, 'policy/approxkl_avg': 0.0036252879071980715, 'policy/clipfrac_avg': 0.029481131583452225, 'loss/policy_avg': -0.015590202063322067, 'loss/value_avg': 0.2816360592842102, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7918887734413147, 'val/ratio': 1.0010521411895752, 'val/ratio_var': 6.892059900565073e-07, 'val/num_eos_tokens': 0, 'lr': 1.0191082802547772e-05, 'episode': 6504, 'epoch': 0.8}



                                                                         
 80%|███████▉  | 1627/2041 [2:21:19<35:22,  5.13s/it]

{'eps': 0, 'objective/kl': 92.20893096923828, 'objective/entropy': 51.42184829711914, 'objective/non_score_reward': -4.610446929931641, 'objective/rlhf_reward': -7.149779319763184, 'objective/scores': -2.539332628250122, 'policy/approxkl_avg': 0.006389048416167498, 'policy/clipfrac_avg': 0.05070754885673523, 'loss/policy_avg': -0.023010846227407455, 'loss/value_avg': 0.3525483012199402, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8212055563926697, 'val/ratio': 0.9876008629798889, 'val/ratio_var': 0.0001302236778428778, 'val/num_eos_tokens': 0, 'lr': 1.0166585007349338e-05, 'episode': 6508, 'epoch': 0.8}



                                                                         
 80%|███████▉  | 1628/2041 [2:21:24<35:18,  5.13s/it]

{'eps': 0, 'objective/kl': 88.35617065429688, 'objective/entropy': 49.08934020996094, 'objective/non_score_reward': -4.4178080558776855, 'objective/rlhf_reward': -6.546093940734863, 'objective/scores': -2.1282856464385986, 'policy/approxkl_avg': 0.007854156196117401, 'policy/clipfrac_avg': 0.051886796951293945, 'loss/policy_avg': -0.023116623982787132, 'loss/value_avg': 0.4701805114746094, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8901957273483276, 'val/ratio': 0.9971968531608582, 'val/ratio_var': 5.804444299428724e-06, 'val/num_eos_tokens': 0, 'lr': 1.0142087212150906e-05, 'episode': 6512, 'epoch': 0.8}



                                                                         
 80%|███████▉  | 1629/2041 [2:21:29<35:17,  5.14s/it]

{'eps': 0, 'objective/kl': 92.81005096435547, 'objective/entropy': 47.6157341003418, 'objective/non_score_reward': -4.6405029296875, 'objective/rlhf_reward': -7.70903205871582, 'objective/scores': -3.0685291290283203, 'policy/approxkl_avg': 0.0028646725695580244, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.020520437508821487, 'loss/value_avg': 0.4258464574813843, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8599497675895691, 'val/ratio': 0.9929810166358948, 'val/ratio_var': 3.621152791311033e-05, 'val/num_eos_tokens': 0, 'lr': 1.0117589416952474e-05, 'episode': 6516, 'epoch': 0.8}



                                                                         
 80%|███████▉  | 1630/2041 [2:21:34<35:11,  5.14s/it]

{'eps': 0, 'objective/kl': 87.83483123779297, 'objective/entropy': 41.8664436340332, 'objective/non_score_reward': -4.391741752624512, 'objective/rlhf_reward': -6.891732215881348, 'objective/scores': -2.499990701675415, 'policy/approxkl_avg': 0.003593477886170149, 'policy/clipfrac_avg': 0.03891509398818016, 'loss/policy_avg': -0.023193586617708206, 'loss/value_avg': 0.2946111261844635, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7790188193321228, 'val/ratio': 0.9887689352035522, 'val/ratio_var': 0.00011527604510774836, 'val/num_eos_tokens': 0, 'lr': 1.0093091621754042e-05, 'episode': 6520, 'epoch': 0.8}



                                                                         
 80%|███████▉  | 1631/2041 [2:21:40<35:01,  5.13s/it]

{'eps': 0, 'objective/kl': 86.61028289794922, 'objective/entropy': 32.8091926574707, 'objective/non_score_reward': -4.330513954162598, 'objective/rlhf_reward': -6.651095390319824, 'objective/scores': -2.3205816745758057, 'policy/approxkl_avg': 0.0050340197049081326, 'policy/clipfrac_avg': 0.041273586452007294, 'loss/policy_avg': -0.018596749752759933, 'loss/value_avg': 0.36164864897727966, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6566879749298096, 'val/ratio': 0.9891903400421143, 'val/ratio_var': 9.549601963954046e-05, 'val/num_eos_tokens': 0, 'lr': 1.006859382655561e-05, 'episode': 6524, 'epoch': 0.8}



                                                                         
 80%|███████▉  | 1632/2041 [2:21:45<34:53,  5.12s/it]

{'eps': 0, 'objective/kl': 80.11167907714844, 'objective/entropy': 46.79159164428711, 'objective/non_score_reward': -4.005583763122559, 'objective/rlhf_reward': -6.664225101470947, 'objective/scores': -2.6586413383483887, 'policy/approxkl_avg': 0.015788616612553596, 'policy/clipfrac_avg': 0.048349056392908096, 'loss/policy_avg': -0.025000840425491333, 'loss/value_avg': 0.37237435579299927, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8082983493804932, 'val/ratio': 0.9918603897094727, 'val/ratio_var': 5.8209203416481614e-05, 'val/num_eos_tokens': 0, 'lr': 1.0044096031357178e-05, 'episode': 6528, 'epoch': 0.8}



                                                                         
 80%|████████  | 1633/2041 [2:21:50<35:04,  5.16s/it]

{'eps': 0, 'objective/kl': 87.88959503173828, 'objective/entropy': 41.236717224121094, 'objective/non_score_reward': -4.394479751586914, 'objective/rlhf_reward': -6.925546169281006, 'objective/scores': -2.531066417694092, 'policy/approxkl_avg': 0.0025503600481897593, 'policy/clipfrac_avg': 0.017688680440187454, 'loss/policy_avg': -0.017123155295848846, 'loss/value_avg': 0.34423643350601196, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7307718992233276, 'val/ratio': 0.989069938659668, 'val/ratio_var': 0.00010926697723334655, 'val/num_eos_tokens': 0, 'lr': 1.0019598236158746e-05, 'episode': 6532, 'epoch': 0.8}



                                                                         
 80%|████████  | 1634/2041 [2:21:58<41:20,  6.09s/it]

{'eps': 0, 'objective/kl': 88.58443450927734, 'objective/entropy': 66.1670150756836, 'objective/non_score_reward': -4.4292216300964355, 'objective/rlhf_reward': -6.840637683868408, 'objective/scores': -2.4114160537719727, 'policy/approxkl_avg': 0.006962330546230078, 'policy/clipfrac_avg': 0.07075471431016922, 'loss/policy_avg': -0.034752778708934784, 'loss/value_avg': 0.4265534281730652, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1091939210891724, 'val/ratio': 0.9961262941360474, 'val/ratio_var': 7.216630820039427e-06, 'val/num_eos_tokens': 0, 'lr': 9.995100440960314e-06, 'episode': 6536, 'epoch': 0.8}



                                                                         
 80%|████████  | 1635/2041 [2:22:03<39:33,  5.85s/it]

{'eps': 0, 'objective/kl': 91.0047607421875, 'objective/entropy': 40.560997009277344, 'objective/non_score_reward': -4.550238132476807, 'objective/rlhf_reward': -6.611834526062012, 'objective/scores': -2.061596155166626, 'policy/approxkl_avg': 0.002808968536555767, 'policy/clipfrac_avg': 0.030660375952720642, 'loss/policy_avg': -0.021144017577171326, 'loss/value_avg': 0.29208964109420776, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.815134584903717, 'val/ratio': 0.9930185079574585, 'val/ratio_var': 4.633356002159417e-05, 'val/num_eos_tokens': 0, 'lr': 9.970602645761882e-06, 'episode': 6540, 'epoch': 0.8}



                                                                         
 80%|████████  | 1636/2041 [2:22:09<38:01,  5.63s/it]

{'eps': 0, 'objective/kl': 83.23532104492188, 'objective/entropy': 55.26076126098633, 'objective/non_score_reward': -4.161766052246094, 'objective/rlhf_reward': -6.5140581130981445, 'objective/scores': -2.352292060852051, 'policy/approxkl_avg': 0.007482608780264854, 'policy/clipfrac_avg': 0.04245282709598541, 'loss/policy_avg': -0.024218467995524406, 'loss/value_avg': 0.3371940553188324, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8798094987869263, 'val/ratio': 1.0025677680969238, 'val/ratio_var': 1.1211453056603204e-05, 'val/num_eos_tokens': 0, 'lr': 9.946104850563449e-06, 'episode': 6544, 'epoch': 0.8}



                                                                         
 80%|████████  | 1637/2041 [2:22:14<37:02,  5.50s/it]

{'eps': 0, 'objective/kl': 93.85821533203125, 'objective/entropy': 32.74333953857422, 'objective/non_score_reward': -4.692910671234131, 'objective/rlhf_reward': -6.72752571105957, 'objective/scores': -2.0346152782440186, 'policy/approxkl_avg': 0.0030914880335330963, 'policy/clipfrac_avg': 0.029481131583452225, 'loss/policy_avg': -0.017586609348654747, 'loss/value_avg': 0.38140928745269775, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6669671535491943, 'val/ratio': 0.9872773885726929, 'val/ratio_var': 0.0001403520000167191, 'val/num_eos_tokens': 0, 'lr': 9.921607055365017e-06, 'episode': 6548, 'epoch': 0.8}



                                                                         
 80%|████████  | 1638/2041 [2:22:19<36:11,  5.39s/it]

{'eps': 0, 'objective/kl': 102.22625732421875, 'objective/entropy': 44.59339141845703, 'objective/non_score_reward': -5.1113128662109375, 'objective/rlhf_reward': -6.853178024291992, 'objective/scores': -1.7418650388717651, 'policy/approxkl_avg': 0.0046173888258636, 'policy/clipfrac_avg': 0.025943396613001823, 'loss/policy_avg': -0.018924936652183533, 'loss/value_avg': 0.2959299683570862, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8777420520782471, 'val/ratio': 0.9869033694267273, 'val/ratio_var': 0.0001362374605378136, 'val/num_eos_tokens': 0, 'lr': 9.897109260166585e-06, 'episode': 6552, 'epoch': 0.8}



                                                                         
 80%|████████  | 1639/2041 [2:22:24<35:43,  5.33s/it]

{'eps': 0, 'objective/kl': 96.43202209472656, 'objective/entropy': 41.78392791748047, 'objective/non_score_reward': -4.821600914001465, 'objective/rlhf_reward': -6.429169654846191, 'objective/scores': -1.6075685024261475, 'policy/approxkl_avg': 0.003998904023319483, 'policy/clipfrac_avg': 0.040094342082738876, 'loss/policy_avg': -0.02147359773516655, 'loss/value_avg': 0.2995959520339966, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7872698307037354, 'val/ratio': 0.9895857572555542, 'val/ratio_var': 9.815117664402351e-05, 'val/num_eos_tokens': 0, 'lr': 9.872611464968155e-06, 'episode': 6556, 'epoch': 0.8}



                                                                         
 80%|████████  | 1640/2041 [2:22:29<35:18,  5.28s/it]

{'eps': 0, 'objective/kl': 81.71710205078125, 'objective/entropy': 44.62117385864258, 'objective/non_score_reward': -4.085855484008789, 'objective/rlhf_reward': -5.527996063232422, 'objective/scores': -1.4421403408050537, 'policy/approxkl_avg': 0.005913854576647282, 'policy/clipfrac_avg': 0.04363207519054413, 'loss/policy_avg': -0.02225748635828495, 'loss/value_avg': 0.354744553565979, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8161712288856506, 'val/ratio': 0.9949319362640381, 'val/ratio_var': 1.3747855518886354e-05, 'val/num_eos_tokens': 0, 'lr': 9.848113669769721e-06, 'episode': 6560, 'epoch': 0.8}



                                                                         
 80%|████████  | 1641/2041 [2:22:34<34:59,  5.25s/it]

{'eps': 0, 'objective/kl': 88.63267517089844, 'objective/entropy': 46.36383819580078, 'objective/non_score_reward': -4.431633949279785, 'objective/rlhf_reward': -6.788393497467041, 'objective/scores': -2.356759548187256, 'policy/approxkl_avg': 0.008799515664577484, 'policy/clipfrac_avg': 0.04952830448746681, 'loss/policy_avg': -0.021609865128993988, 'loss/value_avg': 0.3072992265224457, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8742657899856567, 'val/ratio': 1.0088624954223633, 'val/ratio_var': 8.068305032793432e-05, 'val/num_eos_tokens': 0, 'lr': 9.823615874571289e-06, 'episode': 6564, 'epoch': 0.8}



                                                                         
 80%|████████  | 1642/2041 [2:22:40<34:47,  5.23s/it]

{'eps': 0, 'objective/kl': 95.30183410644531, 'objective/entropy': 66.8558120727539, 'objective/non_score_reward': -4.765091896057129, 'objective/rlhf_reward': -7.6028218269348145, 'objective/scores': -2.8377299308776855, 'policy/approxkl_avg': 0.0031634981278330088, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.021797556430101395, 'loss/value_avg': 0.4859340190887451, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2345856428146362, 'val/ratio': 1.0002896785736084, 'val/ratio_var': 1.6254554111583275e-06, 'val/num_eos_tokens': 0, 'lr': 9.799118079372857e-06, 'episode': 6568, 'epoch': 0.8}



                                                                         
 80%|████████  | 1643/2041 [2:22:45<34:29,  5.20s/it]

{'eps': 0, 'objective/kl': 91.09188079833984, 'objective/entropy': 41.55266571044922, 'objective/non_score_reward': -4.554594039916992, 'objective/rlhf_reward': -7.033167839050293, 'objective/scores': -2.478573799133301, 'policy/approxkl_avg': 0.004074648953974247, 'policy/clipfrac_avg': 0.03773584961891174, 'loss/policy_avg': -0.021178532391786575, 'loss/value_avg': 0.26875972747802734, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6600629091262817, 'val/ratio': 0.9902640581130981, 'val/ratio_var': 8.398036152357236e-05, 'val/num_eos_tokens': 0, 'lr': 9.774620284174425e-06, 'episode': 6572, 'epoch': 0.81}



                                                                         
 81%|████████  | 1644/2041 [2:22:50<34:24,  5.20s/it]

{'eps': 0, 'objective/kl': 90.79521179199219, 'objective/entropy': 60.81327819824219, 'objective/non_score_reward': -4.539760589599609, 'objective/rlhf_reward': -7.1045427322387695, 'objective/scores': -2.56478214263916, 'policy/approxkl_avg': 0.013453260064125061, 'policy/clipfrac_avg': 0.06367924064397812, 'loss/policy_avg': -0.03368142247200012, 'loss/value_avg': 0.4249652922153473, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2258102893829346, 'val/ratio': 0.9875372648239136, 'val/ratio_var': 9.459723514737561e-05, 'val/num_eos_tokens': 0, 'lr': 9.750122488975991e-06, 'episode': 6576, 'epoch': 0.81}



                                                                         
 81%|████████  | 1645/2041 [2:22:55<34:13,  5.19s/it]

{'eps': 0, 'objective/kl': 89.07474517822266, 'objective/entropy': 91.54000854492188, 'objective/non_score_reward': -4.453737258911133, 'objective/rlhf_reward': -7.15978479385376, 'objective/scores': -2.706047534942627, 'policy/approxkl_avg': 0.005384400021284819, 'policy/clipfrac_avg': 0.06014150753617287, 'loss/policy_avg': -0.03284571319818497, 'loss/value_avg': 0.31886184215545654, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.443481683731079, 'val/ratio': 0.9876775741577148, 'val/ratio_var': 0.0001385548384860158, 'val/num_eos_tokens': 0, 'lr': 9.72562469377756e-06, 'episode': 6580, 'epoch': 0.81}



                                                                         
 81%|████████  | 1646/2041 [2:23:00<34:05,  5.18s/it]

{'eps': 0, 'objective/kl': 79.24211120605469, 'objective/entropy': 56.678993225097656, 'objective/non_score_reward': -3.9621055126190186, 'objective/rlhf_reward': -6.270987510681152, 'objective/scores': -2.3088817596435547, 'policy/approxkl_avg': 0.0033030640333890915, 'policy/clipfrac_avg': 0.03773584961891174, 'loss/policy_avg': -0.023237256333231926, 'loss/value_avg': 0.2284928560256958, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1780204772949219, 'val/ratio': 0.9989036321640015, 'val/ratio_var': 3.2393738820246654e-06, 'val/num_eos_tokens': 0, 'lr': 9.70112689857913e-06, 'episode': 6584, 'epoch': 0.81}



                                                                         
 81%|████████  | 1647/2041 [2:23:05<33:59,  5.18s/it]

{'eps': 0, 'objective/kl': 78.55818939208984, 'objective/entropy': 53.20879364013672, 'objective/non_score_reward': -3.9279096126556396, 'objective/rlhf_reward': -7.206233024597168, 'objective/scores': -3.2783234119415283, 'policy/approxkl_avg': 0.005029099993407726, 'policy/clipfrac_avg': 0.0554245263338089, 'loss/policy_avg': -0.029460061341524124, 'loss/value_avg': 0.3532157838344574, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9138268232345581, 'val/ratio': 0.9931407570838928, 'val/ratio_var': 4.0556562453275546e-05, 'val/num_eos_tokens': 0, 'lr': 9.676629103380697e-06, 'episode': 6588, 'epoch': 0.81}



                                                                         
 81%|████████  | 1648/2041 [2:23:11<33:55,  5.18s/it]

{'eps': 0, 'objective/kl': 84.29054260253906, 'objective/entropy': 32.3446159362793, 'objective/non_score_reward': -4.214527130126953, 'objective/rlhf_reward': -6.551600933074951, 'objective/scores': -2.337073802947998, 'policy/approxkl_avg': 0.0019714338704943657, 'policy/clipfrac_avg': 0.021226413547992706, 'loss/policy_avg': -0.015385495498776436, 'loss/value_avg': 0.2489500790834427, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6613538265228271, 'val/ratio': 0.9944149255752563, 'val/ratio_var': 2.6597257601679303e-05, 'val/num_eos_tokens': 0, 'lr': 9.652131308182264e-06, 'episode': 6592, 'epoch': 0.81}



                                                                         
 81%|████████  | 1649/2041 [2:23:16<33:42,  5.16s/it]

{'eps': 0, 'objective/kl': 96.98432159423828, 'objective/entropy': 44.817176818847656, 'objective/non_score_reward': -4.849216461181641, 'objective/rlhf_reward': -7.50368070602417, 'objective/scores': -2.6544642448425293, 'policy/approxkl_avg': 0.004457677714526653, 'policy/clipfrac_avg': 0.048349060118198395, 'loss/policy_avg': -0.02034946158528328, 'loss/value_avg': 0.3667711615562439, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7797873020172119, 'val/ratio': 0.9967418313026428, 'val/ratio_var': 8.525124030711595e-06, 'val/num_eos_tokens': 0, 'lr': 9.627633512983832e-06, 'episode': 6596, 'epoch': 0.81}



                                                                         
 81%|████████  | 1650/2041 [2:23:21<33:29,  5.14s/it]

{'eps': 0, 'objective/kl': 92.3011703491211, 'objective/entropy': 38.29458236694336, 'objective/non_score_reward': -4.615058422088623, 'objective/rlhf_reward': -7.579999923706055, 'objective/scores': -2.9649415016174316, 'policy/approxkl_avg': 0.0034941183403134346, 'policy/clipfrac_avg': 0.03891509398818016, 'loss/policy_avg': -0.01763201504945755, 'loss/value_avg': 0.34721943736076355, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7275925874710083, 'val/ratio': 0.9888179302215576, 'val/ratio_var': 0.00010177757212659344, 'val/num_eos_tokens': 0, 'lr': 9.6031357177854e-06, 'episode': 6600, 'epoch': 0.81}



                                                                         
 81%|████████  | 1651/2041 [2:23:26<33:35,  5.17s/it]

{'eps': 0, 'objective/kl': 91.893798828125, 'objective/entropy': 42.65263748168945, 'objective/non_score_reward': -4.594689846038818, 'objective/rlhf_reward': -6.927160263061523, 'objective/scores': -2.332470417022705, 'policy/approxkl_avg': 0.004102789331227541, 'policy/clipfrac_avg': 0.03773584961891174, 'loss/policy_avg': -0.023223331198096275, 'loss/value_avg': 0.3461306095123291, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7967909574508667, 'val/ratio': 0.9861096739768982, 'val/ratio_var': 0.0001588072773301974, 'val/num_eos_tokens': 0, 'lr': 9.578637922586968e-06, 'episode': 6604, 'epoch': 0.81}



                                                                         
 81%|████████  | 1652/2041 [2:23:31<33:27,  5.16s/it]

{'eps': 0, 'objective/kl': 88.13104248046875, 'objective/entropy': 38.53125, 'objective/non_score_reward': -4.406551837921143, 'objective/rlhf_reward': -6.918341159820557, 'objective/scores': -2.511789321899414, 'policy/approxkl_avg': 0.0031630597077310085, 'policy/clipfrac_avg': 0.030660375952720642, 'loss/policy_avg': -0.019208021461963654, 'loss/value_avg': 0.25306451320648193, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.5995961427688599, 'val/ratio': 0.9884560108184814, 'val/ratio_var': 0.00011420482769608498, 'val/num_eos_tokens': 0, 'lr': 9.554140127388536e-06, 'episode': 6608, 'epoch': 0.81}



                                                                         
 81%|████████  | 1653/2041 [2:23:36<33:23,  5.16s/it]

{'eps': 0, 'objective/kl': 70.49781036376953, 'objective/entropy': 30.590091705322266, 'objective/non_score_reward': -3.524890661239624, 'objective/rlhf_reward': -6.445730209350586, 'objective/scores': -2.920839309692383, 'policy/approxkl_avg': 0.005056682042777538, 'policy/clipfrac_avg': 0.041273582726716995, 'loss/policy_avg': -0.0236592385917902, 'loss/value_avg': 0.3791111707687378, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6150777339935303, 'val/ratio': 0.9873170256614685, 'val/ratio_var': 0.0001289073406951502, 'val/num_eos_tokens': 0, 'lr': 9.529642332190102e-06, 'episode': 6612, 'epoch': 0.81}



                                                                         
 81%|████████  | 1654/2041 [2:23:42<33:11,  5.15s/it]

{'eps': 0, 'objective/kl': 85.58810424804688, 'objective/entropy': 63.28370666503906, 'objective/non_score_reward': -4.279404640197754, 'objective/rlhf_reward': -6.743763446807861, 'objective/scores': -2.4643588066101074, 'policy/approxkl_avg': 0.005030524916946888, 'policy/clipfrac_avg': 0.05896226316690445, 'loss/policy_avg': -0.029356760904192924, 'loss/value_avg': 0.3415513038635254, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2305599451065063, 'val/ratio': 1.0022854804992676, 'val/ratio_var': 1.1466249816294294e-05, 'val/num_eos_tokens': 0, 'lr': 9.505144536991672e-06, 'episode': 6616, 'epoch': 0.81}



                                                                         
 81%|████████  | 1655/2041 [2:23:47<33:03,  5.14s/it]

{'eps': 0, 'objective/kl': 84.92469024658203, 'objective/entropy': 37.59016418457031, 'objective/non_score_reward': -4.246234893798828, 'objective/rlhf_reward': -6.102084159851074, 'objective/scores': -1.8558491468429565, 'policy/approxkl_avg': 0.008164015598595142, 'policy/clipfrac_avg': 0.03183962404727936, 'loss/policy_avg': -0.018521830439567566, 'loss/value_avg': 0.32396265864372253, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8055298924446106, 'val/ratio': 0.9872053265571594, 'val/ratio_var': 0.0001286827027797699, 'val/num_eos_tokens': 0, 'lr': 9.48064674179324e-06, 'episode': 6620, 'epoch': 0.81}



                                                                         
 81%|████████  | 1656/2041 [2:23:52<32:55,  5.13s/it]

{'eps': 0, 'objective/kl': 84.70825958251953, 'objective/entropy': 44.88767623901367, 'objective/non_score_reward': -4.23541259765625, 'objective/rlhf_reward': -7.014396667480469, 'objective/scores': -2.7789840698242188, 'policy/approxkl_avg': 0.005502892658114433, 'policy/clipfrac_avg': 0.041273582726716995, 'loss/policy_avg': -0.021762050688266754, 'loss/value_avg': 0.43393659591674805, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7478398084640503, 'val/ratio': 0.9951586127281189, 'val/ratio_var': 1.606067962711677e-05, 'val/num_eos_tokens': 0, 'lr': 9.456148946594808e-06, 'episode': 6624, 'epoch': 0.81}



                                                                         
 81%|████████  | 1657/2041 [2:23:57<32:51,  5.13s/it]

{'eps': 0, 'objective/kl': 91.69621276855469, 'objective/entropy': 39.699981689453125, 'objective/non_score_reward': -4.584810733795166, 'objective/rlhf_reward': -6.673650741577148, 'objective/scores': -2.0888402462005615, 'policy/approxkl_avg': 0.008158117532730103, 'policy/clipfrac_avg': 0.04599056765437126, 'loss/policy_avg': -0.024404089897871017, 'loss/value_avg': 0.40457412600517273, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8256930112838745, 'val/ratio': 0.9911722540855408, 'val/ratio_var': 7.090614963090047e-05, 'val/num_eos_tokens': 0, 'lr': 9.431651151396374e-06, 'episode': 6628, 'epoch': 0.81}



                                                                         
 81%|████████  | 1658/2041 [2:24:02<32:50,  5.15s/it]

{'eps': 0, 'objective/kl': 88.10334777832031, 'objective/entropy': 50.60997009277344, 'objective/non_score_reward': -4.405167579650879, 'objective/rlhf_reward': -6.580226898193359, 'objective/scores': -2.1750593185424805, 'policy/approxkl_avg': 0.0030286316759884357, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.02128487639129162, 'loss/value_avg': 0.35739368200302124, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9273163080215454, 'val/ratio': 0.9898270964622498, 'val/ratio_var': 8.716581942280754e-05, 'val/num_eos_tokens': 0, 'lr': 9.407153356197942e-06, 'episode': 6632, 'epoch': 0.81}



                                                                         
 81%|████████▏ | 1659/2041 [2:24:07<32:43,  5.14s/it]

{'eps': 0, 'objective/kl': 92.80635070800781, 'objective/entropy': 58.9376335144043, 'objective/non_score_reward': -4.640316963195801, 'objective/rlhf_reward': -7.644272804260254, 'objective/scores': -3.003955602645874, 'policy/approxkl_avg': 0.0060064224526286125, 'policy/clipfrac_avg': 0.056603774428367615, 'loss/policy_avg': -0.02917652204632759, 'loss/value_avg': 0.5438058376312256, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0938470363616943, 'val/ratio': 0.9848248958587646, 'val/ratio_var': 0.00019740879361052066, 'val/num_eos_tokens': 0, 'lr': 9.38265556099951e-06, 'episode': 6636, 'epoch': 0.81}



                                                                         
 81%|████████▏ | 1660/2041 [2:24:12<32:43,  5.15s/it]

{'eps': 0, 'objective/kl': 81.6278305053711, 'objective/entropy': 47.349205017089844, 'objective/non_score_reward': -4.08139181137085, 'objective/rlhf_reward': -5.970767021179199, 'objective/scores': -1.8893752098083496, 'policy/approxkl_avg': 0.003239239798858762, 'policy/clipfrac_avg': 0.024764152243733406, 'loss/policy_avg': -0.019068123772740364, 'loss/value_avg': 0.35960787534713745, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9077165126800537, 'val/ratio': 0.9903524518013, 'val/ratio_var': 8.775200694799423e-05, 'val/num_eos_tokens': 0, 'lr': 9.358157765801078e-06, 'episode': 6640, 'epoch': 0.81}



                                                                         
 81%|████████▏ | 1661/2041 [2:24:18<32:46,  5.18s/it]

{'eps': 0, 'objective/kl': 80.87228393554688, 'objective/entropy': 56.18301010131836, 'objective/non_score_reward': -4.043614387512207, 'objective/rlhf_reward': -6.409514427185059, 'objective/scores': -2.3659000396728516, 'policy/approxkl_avg': 0.00222115870565176, 'policy/clipfrac_avg': 0.02712264098227024, 'loss/policy_avg': -0.02040473185479641, 'loss/value_avg': 0.34841588139533997, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0910022258758545, 'val/ratio': 0.9951981902122498, 'val/ratio_var': 1.9514294763212092e-05, 'val/num_eos_tokens': 0, 'lr': 9.333659970602645e-06, 'episode': 6644, 'epoch': 0.81}



                                                                         
 81%|████████▏ | 1662/2041 [2:24:23<32:54,  5.21s/it]

{'eps': 0, 'objective/kl': 87.27204132080078, 'objective/entropy': 39.59008026123047, 'objective/non_score_reward': -4.3636016845703125, 'objective/rlhf_reward': -6.760812759399414, 'objective/scores': -2.3972110748291016, 'policy/approxkl_avg': 0.004051733296364546, 'policy/clipfrac_avg': 0.041273586452007294, 'loss/policy_avg': -0.023862408474087715, 'loss/value_avg': 0.3230949640274048, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8250221610069275, 'val/ratio': 0.9925270080566406, 'val/ratio_var': 4.99783200211823e-05, 'val/num_eos_tokens': 0, 'lr': 9.309162175404215e-06, 'episode': 6648, 'epoch': 0.81}



                                                                         
 81%|████████▏ | 1663/2041 [2:24:28<32:41,  5.19s/it]

{'eps': 0, 'objective/kl': 101.13490295410156, 'objective/entropy': 57.65923309326172, 'objective/non_score_reward': -5.056745529174805, 'objective/rlhf_reward': -7.349654197692871, 'objective/scores': -2.2929089069366455, 'policy/approxkl_avg': 0.011923898942768574, 'policy/clipfrac_avg': 0.05778301879763603, 'loss/policy_avg': -0.025726670399308205, 'loss/value_avg': 0.4289698004722595, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9120070934295654, 'val/ratio': 0.9903506636619568, 'val/ratio_var': 5.778064951300621e-05, 'val/num_eos_tokens': 0, 'lr': 9.284664380205783e-06, 'episode': 6652, 'epoch': 0.81}



                                                                         
 82%|████████▏ | 1664/2041 [2:24:33<32:33,  5.18s/it]

{'eps': 0, 'objective/kl': 82.83148193359375, 'objective/entropy': 45.17992401123047, 'objective/non_score_reward': -4.141573905944824, 'objective/rlhf_reward': -6.442455291748047, 'objective/scores': -2.3008813858032227, 'policy/approxkl_avg': 0.006533350329846144, 'policy/clipfrac_avg': 0.03655660152435303, 'loss/policy_avg': -0.02176070399582386, 'loss/value_avg': 0.3568400740623474, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7476369142532349, 'val/ratio': 1.0025883913040161, 'val/ratio_var': 1.1871471542690415e-05, 'val/num_eos_tokens': 0, 'lr': 9.26016658500735e-06, 'episode': 6656, 'epoch': 0.82}



                                                                         
 82%|████████▏ | 1665/2041 [2:24:38<32:22,  5.17s/it]

{'eps': 0, 'objective/kl': 81.4377670288086, 'objective/entropy': 40.54036331176758, 'objective/non_score_reward': -4.0718889236450195, 'objective/rlhf_reward': -6.545729160308838, 'objective/scores': -2.4738402366638184, 'policy/approxkl_avg': 0.0046464246697723866, 'policy/clipfrac_avg': 0.03183962404727936, 'loss/policy_avg': -0.018359562382102013, 'loss/value_avg': 0.38115495443344116, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7681763172149658, 'val/ratio': 0.9909243583679199, 'val/ratio_var': 7.231983909150586e-05, 'val/num_eos_tokens': 0, 'lr': 9.235668789808917e-06, 'episode': 6660, 'epoch': 0.82}



                                                                         
 82%|████████▏ | 1666/2041 [2:24:43<32:11,  5.15s/it]

{'eps': 0, 'objective/kl': 82.60670471191406, 'objective/entropy': 44.5292854309082, 'objective/non_score_reward': -4.130334854125977, 'objective/rlhf_reward': -7.129834175109863, 'objective/scores': -2.999499559402466, 'policy/approxkl_avg': 0.003352199448272586, 'policy/clipfrac_avg': 0.03183962404727936, 'loss/policy_avg': -0.018062513321638107, 'loss/value_avg': 0.41612309217453003, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7934445142745972, 'val/ratio': 0.9997025728225708, 'val/ratio_var': 4.157810735705425e-07, 'val/num_eos_tokens': 0, 'lr': 9.211170994610485e-06, 'episode': 6664, 'epoch': 0.82}



                                                                         
 82%|████████▏ | 1667/2041 [2:24:49<32:01,  5.14s/it]

{'eps': 0, 'objective/kl': 99.89649200439453, 'objective/entropy': 37.880210876464844, 'objective/non_score_reward': -4.994824409484863, 'objective/rlhf_reward': -7.489025592803955, 'objective/scores': -2.494201183319092, 'policy/approxkl_avg': 0.002433042274788022, 'policy/clipfrac_avg': 0.030660375952720642, 'loss/policy_avg': -0.015902122482657433, 'loss/value_avg': 0.49589017033576965, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7546021342277527, 'val/ratio': 0.9969814419746399, 'val/ratio_var': 6.2428553064819425e-06, 'val/num_eos_tokens': 0, 'lr': 9.186673199412053e-06, 'episode': 6668, 'epoch': 0.82}



                                                                         
 82%|████████▏ | 1668/2041 [2:24:54<31:58,  5.14s/it]

{'eps': 0, 'objective/kl': 85.95574951171875, 'objective/entropy': 34.10020065307617, 'objective/non_score_reward': -4.297787666320801, 'objective/rlhf_reward': -6.784271240234375, 'objective/scores': -2.486483335494995, 'policy/approxkl_avg': 0.005508802831172943, 'policy/clipfrac_avg': 0.03655660152435303, 'loss/policy_avg': -0.022259246557950974, 'loss/value_avg': 0.530245840549469, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6224436163902283, 'val/ratio': 1.0053584575653076, 'val/ratio_var': 3.3348158467561007e-05, 'val/num_eos_tokens': 0, 'lr': 9.162175404213621e-06, 'episode': 6672, 'epoch': 0.82}



                                                                         
 82%|████████▏ | 1669/2041 [2:24:59<32:01,  5.17s/it]

{'eps': 0, 'objective/kl': 81.36375427246094, 'objective/entropy': 48.86015701293945, 'objective/non_score_reward': -4.068187713623047, 'objective/rlhf_reward': -6.936501979827881, 'objective/scores': -2.868314266204834, 'policy/approxkl_avg': 0.004361678380519152, 'policy/clipfrac_avg': 0.04363207519054413, 'loss/policy_avg': -0.020465534180402756, 'loss/value_avg': 0.5159295201301575, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9523313045501709, 'val/ratio': 0.9890881180763245, 'val/ratio_var': 8.603144669905305e-05, 'val/num_eos_tokens': 0, 'lr': 9.13767760901519e-06, 'episode': 6676, 'epoch': 0.82}



                                                                         
 82%|████████▏ | 1670/2041 [2:25:04<31:58,  5.17s/it]

{'eps': 0, 'objective/kl': 85.50102233886719, 'objective/entropy': 40.51564407348633, 'objective/non_score_reward': -4.275051116943359, 'objective/rlhf_reward': -6.668346881866455, 'objective/scores': -2.3932957649230957, 'policy/approxkl_avg': 0.003425129922106862, 'policy/clipfrac_avg': 0.04245282709598541, 'loss/policy_avg': -0.01767311990261078, 'loss/value_avg': 0.3819938600063324, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8630448579788208, 'val/ratio': 0.9876789450645447, 'val/ratio_var': 0.00012644490925595164, 'val/num_eos_tokens': 0, 'lr': 9.113179813816757e-06, 'episode': 6680, 'epoch': 0.82}



                                                                         
 82%|████████▏ | 1671/2041 [2:25:09<31:48,  5.16s/it]

{'eps': 0, 'objective/kl': 89.22575378417969, 'objective/entropy': 64.7737808227539, 'objective/non_score_reward': -4.461287498474121, 'objective/rlhf_reward': -6.924715995788574, 'objective/scores': -2.4634287357330322, 'policy/approxkl_avg': 0.0075348857790231705, 'policy/clipfrac_avg': 0.05778302252292633, 'loss/policy_avg': -0.028583811596035957, 'loss/value_avg': 0.33398568630218506, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2202544212341309, 'val/ratio': 0.9945964813232422, 'val/ratio_var': 2.196530840592459e-05, 'val/num_eos_tokens': 0, 'lr': 9.088682018618325e-06, 'episode': 6684, 'epoch': 0.82}



                                                                         
 82%|████████▏ | 1672/2041 [2:25:14<31:43,  5.16s/it]

{'eps': 0, 'objective/kl': 83.97407531738281, 'objective/entropy': 28.88245964050293, 'objective/non_score_reward': -4.198703765869141, 'objective/rlhf_reward': -6.28059720993042, 'objective/scores': -2.0818934440612793, 'policy/approxkl_avg': 0.003319760551676154, 'policy/clipfrac_avg': 0.03183962404727936, 'loss/policy_avg': -0.018505195155739784, 'loss/value_avg': 0.2887333929538727, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6844806671142578, 'val/ratio': 0.9947282671928406, 'val/ratio_var': 2.4772189135546796e-05, 'val/num_eos_tokens': 0, 'lr': 9.064184223419893e-06, 'episode': 6688, 'epoch': 0.82}



                                                                         
 82%|████████▏ | 1673/2041 [2:25:20<31:37,  5.16s/it]

{'eps': 0, 'objective/kl': 63.33409118652344, 'objective/entropy': 38.13539505004883, 'objective/non_score_reward': -3.1667044162750244, 'objective/rlhf_reward': -6.188133239746094, 'objective/scores': -3.0214290618896484, 'policy/approxkl_avg': 0.005023781675845385, 'policy/clipfrac_avg': 0.030660375952720642, 'loss/policy_avg': -0.015766367316246033, 'loss/value_avg': 0.4862942695617676, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7536401152610779, 'val/ratio': 1.0017743110656738, 'val/ratio_var': 3.882668806909351e-06, 'val/num_eos_tokens': 0, 'lr': 9.039686428221461e-06, 'episode': 6692, 'epoch': 0.82}



                                                                         
 82%|████████▏ | 1674/2041 [2:25:25<31:39,  5.18s/it]

{'eps': 0, 'objective/kl': 82.59750366210938, 'objective/entropy': 38.187835693359375, 'objective/non_score_reward': -4.129875183105469, 'objective/rlhf_reward': -6.332472324371338, 'objective/scores': -2.202597141265869, 'policy/approxkl_avg': 0.0023542798589915037, 'policy/clipfrac_avg': 0.029481131583452225, 'loss/policy_avg': -0.01691623032093048, 'loss/value_avg': 0.32092463970184326, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8545576930046082, 'val/ratio': 0.9930645227432251, 'val/ratio_var': 4.190972322248854e-05, 'val/num_eos_tokens': 0, 'lr': 9.015188633023028e-06, 'episode': 6696, 'epoch': 0.82}



                                                                         
 82%|████████▏ | 1675/2041 [2:25:30<31:38,  5.19s/it]

{'eps': 0, 'objective/kl': 90.77144622802734, 'objective/entropy': 54.42582321166992, 'objective/non_score_reward': -4.538572311401367, 'objective/rlhf_reward': -7.013896465301514, 'objective/scores': -2.4753241539001465, 'policy/approxkl_avg': 0.005445542745292187, 'policy/clipfrac_avg': 0.056603774428367615, 'loss/policy_avg': -0.025728978216648102, 'loss/value_avg': 0.46592435240745544, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0012316703796387, 'val/ratio': 0.9969048500061035, 'val/ratio_var': 7.619252301083179e-06, 'val/num_eos_tokens': 0, 'lr': 8.990690837824596e-06, 'episode': 6700, 'epoch': 0.82}



                                                                         
 82%|████████▏ | 1676/2041 [2:25:35<31:30,  5.18s/it]

{'eps': 0, 'objective/kl': 84.26780700683594, 'objective/entropy': 54.8505744934082, 'objective/non_score_reward': -4.213390827178955, 'objective/rlhf_reward': -6.6919074058532715, 'objective/scores': -2.4785165786743164, 'policy/approxkl_avg': 0.004933509044349194, 'policy/clipfrac_avg': 0.05070754885673523, 'loss/policy_avg': -0.026302652433514595, 'loss/value_avg': 0.45775067806243896, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1122608184814453, 'val/ratio': 0.9900965690612793, 'val/ratio_var': 8.456146315438673e-05, 'val/num_eos_tokens': 0, 'lr': 8.966193042626164e-06, 'episode': 6704, 'epoch': 0.82}



                                                                         
 82%|████████▏ | 1677/2041 [2:25:40<31:27,  5.19s/it]

{'eps': 0, 'objective/kl': 97.38621520996094, 'objective/entropy': 61.92784881591797, 'objective/non_score_reward': -4.86931037902832, 'objective/rlhf_reward': -7.469033241271973, 'objective/scores': -2.5997231006622314, 'policy/approxkl_avg': 0.005146874114871025, 'policy/clipfrac_avg': 0.04716981202363968, 'loss/policy_avg': -0.030648667365312576, 'loss/value_avg': 0.691131055355072, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2654614448547363, 'val/ratio': 0.9980038404464722, 'val/ratio_var': 4.472890850593103e-06, 'val/num_eos_tokens': 0, 'lr': 8.941695247427732e-06, 'episode': 6708, 'epoch': 0.82}



                                                                         
 82%|████████▏ | 1678/2041 [2:25:46<31:21,  5.18s/it]

{'eps': 0, 'objective/kl': 86.32337951660156, 'objective/entropy': 31.13812255859375, 'objective/non_score_reward': -4.316169261932373, 'objective/rlhf_reward': -6.9069623947143555, 'objective/scores': -2.5907931327819824, 'policy/approxkl_avg': 0.0016303667798638344, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.01272769644856453, 'loss/value_avg': 0.3618273138999939, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7376712560653687, 'val/ratio': 0.9912709593772888, 'val/ratio_var': 7.361983443843201e-05, 'val/num_eos_tokens': 0, 'lr': 8.9171974522293e-06, 'episode': 6712, 'epoch': 0.82}



                                                                         
 82%|████████▏ | 1679/2041 [2:25:51<31:22,  5.20s/it]

{'eps': 0, 'objective/kl': 72.076416015625, 'objective/entropy': 41.44179916381836, 'objective/non_score_reward': -3.60382080078125, 'objective/rlhf_reward': -5.770906925201416, 'objective/scores': -2.167086124420166, 'policy/approxkl_avg': 0.00874320138245821, 'policy/clipfrac_avg': 0.06132075563073158, 'loss/policy_avg': -0.025563372299075127, 'loss/value_avg': 0.284126341342926, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7471528053283691, 'val/ratio': 1.0082460641860962, 'val/ratio_var': 7.34782952349633e-05, 'val/num_eos_tokens': 0, 'lr': 8.892699657030868e-06, 'episode': 6716, 'epoch': 0.82}



                                                                         
 82%|████████▏ | 1680/2041 [2:25:56<31:16,  5.20s/it]

{'eps': 0, 'objective/kl': 79.53206634521484, 'objective/entropy': 69.7061767578125, 'objective/non_score_reward': -3.9766030311584473, 'objective/rlhf_reward': -6.163053512573242, 'objective/scores': -2.186450719833374, 'policy/approxkl_avg': 0.003416120307520032, 'policy/clipfrac_avg': 0.030660375952720642, 'loss/policy_avg': -0.023544253781437874, 'loss/value_avg': 0.3572986423969269, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2719751596450806, 'val/ratio': 0.9905449748039246, 'val/ratio_var': 9.028768545249477e-05, 'val/num_eos_tokens': 0, 'lr': 8.868201861832436e-06, 'episode': 6720, 'epoch': 0.82}



                                                                         
 82%|████████▏ | 1681/2041 [2:26:01<31:06,  5.18s/it]

{'eps': 0, 'objective/kl': 75.27799987792969, 'objective/entropy': 75.98030853271484, 'objective/non_score_reward': -3.7639002799987793, 'objective/rlhf_reward': -6.225950241088867, 'objective/scores': -2.462049961090088, 'policy/approxkl_avg': 0.008486364968121052, 'policy/clipfrac_avg': 0.06132075563073158, 'loss/policy_avg': -0.026666797697544098, 'loss/value_avg': 0.2770797908306122, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3727242946624756, 'val/ratio': 0.9848887324333191, 'val/ratio_var': 0.00016562140081077814, 'val/num_eos_tokens': 0, 'lr': 8.843704066634004e-06, 'episode': 6724, 'epoch': 0.82}



                                                                         
 82%|████████▏ | 1682/2041 [2:26:06<30:57,  5.18s/it]

{'eps': 0, 'objective/kl': 79.51824951171875, 'objective/entropy': 50.4979248046875, 'objective/non_score_reward': -3.975912570953369, 'objective/rlhf_reward': -6.276718616485596, 'objective/scores': -2.3008060455322266, 'policy/approxkl_avg': 0.0121596185490489, 'policy/clipfrac_avg': 0.05070754513144493, 'loss/policy_avg': -0.02519533969461918, 'loss/value_avg': 0.4089058041572571, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9368419051170349, 'val/ratio': 1.0228124856948853, 'val/ratio_var': 0.000575523532461375, 'val/num_eos_tokens': 0, 'lr': 8.81920627143557e-06, 'episode': 6728, 'epoch': 0.82}



                                                                         
 82%|████████▏ | 1683/2041 [2:26:11<30:47,  5.16s/it]

{'eps': 0, 'objective/kl': 68.91152954101562, 'objective/entropy': 48.06227111816406, 'objective/non_score_reward': -3.4455766677856445, 'objective/rlhf_reward': -6.446165561676025, 'objective/scores': -3.000588893890381, 'policy/approxkl_avg': 0.004228720907121897, 'policy/clipfrac_avg': 0.04245283082127571, 'loss/policy_avg': -0.02389240264892578, 'loss/value_avg': 0.3817400634288788, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8797810673713684, 'val/ratio': 0.9938982725143433, 'val/ratio_var': 2.4768392904661596e-05, 'val/num_eos_tokens': 0, 'lr': 8.794708476237138e-06, 'episode': 6732, 'epoch': 0.82}



                                                                         
 83%|████████▎ | 1684/2041 [2:26:17<30:41,  5.16s/it]

{'eps': 0, 'objective/kl': 85.2637939453125, 'objective/entropy': 40.90315246582031, 'objective/non_score_reward': -4.263190269470215, 'objective/rlhf_reward': -6.073041915893555, 'objective/scores': -1.8098516464233398, 'policy/approxkl_avg': 0.0035455760080367327, 'policy/clipfrac_avg': 0.033018868416547775, 'loss/policy_avg': -0.018560010939836502, 'loss/value_avg': 0.45432350039482117, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8915112018585205, 'val/ratio': 0.9864251017570496, 'val/ratio_var': 0.00014967053721193224, 'val/num_eos_tokens': 0, 'lr': 8.770210681038706e-06, 'episode': 6736, 'epoch': 0.83}



                                                                         
 83%|████████▎ | 1685/2041 [2:26:22<30:26,  5.13s/it]

{'eps': 0, 'objective/kl': 74.83187866210938, 'objective/entropy': 57.02754211425781, 'objective/non_score_reward': -3.741594076156616, 'objective/rlhf_reward': -6.436354637145996, 'objective/scores': -2.694760799407959, 'policy/approxkl_avg': 0.004566696938127279, 'policy/clipfrac_avg': 0.04599056392908096, 'loss/policy_avg': -0.023421233519911766, 'loss/value_avg': 0.4139278531074524, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1092636585235596, 'val/ratio': 0.9898996949195862, 'val/ratio_var': 8.662974141770974e-05, 'val/num_eos_tokens': 0, 'lr': 8.745712885840274e-06, 'episode': 6740, 'epoch': 0.83}



                                                                         
 83%|████████▎ | 1686/2041 [2:26:27<30:20,  5.13s/it]

{'eps': 0, 'objective/kl': 77.58543395996094, 'objective/entropy': 60.69089889526367, 'objective/non_score_reward': -3.879271984100342, 'objective/rlhf_reward': -6.212121486663818, 'objective/scores': -2.3328495025634766, 'policy/approxkl_avg': 0.0030639050528407097, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.021254245191812515, 'loss/value_avg': 0.37122321128845215, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.166104793548584, 'val/ratio': 0.9902523159980774, 'val/ratio_var': 8.658102160552517e-05, 'val/num_eos_tokens': 0, 'lr': 8.721215090641843e-06, 'episode': 6744, 'epoch': 0.83}



                                                                         
 83%|████████▎ | 1687/2041 [2:26:32<30:19,  5.14s/it]

{'eps': 0, 'objective/kl': 87.01347351074219, 'objective/entropy': 72.05902099609375, 'objective/non_score_reward': -4.350673675537109, 'objective/rlhf_reward': -7.662991046905518, 'objective/scores': -3.312317371368408, 'policy/approxkl_avg': 0.0030587506480515003, 'policy/clipfrac_avg': 0.030660375952720642, 'loss/policy_avg': -0.021544137969613075, 'loss/value_avg': 0.5922356843948364, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3191031217575073, 'val/ratio': 0.9839922189712524, 'val/ratio_var': 0.00022903759963810444, 'val/num_eos_tokens': 0, 'lr': 8.69671729544341e-06, 'episode': 6748, 'epoch': 0.83}



                                                                         
 83%|████████▎ | 1688/2041 [2:26:37<30:10,  5.13s/it]

{'eps': 0, 'objective/kl': 93.75804138183594, 'objective/entropy': 54.73827362060547, 'objective/non_score_reward': -4.687902450561523, 'objective/rlhf_reward': -7.882577896118164, 'objective/scores': -3.1946752071380615, 'policy/approxkl_avg': 0.0039720069617033005, 'policy/clipfrac_avg': 0.03773584961891174, 'loss/policy_avg': -0.021293845027685165, 'loss/value_avg': 0.6439887881278992, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2392266988754272, 'val/ratio': 0.9948151111602783, 'val/ratio_var': 2.6705491109169088e-05, 'val/num_eos_tokens': 0, 'lr': 8.672219500244979e-06, 'episode': 6752, 'epoch': 0.83}



                                                                         
 83%|████████▎ | 1689/2041 [2:26:42<30:06,  5.13s/it]

{'eps': 0, 'objective/kl': 80.18770599365234, 'objective/entropy': 74.47852325439453, 'objective/non_score_reward': -4.009385108947754, 'objective/rlhf_reward': -6.66295862197876, 'objective/scores': -2.653573513031006, 'policy/approxkl_avg': 0.0053046527318656445, 'policy/clipfrac_avg': 0.05424528568983078, 'loss/policy_avg': -0.03421482443809509, 'loss/value_avg': 0.3833175599575043, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4812777042388916, 'val/ratio': 0.9867048263549805, 'val/ratio_var': 0.00016080499335657805, 'val/num_eos_tokens': 0, 'lr': 8.647721705046547e-06, 'episode': 6756, 'epoch': 0.83}



                                                                         
 83%|████████▎ | 1690/2041 [2:26:47<30:15,  5.17s/it]

{'eps': 0, 'objective/kl': 72.715576171875, 'objective/entropy': 53.055965423583984, 'objective/non_score_reward': -3.6357789039611816, 'objective/rlhf_reward': -6.626378059387207, 'objective/scores': -2.9905991554260254, 'policy/approxkl_avg': 0.006790119223296642, 'policy/clipfrac_avg': 0.056603774428367615, 'loss/policy_avg': -0.027328185737133026, 'loss/value_avg': 0.3228529095649719, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1171799898147583, 'val/ratio': 0.9885345101356506, 'val/ratio_var': 8.621710003353655e-05, 'val/num_eos_tokens': 0, 'lr': 8.623223909848115e-06, 'episode': 6760, 'epoch': 0.83}



                                                                         
 83%|████████▎ | 1691/2041 [2:26:53<30:07,  5.16s/it]

{'eps': 0, 'objective/kl': 72.12664794921875, 'objective/entropy': 27.311359405517578, 'objective/non_score_reward': -3.606332302093506, 'objective/rlhf_reward': -7.074850559234619, 'objective/scores': -3.4685182571411133, 'policy/approxkl_avg': 0.025965603068470955, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.0225075650960207, 'loss/value_avg': 0.46070221066474915, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6489509344100952, 'val/ratio': 0.9841550588607788, 'val/ratio_var': 0.00016214344941545278, 'val/num_eos_tokens': 0, 'lr': 8.598726114649681e-06, 'episode': 6764, 'epoch': 0.83}



                                                                         
 83%|████████▎ | 1692/2041 [2:26:58<29:57,  5.15s/it]

{'eps': 0, 'objective/kl': 80.80267333984375, 'objective/entropy': 62.67479705810547, 'objective/non_score_reward': -4.040133476257324, 'objective/rlhf_reward': -6.8145904541015625, 'objective/scores': -2.7744572162628174, 'policy/approxkl_avg': 0.0034184185788035393, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.02681606262922287, 'loss/value_avg': 0.43496641516685486, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2760913372039795, 'val/ratio': 0.989788293838501, 'val/ratio_var': 8.840418740874156e-05, 'val/num_eos_tokens': 0, 'lr': 8.574228319451249e-06, 'episode': 6768, 'epoch': 0.83}



                                                                         
 83%|████████▎ | 1693/2041 [2:27:03<29:51,  5.15s/it]

{'eps': 0, 'objective/kl': 82.69828796386719, 'objective/entropy': 95.70988464355469, 'objective/non_score_reward': -4.134914398193359, 'objective/rlhf_reward': -6.674981117248535, 'objective/scores': -2.5400664806365967, 'policy/approxkl_avg': 0.010370522737503052, 'policy/clipfrac_avg': 0.07783018797636032, 'loss/policy_avg': -0.039545658975839615, 'loss/value_avg': 0.44925904273986816, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.634127140045166, 'val/ratio': 0.9853648543357849, 'val/ratio_var': 0.00013140670489519835, 'val/num_eos_tokens': 0, 'lr': 8.549730524252817e-06, 'episode': 6772, 'epoch': 0.83}



                                                                         
 83%|████████▎ | 1694/2041 [2:27:08<29:45,  5.15s/it]

{'eps': 0, 'objective/kl': 85.09071350097656, 'objective/entropy': 61.966697692871094, 'objective/non_score_reward': -4.254535675048828, 'objective/rlhf_reward': -6.93974494934082, 'objective/scores': -2.685209274291992, 'policy/approxkl_avg': 0.005481462925672531, 'policy/clipfrac_avg': 0.04952830448746681, 'loss/policy_avg': -0.02642117254436016, 'loss/value_avg': 0.4816919267177582, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3418583869934082, 'val/ratio': 0.9940720796585083, 'val/ratio_var': 2.6741290639620274e-05, 'val/num_eos_tokens': 0, 'lr': 8.525232729054385e-06, 'episode': 6776, 'epoch': 0.83}



                                                                         
 83%|████████▎ | 1695/2041 [2:27:13<29:27,  5.11s/it]

{'eps': 0, 'objective/kl': 83.21971130371094, 'objective/entropy': 99.36219024658203, 'objective/non_score_reward': -4.160985469818115, 'objective/rlhf_reward': -6.988312721252441, 'objective/scores': -2.827327251434326, 'policy/approxkl_avg': 0.006884680595248938, 'policy/clipfrac_avg': 0.07193396240472794, 'loss/policy_avg': -0.036679137498140335, 'loss/value_avg': 0.5685412883758545, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.740206241607666, 'val/ratio': 0.9952602386474609, 'val/ratio_var': 1.5173844985838514e-05, 'val/num_eos_tokens': 0, 'lr': 8.500734933855953e-06, 'episode': 6780, 'epoch': 0.83}



                                                                         
 83%|████████▎ | 1696/2041 [2:27:18<29:19,  5.10s/it]

{'eps': 0, 'objective/kl': 82.10430145263672, 'objective/entropy': 91.08967590332031, 'objective/non_score_reward': -4.105215072631836, 'objective/rlhf_reward': -5.84827184677124, 'objective/scores': -1.7430566549301147, 'policy/approxkl_avg': 0.008189013227820396, 'policy/clipfrac_avg': 0.0766509473323822, 'loss/policy_avg': -0.037095047533512115, 'loss/value_avg': 0.42724645137786865, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5496760606765747, 'val/ratio': 1.0002799034118652, 'val/ratio_var': 4.1189844068867387e-07, 'val/num_eos_tokens': 0, 'lr': 8.476237138657521e-06, 'episode': 6784, 'epoch': 0.83}



                                                                         
 83%|████████▎ | 1697/2041 [2:27:23<29:24,  5.13s/it]

{'eps': 0, 'objective/kl': 89.13235473632812, 'objective/entropy': 60.13734817504883, 'objective/non_score_reward': -4.45661735534668, 'objective/rlhf_reward': -7.502844333648682, 'objective/scores': -3.046226978302002, 'policy/approxkl_avg': 0.004998867865651846, 'policy/clipfrac_avg': 0.049528300762176514, 'loss/policy_avg': -0.026221273466944695, 'loss/value_avg': 0.5003237128257751, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.198788046836853, 'val/ratio': 0.9837987422943115, 'val/ratio_var': 0.00021722273959312588, 'val/num_eos_tokens': 0, 'lr': 8.45173934345909e-06, 'episode': 6788, 'epoch': 0.83}



                                                                         
 83%|████████▎ | 1698/2041 [2:27:28<29:23,  5.14s/it]

{'eps': 0, 'objective/kl': 79.62828063964844, 'objective/entropy': 66.6798324584961, 'objective/non_score_reward': -3.9814136028289795, 'objective/rlhf_reward': -6.611386775970459, 'objective/scores': -2.6299731731414795, 'policy/approxkl_avg': 0.005175570026040077, 'policy/clipfrac_avg': 0.05188678950071335, 'loss/policy_avg': -0.02771219238638878, 'loss/value_avg': 0.288735032081604, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.222277045249939, 'val/ratio': 0.9956851005554199, 'val/ratio_var': 9.930286068993155e-06, 'val/num_eos_tokens': 0, 'lr': 8.427241548260657e-06, 'episode': 6792, 'epoch': 0.83}



                                                                         
 83%|████████▎ | 1699/2041 [2:27:34<29:25,  5.16s/it]

{'eps': 0, 'objective/kl': 78.60833740234375, 'objective/entropy': 70.53010559082031, 'objective/non_score_reward': -3.9304168224334717, 'objective/rlhf_reward': -6.848945617675781, 'objective/scores': -2.9185285568237305, 'policy/approxkl_avg': 0.004041249398142099, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.026370838284492493, 'loss/value_avg': 0.38361072540283203, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.206040620803833, 'val/ratio': 0.9860953688621521, 'val/ratio_var': 0.00016932986909523606, 'val/num_eos_tokens': 0, 'lr': 8.402743753062224e-06, 'episode': 6796, 'epoch': 0.83}



                                                                         
 83%|████████▎ | 1700/2041 [2:27:39<29:16,  5.15s/it]

{'eps': 0, 'objective/kl': 89.83116149902344, 'objective/entropy': 72.87551879882812, 'objective/non_score_reward': -4.491558074951172, 'objective/rlhf_reward': -6.949175834655762, 'objective/scores': -2.45761775970459, 'policy/approxkl_avg': 0.010067992843687534, 'policy/clipfrac_avg': 0.07075472176074982, 'loss/policy_avg': -0.03685149550437927, 'loss/value_avg': 0.6323410272598267, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3566796779632568, 'val/ratio': 0.9842165112495422, 'val/ratio_var': 0.00018031678337138146, 'val/num_eos_tokens': 0, 'lr': 8.378245957863792e-06, 'episode': 6800, 'epoch': 0.83}



                                                                         
 83%|████████▎ | 1701/2041 [2:27:44<29:12,  5.16s/it]

{'eps': 0, 'objective/kl': 76.4822998046875, 'objective/entropy': 60.595951080322266, 'objective/non_score_reward': -3.82411527633667, 'objective/rlhf_reward': -6.738795280456543, 'objective/scores': -2.914679765701294, 'policy/approxkl_avg': 0.004525442142039537, 'policy/clipfrac_avg': 0.036556605249643326, 'loss/policy_avg': -0.024712489917874336, 'loss/value_avg': 0.5022667050361633, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.230820894241333, 'val/ratio': 0.9880021810531616, 'val/ratio_var': 0.00012458009587135166, 'val/num_eos_tokens': 0, 'lr': 8.35374816266536e-06, 'episode': 6804, 'epoch': 0.83}



                                                                         
 83%|████████▎ | 1702/2041 [2:27:49<29:12,  5.17s/it]

{'eps': 0, 'objective/kl': 93.67332458496094, 'objective/entropy': 59.462135314941406, 'objective/non_score_reward': -4.683666229248047, 'objective/rlhf_reward': -7.4991865158081055, 'objective/scores': -2.8155200481414795, 'policy/approxkl_avg': 0.006691337563097477, 'policy/clipfrac_avg': 0.04599056392908096, 'loss/policy_avg': -0.03044920600950718, 'loss/value_avg': 0.42697209119796753, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2166311740875244, 'val/ratio': 0.9888983368873596, 'val/ratio_var': 8.804702520137653e-05, 'val/num_eos_tokens': 0, 'lr': 8.329250367466928e-06, 'episode': 6808, 'epoch': 0.83}



                                                                         
 83%|████████▎ | 1703/2041 [2:27:54<29:04,  5.16s/it]

{'eps': 0, 'objective/kl': 78.76048278808594, 'objective/entropy': 51.54247283935547, 'objective/non_score_reward': -3.9380242824554443, 'objective/rlhf_reward': -7.320323944091797, 'objective/scores': -3.3822994232177734, 'policy/approxkl_avg': 0.001880536088719964, 'policy/clipfrac_avg': 0.02004716917872429, 'loss/policy_avg': -0.016228483989834785, 'loss/value_avg': 0.5142135620117188, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1684532165527344, 'val/ratio': 0.9898322820663452, 'val/ratio_var': 9.460485307499766e-05, 'val/num_eos_tokens': 0, 'lr': 8.304752572268498e-06, 'episode': 6812, 'epoch': 0.83}



                                                                         
 83%|████████▎ | 1704/2041 [2:27:59<28:56,  5.15s/it]

{'eps': 0, 'objective/kl': 86.30226135253906, 'objective/entropy': 73.28231811523438, 'objective/non_score_reward': -4.315113544464111, 'objective/rlhf_reward': -7.252533912658691, 'objective/scores': -2.93742036819458, 'policy/approxkl_avg': 0.004616607911884785, 'policy/clipfrac_avg': 0.044811319559812546, 'loss/policy_avg': -0.03077036887407303, 'loss/value_avg': 0.4440966844558716, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2620161771774292, 'val/ratio': 0.9897785782814026, 'val/ratio_var': 7.88567922427319e-05, 'val/num_eos_tokens': 0, 'lr': 8.280254777070064e-06, 'episode': 6816, 'epoch': 0.83}



                                                                         
 84%|████████▎ | 1705/2041 [2:28:05<28:57,  5.17s/it]

{'eps': 0, 'objective/kl': 73.95133972167969, 'objective/entropy': 86.84191131591797, 'objective/non_score_reward': -3.6975669860839844, 'objective/rlhf_reward': -6.391812324523926, 'objective/scores': -2.6942451000213623, 'policy/approxkl_avg': 0.0059928251430392265, 'policy/clipfrac_avg': 0.05424527823925018, 'loss/policy_avg': -0.03303755074739456, 'loss/value_avg': 0.3826402425765991, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.61726975440979, 'val/ratio': 0.992774248123169, 'val/ratio_var': 3.8591610064031556e-05, 'val/num_eos_tokens': 0, 'lr': 8.255756981871632e-06, 'episode': 6820, 'epoch': 0.84}



                                                                         
 84%|████████▎ | 1706/2041 [2:28:10<28:43,  5.14s/it]

{'eps': 0, 'objective/kl': 58.11920166015625, 'objective/entropy': 34.94342803955078, 'objective/non_score_reward': -2.9059600830078125, 'objective/rlhf_reward': -5.9894914627075195, 'objective/scores': -3.083531379699707, 'policy/approxkl_avg': 0.0038308126386255026, 'policy/clipfrac_avg': 0.03891509398818016, 'loss/policy_avg': -0.022911807522177696, 'loss/value_avg': 0.32303386926651, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7529361844062805, 'val/ratio': 0.988420844078064, 'val/ratio_var': 0.00011364644160494208, 'val/num_eos_tokens': 0, 'lr': 8.2312591866732e-06, 'episode': 6824, 'epoch': 0.84}



                                                                         
 84%|████████▎ | 1707/2041 [2:28:15<28:30,  5.12s/it]

{'eps': 0, 'objective/kl': 79.83528137207031, 'objective/entropy': 83.64379119873047, 'objective/non_score_reward': -3.9917638301849365, 'objective/rlhf_reward': -6.3469085693359375, 'objective/scores': -2.35514497756958, 'policy/approxkl_avg': 0.004942374769598246, 'policy/clipfrac_avg': 0.06367924809455872, 'loss/policy_avg': -0.03463345766067505, 'loss/value_avg': 0.4975683093070984, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5946331024169922, 'val/ratio': 0.9893415570259094, 'val/ratio_var': 9.261161903850734e-05, 'val/num_eos_tokens': 0, 'lr': 8.206761391474768e-06, 'episode': 6828, 'epoch': 0.84}



                                                                         
 84%|████████▎ | 1708/2041 [2:28:20<28:33,  5.15s/it]

{'eps': 0, 'objective/kl': 73.87864685058594, 'objective/entropy': 48.87690734863281, 'objective/non_score_reward': -3.69393253326416, 'objective/rlhf_reward': -6.202498435974121, 'objective/scores': -2.508565902709961, 'policy/approxkl_avg': 0.005024620797485113, 'policy/clipfrac_avg': 0.04716981202363968, 'loss/policy_avg': -0.023431692272424698, 'loss/value_avg': 0.3352205157279968, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9731965661048889, 'val/ratio': 0.9902658462524414, 'val/ratio_var': 6.683875108137727e-05, 'val/num_eos_tokens': 0, 'lr': 8.182263596276334e-06, 'episode': 6832, 'epoch': 0.84}



                                                                         
 84%|████████▎ | 1709/2041 [2:28:25<28:37,  5.17s/it]

{'eps': 0, 'objective/kl': 80.2205581665039, 'objective/entropy': 62.7989501953125, 'objective/non_score_reward': -4.011027812957764, 'objective/rlhf_reward': -7.140700340270996, 'objective/scores': -3.1296722888946533, 'policy/approxkl_avg': 0.009119781665503979, 'policy/clipfrac_avg': 0.06014150753617287, 'loss/policy_avg': -0.02924875169992447, 'loss/value_avg': 0.5857161283493042, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1922290325164795, 'val/ratio': 0.9928345680236816, 'val/ratio_var': 3.228795321774669e-05, 'val/num_eos_tokens': 0, 'lr': 8.157765801077902e-06, 'episode': 6836, 'epoch': 0.84}



                                                                         
 84%|████████▍ | 1710/2041 [2:28:30<28:36,  5.19s/it]

{'eps': 0, 'objective/kl': 85.8691635131836, 'objective/entropy': 79.90702819824219, 'objective/non_score_reward': -4.293458461761475, 'objective/rlhf_reward': -6.768457412719727, 'objective/scores': -2.474998712539673, 'policy/approxkl_avg': 0.005156383849680424, 'policy/clipfrac_avg': 0.06485849618911743, 'loss/policy_avg': -0.034412022680044174, 'loss/value_avg': 0.40842607617378235, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5136293172836304, 'val/ratio': 0.9955443143844604, 'val/ratio_var': 1.3207570191298146e-05, 'val/num_eos_tokens': 0, 'lr': 8.133268005879472e-06, 'episode': 6840, 'epoch': 0.84}



                                                                         
 84%|████████▍ | 1711/2041 [2:28:36<28:29,  5.18s/it]

{'eps': 0, 'objective/kl': 74.26577758789062, 'objective/entropy': 65.67731475830078, 'objective/non_score_reward': -3.7132890224456787, 'objective/rlhf_reward': -6.482379913330078, 'objective/scores': -2.7690906524658203, 'policy/approxkl_avg': 0.005273491609841585, 'policy/clipfrac_avg': 0.056603774428367615, 'loss/policy_avg': -0.03163885697722435, 'loss/value_avg': 0.38183534145355225, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2386928796768188, 'val/ratio': 1.0045757293701172, 'val/ratio_var': 2.236371256003622e-05, 'val/num_eos_tokens': 0, 'lr': 8.10877021068104e-06, 'episode': 6844, 'epoch': 0.84}



                                                                         
 84%|████████▍ | 1712/2041 [2:28:41<28:25,  5.19s/it]

{'eps': 0, 'objective/kl': 79.7450180053711, 'objective/entropy': 83.38206481933594, 'objective/non_score_reward': -3.987251043319702, 'objective/rlhf_reward': -6.608168601989746, 'objective/scores': -2.620917558670044, 'policy/approxkl_avg': 0.004262953065335751, 'policy/clipfrac_avg': 0.04245283082127571, 'loss/policy_avg': -0.029256662353873253, 'loss/value_avg': 0.4788872301578522, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4758098125457764, 'val/ratio': 0.9927061200141907, 'val/ratio_var': 3.696084240800701e-05, 'val/num_eos_tokens': 0, 'lr': 8.084272415482607e-06, 'episode': 6848, 'epoch': 0.84}



                                                                         
 84%|████████▍ | 1713/2041 [2:28:46<28:24,  5.20s/it]

{'eps': 0, 'objective/kl': 71.61626434326172, 'objective/entropy': 48.18042755126953, 'objective/non_score_reward': -3.58081316947937, 'objective/rlhf_reward': -7.0599164962768555, 'objective/scores': -3.4791033267974854, 'policy/approxkl_avg': 0.04195667430758476, 'policy/clipfrac_avg': 0.049528300762176514, 'loss/policy_avg': -0.027470463886857033, 'loss/value_avg': 0.4925183057785034, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9239506721496582, 'val/ratio': 0.9868510961532593, 'val/ratio_var': 0.00010944897076115012, 'val/num_eos_tokens': 0, 'lr': 8.059774620284175e-06, 'episode': 6852, 'epoch': 0.84}



                                                                         
 84%|████████▍ | 1714/2041 [2:28:51<28:19,  5.20s/it]

{'eps': 0, 'objective/kl': 76.01705932617188, 'objective/entropy': 45.614097595214844, 'objective/non_score_reward': -3.8008527755737305, 'objective/rlhf_reward': -7.509349822998047, 'objective/scores': -3.7084968090057373, 'policy/approxkl_avg': 0.00703087355941534, 'policy/clipfrac_avg': 0.05188679322600365, 'loss/policy_avg': -0.0314129963517189, 'loss/value_avg': 0.4740177094936371, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9893143177032471, 'val/ratio': 0.9936335682868958, 'val/ratio_var': 3.151905184495263e-05, 'val/num_eos_tokens': 0, 'lr': 8.035276825085743e-06, 'episode': 6856, 'epoch': 0.84}



                                                                         
 84%|████████▍ | 1715/2041 [2:28:56<28:20,  5.21s/it]

{'eps': 0, 'objective/kl': 73.07972717285156, 'objective/entropy': 93.11051940917969, 'objective/non_score_reward': -3.6539864540100098, 'objective/rlhf_reward': -6.612837791442871, 'objective/scores': -2.9588513374328613, 'policy/approxkl_avg': 0.010308990254998207, 'policy/clipfrac_avg': 0.09787736088037491, 'loss/policy_avg': -0.042622894048690796, 'loss/value_avg': 0.47798728942871094, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.6385747194290161, 'val/ratio': 0.9904833436012268, 'val/ratio_var': 4.886629540123977e-05, 'val/num_eos_tokens': 0, 'lr': 8.01077902988731e-06, 'episode': 6860, 'epoch': 0.84}



                                                                         
 84%|████████▍ | 1716/2041 [2:29:02<28:10,  5.20s/it]

{'eps': 0, 'objective/kl': 72.30643463134766, 'objective/entropy': 62.82506561279297, 'objective/non_score_reward': -3.615321636199951, 'objective/rlhf_reward': -7.1035566329956055, 'objective/scores': -3.4882352352142334, 'policy/approxkl_avg': 0.0026257981080561876, 'policy/clipfrac_avg': 0.03773584961891174, 'loss/policy_avg': -0.024743899703025818, 'loss/value_avg': 0.44931161403656006, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2641055583953857, 'val/ratio': 0.9921900033950806, 'val/ratio_var': 5.092895662528463e-05, 'val/num_eos_tokens': 0, 'lr': 7.986281234688879e-06, 'episode': 6864, 'epoch': 0.84}



                                                                         
 84%|████████▍ | 1717/2041 [2:29:07<27:51,  5.16s/it]

{'eps': 0, 'objective/kl': 77.00247192382812, 'objective/entropy': 68.21355438232422, 'objective/non_score_reward': -3.850123882293701, 'objective/rlhf_reward': -6.38572883605957, 'objective/scores': -2.5356051921844482, 'policy/approxkl_avg': 0.004788164049386978, 'policy/clipfrac_avg': 0.06367924809455872, 'loss/policy_avg': -0.03399917483329773, 'loss/value_avg': 0.5618267059326172, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.283245325088501, 'val/ratio': 0.986527144908905, 'val/ratio_var': 0.0001387048396281898, 'val/num_eos_tokens': 0, 'lr': 7.961783439490445e-06, 'episode': 6868, 'epoch': 0.84}



                                                                         
 84%|████████▍ | 1718/2041 [2:29:12<27:46,  5.16s/it]

{'eps': 0, 'objective/kl': 87.13652038574219, 'objective/entropy': 65.23968505859375, 'objective/non_score_reward': -4.356825828552246, 'objective/rlhf_reward': -6.854784965515137, 'objective/scores': -2.4979588985443115, 'policy/approxkl_avg': 0.025891730561852455, 'policy/clipfrac_avg': 0.07547169923782349, 'loss/policy_avg': -0.039224058389663696, 'loss/value_avg': 0.7493033409118652, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.257354974746704, 'val/ratio': 0.9880108833312988, 'val/ratio_var': 9.097185829887167e-05, 'val/num_eos_tokens': 0, 'lr': 7.937285644292015e-06, 'episode': 6872, 'epoch': 0.84}



                                                                         
 84%|████████▍ | 1719/2041 [2:29:17<27:41,  5.16s/it]

{'eps': 0, 'objective/kl': 96.07608032226562, 'objective/entropy': 95.18228912353516, 'objective/non_score_reward': -4.803804397583008, 'objective/rlhf_reward': -6.79213809967041, 'objective/scores': -1.9883337020874023, 'policy/approxkl_avg': 0.0037724191788583994, 'policy/clipfrac_avg': 0.03655660152435303, 'loss/policy_avg': -0.027076585218310356, 'loss/value_avg': 0.799605667591095, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.6254326105117798, 'val/ratio': 0.9965547323226929, 'val/ratio_var': 7.0713435889047105e-06, 'val/num_eos_tokens': 0, 'lr': 7.912787849093583e-06, 'episode': 6876, 'epoch': 0.84}



                                                                         
 84%|████████▍ | 1720/2041 [2:29:22<27:28,  5.14s/it]

{'eps': 0, 'objective/kl': 101.7544174194336, 'objective/entropy': 115.52490234375, 'objective/non_score_reward': -5.08772087097168, 'objective/rlhf_reward': -7.457805156707764, 'objective/scores': -2.370084285736084, 'policy/approxkl_avg': 0.006455558352172375, 'policy/clipfrac_avg': 0.0695754662156105, 'loss/policy_avg': -0.034528881311416626, 'loss/value_avg': 0.48598241806030273, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 2.028998374938965, 'val/ratio': 0.9990943074226379, 'val/ratio_var': 2.6025547867902787e-06, 'val/num_eos_tokens': 0, 'lr': 7.888290053895151e-06, 'episode': 6880, 'epoch': 0.84}



                                                                         
{'eps': 0, 'objective/kl': 83.31672668457031, 'objective/entropy': 64.33372497558594, 'objective/non_score_reward': -4.165836334228516, 'objective/rlhf_reward': -6.853513240814209, 'objective/scores': -2.6876769065856934, 'policy/approxkl_avg': 0.005669498350471258, 'policy/clipfrac_avg': 0.048349056392908096, 'loss/policy_avg': -0.03136012330651283, 'loss/value_avg': 0.41198593378067017, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4562745094299316, 'val/ratio': 0.9880529642105103, 'val/ratio_var': 0.00011098795948782936, 'val/num_eos_tokens': 0, 'lr': 7.863792258696717e-06, 'episode': 6884, 'epoch': 0.84}



                                                                         
{'eps': 0, 'objective/kl': 70.37275695800781, 'objective/entropy': 58.20444869995117, 'objective/non_score_reward': -3.5186381340026855, 'objective/rlhf_reward': -6.120388984680176, 'objective/scores': -2.601750612258911, 'policy/approxkl_avg': 0.007241655141115189, 'policy/clipfrac_avg': 0.03891509398818016, 'loss/policy_avg': -0.02629910409450531, 'loss/value_avg': 0.42130470275878906, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.239839792251587, 'val/ratio': 0.9917967915534973, 'val/ratio_var': 3.8266978663159534e-05, 'val/num_eos_tokens': 0, 'lr': 7.839294463498285e-06, 'episode': 6888, 'epoch': 0.84}



                                                                         
{'eps': 0, 'objective/kl': 66.9892578125, 'objective/entropy': 57.93387222290039, 'objective/non_score_reward': -3.3494627475738525, 'objective/rlhf_reward': -6.222592353820801, 'objective/scores': -2.8731298446655273, 'policy/approxkl_avg': 0.006341810803860426, 'policy/clipfrac_avg': 0.05896226316690445, 'loss/policy_avg': -0.029054593294858932, 'loss/value_avg': 0.40759021043777466, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0992956161499023, 'val/ratio': 0.9988325834274292, 'val/ratio_var': 9.81951075118559e-07, 'val/num_eos_tokens': 0, 'lr': 7.814796668299853e-06, 'episode': 6892, 'epoch': 0.84}



                                                                         
{'eps': 0, 'objective/kl': 76.6099853515625, 'objective/entropy': 65.85746002197266, 'objective/non_score_reward': -3.8304994106292725, 'objective/rlhf_reward': -6.649543762207031, 'objective/scores': -2.819044589996338, 'policy/approxkl_avg': 0.006042352877557278, 'policy/clipfrac_avg': 0.07075471431016922, 'loss/policy_avg': -0.026476528495550156, 'loss/value_avg': 0.4802394211292267, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.191368818283081, 'val/ratio': 0.9994363188743591, 'val/ratio_var': 1.861956206994364e-06, 'val/num_eos_tokens': 0, 'lr': 7.790298873101421e-06, 'episode': 6896, 'epoch': 0.84}



                                                                         
{'eps': 0, 'objective/kl': 76.50682067871094, 'objective/entropy': 72.7701416015625, 'objective/non_score_reward': -3.825340986251831, 'objective/rlhf_reward': -5.528238773345947, 'objective/scores': -1.7028976678848267, 'policy/approxkl_avg': 0.007850688882172108, 'policy/clipfrac_avg': 0.07193395495414734, 'loss/policy_avg': -0.03605272248387337, 'loss/value_avg': 0.4487074017524719, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3153157234191895, 'val/ratio': 0.991047739982605, 'val/ratio_var': 6.364969158312306e-05, 'val/num_eos_tokens': 0, 'lr': 7.765801077902988e-06, 'episode': 6900, 'epoch': 0.85}



                                                                         
{'eps': 0, 'objective/kl': 69.89257049560547, 'objective/entropy': 59.068321228027344, 'objective/non_score_reward': -3.49462890625, 'objective/rlhf_reward': -6.176792144775391, 'objective/scores': -2.6821634769439697, 'policy/approxkl_avg': 0.006446388550102711, 'policy/clipfrac_avg': 0.04245283082127571, 'loss/policy_avg': -0.021061183884739876, 'loss/value_avg': 0.37299877405166626, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9023740887641907, 'val/ratio': 0.9949589371681213, 'val/ratio_var': 1.3691550520888995e-05, 'val/num_eos_tokens': 0, 'lr': 7.741303282704558e-06, 'episode': 6904, 'epoch': 0.85}



                                                                         
{'eps': 0, 'objective/kl': 66.35565948486328, 'objective/entropy': 73.67578887939453, 'objective/non_score_reward': -3.3177833557128906, 'objective/rlhf_reward': -6.115436553955078, 'objective/scores': -2.7976529598236084, 'policy/approxkl_avg': 0.004804861266165972, 'policy/clipfrac_avg': 0.053066037595272064, 'loss/policy_avg': -0.032926056534051895, 'loss/value_avg': 0.3946593403816223, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3586747646331787, 'val/ratio': 0.9916155338287354, 'val/ratio_var': 5.354362292564474e-05, 'val/num_eos_tokens': 0, 'lr': 7.716805487506126e-06, 'episode': 6908, 'epoch': 0.85}



                                                                         
{'eps': 0, 'objective/kl': 59.439109802246094, 'objective/entropy': 46.87187576293945, 'objective/non_score_reward': -2.9719555377960205, 'objective/rlhf_reward': -5.716320037841797, 'objective/scores': -2.7443642616271973, 'policy/approxkl_avg': 0.010666560381650925, 'policy/clipfrac_avg': 0.030660375952720642, 'loss/policy_avg': -0.02239859849214554, 'loss/value_avg': 0.2738281488418579, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8977501392364502, 'val/ratio': 0.9868454933166504, 'val/ratio_var': 0.00013526146358344704, 'val/num_eos_tokens': 0, 'lr': 7.692307692307694e-06, 'episode': 6912, 'epoch': 0.85}



                                                                         
{'eps': 0, 'objective/kl': 83.69056701660156, 'objective/entropy': 51.02679443359375, 'objective/non_score_reward': -4.184528350830078, 'objective/rlhf_reward': -5.755707263946533, 'objective/scores': -1.571178913116455, 'policy/approxkl_avg': 0.004256406333297491, 'policy/clipfrac_avg': 0.03891509398818016, 'loss/policy_avg': -0.023685026913881302, 'loss/value_avg': 0.3842700719833374, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.062779188156128, 'val/ratio': 0.9946178197860718, 'val/ratio_var': 2.015708014369011e-05, 'val/num_eos_tokens': 0, 'lr': 7.66780989710926e-06, 'episode': 6916, 'epoch': 0.85}



                                                                         
{'eps': 0, 'objective/kl': 69.55614471435547, 'objective/entropy': 80.25108337402344, 'objective/non_score_reward': -3.47780704498291, 'objective/rlhf_reward': -6.324712753295898, 'objective/scores': -2.8469057083129883, 'policy/approxkl_avg': 0.004231570288538933, 'policy/clipfrac_avg': 0.048349056392908096, 'loss/policy_avg': -0.030010059475898743, 'loss/value_avg': 0.4984920024871826, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5955685377120972, 'val/ratio': 0.9990715384483337, 'val/ratio_var': 8.767678991716821e-07, 'val/num_eos_tokens': 0, 'lr': 7.643312101910828e-06, 'episode': 6920, 'epoch': 0.85}



                                                                         
{'eps': 0, 'objective/kl': 72.42880249023438, 'objective/entropy': 64.1834716796875, 'objective/non_score_reward': -3.6214396953582764, 'objective/rlhf_reward': -6.025835037231445, 'objective/scores': -2.40439510345459, 'policy/approxkl_avg': 0.005089495796710253, 'policy/clipfrac_avg': 0.05188679322600365, 'loss/policy_avg': -0.026681268587708473, 'loss/value_avg': 0.5132365226745605, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3009443283081055, 'val/ratio': 0.9899545907974243, 'val/ratio_var': 7.986401760717854e-05, 'val/num_eos_tokens': 0, 'lr': 7.618814306712396e-06, 'episode': 6924, 'epoch': 0.85}



                                                                         
{'eps': 0, 'objective/kl': 98.68632507324219, 'objective/entropy': 81.45480346679688, 'objective/non_score_reward': -4.934316635131836, 'objective/rlhf_reward': -8.490464210510254, 'objective/scores': -3.556147575378418, 'policy/approxkl_avg': 0.0055936588905751705, 'policy/clipfrac_avg': 0.04599056392908096, 'loss/policy_avg': -0.029433950781822205, 'loss/value_avg': 0.7882769107818604, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4366612434387207, 'val/ratio': 0.9934268593788147, 'val/ratio_var': 3.235696203773841e-05, 'val/num_eos_tokens': 0, 'lr': 7.594316511513964e-06, 'episode': 6928, 'epoch': 0.85}



                                                                         
{'eps': 0, 'objective/kl': 72.40641021728516, 'objective/entropy': 55.67802047729492, 'objective/non_score_reward': -3.6203205585479736, 'objective/rlhf_reward': -7.148221969604492, 'objective/scores': -3.5279011726379395, 'policy/approxkl_avg': 0.0045753163285553455, 'policy/clipfrac_avg': 0.05070754513144493, 'loss/policy_avg': -0.02932366356253624, 'loss/value_avg': 0.6318367123603821, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0880107879638672, 'val/ratio': 0.987838864326477, 'val/ratio_var': 0.0001262948353542015, 'val/num_eos_tokens': 0, 'lr': 7.569818716315533e-06, 'episode': 6932, 'epoch': 0.85}



                                                                         
{'eps': 0, 'objective/kl': 71.2122573852539, 'objective/entropy': 60.198341369628906, 'objective/non_score_reward': -3.560612916946411, 'objective/rlhf_reward': -5.746020793914795, 'objective/scores': -2.185407876968384, 'policy/approxkl_avg': 0.0031380178406834602, 'policy/clipfrac_avg': 0.041273586452007294, 'loss/policy_avg': -0.026042848825454712, 'loss/value_avg': 0.41757869720458984, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.151036024093628, 'val/ratio': 0.9917910099029541, 'val/ratio_var': 6.347573798848316e-05, 'val/num_eos_tokens': 0, 'lr': 7.545320921117099e-06, 'episode': 6936, 'epoch': 0.85}



                                                                         
{'eps': 0, 'objective/kl': 80.36621856689453, 'objective/entropy': 77.3096923828125, 'objective/non_score_reward': -4.018311500549316, 'objective/rlhf_reward': -7.139581680297852, 'objective/scores': -3.121270179748535, 'policy/approxkl_avg': 0.0461413599550724, 'policy/clipfrac_avg': 0.04245283082127571, 'loss/policy_avg': -0.027904942631721497, 'loss/value_avg': 0.8252853155136108, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4651994705200195, 'val/ratio': 0.9902714490890503, 'val/ratio_var': 4.9071535613620654e-05, 'val/num_eos_tokens': 0, 'lr': 7.520823125918667e-06, 'episode': 6940, 'epoch': 0.85}



                                                                         
{'eps': 0, 'objective/kl': 72.86357116699219, 'objective/entropy': 75.67405700683594, 'objective/non_score_reward': -3.643178701400757, 'objective/rlhf_reward': -5.837068557739258, 'objective/scores': -2.19389009475708, 'policy/approxkl_avg': 0.0036362474784255028, 'policy/clipfrac_avg': 0.048349056392908096, 'loss/policy_avg': -0.02683779038488865, 'loss/value_avg': 0.31841522455215454, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.220694899559021, 'val/ratio': 0.9895843267440796, 'val/ratio_var': 8.604831964476034e-05, 'val/num_eos_tokens': 0, 'lr': 7.496325330720235e-06, 'episode': 6944, 'epoch': 0.85}



                                                                         
{'eps': 0, 'objective/kl': 69.1983871459961, 'objective/entropy': 56.2615966796875, 'objective/non_score_reward': -3.4599192142486572, 'objective/rlhf_reward': -6.345717430114746, 'objective/scores': -2.885798215866089, 'policy/approxkl_avg': 0.007458964828401804, 'policy/clipfrac_avg': 0.04716981202363968, 'loss/policy_avg': -0.027035150676965714, 'loss/value_avg': 0.6165497899055481, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0031728744506836, 'val/ratio': 0.9891263246536255, 'val/ratio_var': 8.759546471992508e-05, 'val/num_eos_tokens': 0, 'lr': 7.471827535521804e-06, 'episode': 6948, 'epoch': 0.85}



                                                                         
{'eps': 0, 'objective/kl': 102.50995635986328, 'objective/entropy': 63.30743408203125, 'objective/non_score_reward': -5.125497817993164, 'objective/rlhf_reward': -8.189788818359375, 'objective/scores': -3.064291477203369, 'policy/approxkl_avg': 0.005224092397838831, 'policy/clipfrac_avg': 0.056603770703077316, 'loss/policy_avg': -0.03320733457803726, 'loss/value_avg': 0.7958246469497681, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.14067542552948, 'val/ratio': 0.9975161552429199, 'val/ratio_var': 2.9164677926019067e-06, 'val/num_eos_tokens': 0, 'lr': 7.447329740323371e-06, 'episode': 6952, 'epoch': 0.85}



                                                                         
{'eps': 0, 'objective/kl': 87.07646179199219, 'objective/entropy': 67.58435821533203, 'objective/non_score_reward': -4.353823184967041, 'objective/rlhf_reward': -6.775844573974609, 'objective/scores': -2.4220213890075684, 'policy/approxkl_avg': 0.006386409513652325, 'policy/clipfrac_avg': 0.04009433835744858, 'loss/policy_avg': -0.026890072971582413, 'loss/value_avg': 0.5223250389099121, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2898482084274292, 'val/ratio': 0.9853410720825195, 'val/ratio_var': 0.0001683626906014979, 'val/num_eos_tokens': 0, 'lr': 7.422831945124939e-06, 'episode': 6956, 'epoch': 0.85}



                                                                         
{'eps': 0, 'objective/kl': 75.34502410888672, 'objective/entropy': 70.98753356933594, 'objective/non_score_reward': -3.767251491546631, 'objective/rlhf_reward': -7.015424728393555, 'objective/scores': -3.248173236846924, 'policy/approxkl_avg': 0.01237243227660656, 'policy/clipfrac_avg': 0.03773584961891174, 'loss/policy_avg': -0.02595646120607853, 'loss/value_avg': 0.7156559228897095, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.251243233680725, 'val/ratio': 0.9903219938278198, 'val/ratio_var': 6.0468955780379474e-05, 'val/num_eos_tokens': 0, 'lr': 7.398334149926507e-06, 'episode': 6960, 'epoch': 0.85}



                                                                         
{'eps': 0, 'objective/kl': 81.43183898925781, 'objective/entropy': 72.67577362060547, 'objective/non_score_reward': -4.071591854095459, 'objective/rlhf_reward': -6.586830139160156, 'objective/scores': -2.515238046646118, 'policy/approxkl_avg': 0.0055228471755981445, 'policy/clipfrac_avg': 0.053066033869981766, 'loss/policy_avg': -0.03157506883144379, 'loss/value_avg': 0.5875874757766724, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3676939010620117, 'val/ratio': 0.9933899641036987, 'val/ratio_var': 2.750346357061062e-05, 'val/num_eos_tokens': 0, 'lr': 7.373836354728076e-06, 'episode': 6964, 'epoch': 0.85}



                                                                         
{'eps': 0, 'objective/kl': 76.62881469726562, 'objective/entropy': 80.84806823730469, 'objective/non_score_reward': -3.8314406871795654, 'objective/rlhf_reward': -5.965253829956055, 'objective/scores': -2.1338133811950684, 'policy/approxkl_avg': 0.024915408343076706, 'policy/clipfrac_avg': 0.0766509398818016, 'loss/policy_avg': -0.03210899606347084, 'loss/value_avg': 0.5344683527946472, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4180504083633423, 'val/ratio': 1.009709119796753, 'val/ratio_var': 0.00020430974836926907, 'val/num_eos_tokens': 0, 'lr': 7.349338559529642e-06, 'episode': 6968, 'epoch': 0.85}



                                                                         
{'eps': 0, 'objective/kl': 79.22384643554688, 'objective/entropy': 66.74908447265625, 'objective/non_score_reward': -3.9611926078796387, 'objective/rlhf_reward': -6.449753284454346, 'objective/scores': -2.488560676574707, 'policy/approxkl_avg': 0.0039004040881991386, 'policy/clipfrac_avg': 0.044811319559812546, 'loss/policy_avg': -0.02242809720337391, 'loss/value_avg': 0.49324920773506165, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3091214895248413, 'val/ratio': 0.986078679561615, 'val/ratio_var': 0.0001634166546864435, 'val/num_eos_tokens': 0, 'lr': 7.32484076433121e-06, 'episode': 6972, 'epoch': 0.85}



                                                                         
{'eps': 0, 'objective/kl': 64.55303955078125, 'objective/entropy': 64.45176696777344, 'objective/non_score_reward': -3.227651834487915, 'objective/rlhf_reward': -6.831661224365234, 'objective/scores': -3.6040093898773193, 'policy/approxkl_avg': 0.005394993349909782, 'policy/clipfrac_avg': 0.041273586452007294, 'loss/policy_avg': -0.02710052579641342, 'loss/value_avg': 0.36195579171180725, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1197348833084106, 'val/ratio': 0.9854635000228882, 'val/ratio_var': 0.00016676855739206076, 'val/num_eos_tokens': 0, 'lr': 7.300342969132779e-06, 'episode': 6976, 'epoch': 0.85}



                                                                         
{'eps': 0, 'objective/kl': 72.4547119140625, 'objective/entropy': 34.386474609375, 'objective/non_score_reward': -3.6227359771728516, 'objective/rlhf_reward': -6.747838497161865, 'objective/scores': -3.1251025199890137, 'policy/approxkl_avg': 0.0034857764840126038, 'policy/clipfrac_avg': 0.024764150381088257, 'loss/policy_avg': -0.02277389168739319, 'loss/value_avg': 0.35529714822769165, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.905083179473877, 'val/ratio': 0.996355414390564, 'val/ratio_var': 9.179987500829156e-06, 'val/num_eos_tokens': 0, 'lr': 7.275845173934347e-06, 'episode': 6980, 'epoch': 0.86}



                                                                         
{'eps': 0, 'objective/kl': 79.38189697265625, 'objective/entropy': 67.2265625, 'objective/non_score_reward': -3.9690945148468018, 'objective/rlhf_reward': -6.530778884887695, 'objective/scores': -2.5616841316223145, 'policy/approxkl_avg': 0.008482121862471104, 'policy/clipfrac_avg': 0.0554245263338089, 'loss/policy_avg': -0.029490001499652863, 'loss/value_avg': 0.6999263763427734, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.25981867313385, 'val/ratio': 0.9927414655685425, 'val/ratio_var': 3.7177556805545464e-05, 'val/num_eos_tokens': 0, 'lr': 7.251347378735913e-06, 'episode': 6984, 'epoch': 0.86}



                                                                         
{'eps': 0, 'objective/kl': 71.25627136230469, 'objective/entropy': 79.58485412597656, 'objective/non_score_reward': -3.5628137588500977, 'objective/rlhf_reward': -6.253485202789307, 'objective/scores': -2.690671443939209, 'policy/approxkl_avg': 0.007474168203771114, 'policy/clipfrac_avg': 0.056603774428367615, 'loss/policy_avg': -0.033290326595306396, 'loss/value_avg': 0.5324517488479614, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2855770587921143, 'val/ratio': 0.989342451095581, 'val/ratio_var': 7.70084370742552e-05, 'val/num_eos_tokens': 0, 'lr': 7.226849583537481e-06, 'episode': 6988, 'epoch': 0.86}



                                                                         
{'eps': 0, 'objective/kl': 68.33104705810547, 'objective/entropy': 52.121150970458984, 'objective/non_score_reward': -3.4165525436401367, 'objective/rlhf_reward': -6.345257759094238, 'objective/scores': -2.9287054538726807, 'policy/approxkl_avg': 0.009479920379817486, 'policy/clipfrac_avg': 0.04363207519054413, 'loss/policy_avg': -0.022566555067896843, 'loss/value_avg': 0.43670451641082764, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9615167379379272, 'val/ratio': 0.9874053001403809, 'val/ratio_var': 0.00010301240399712697, 'val/num_eos_tokens': 0, 'lr': 7.20235178833905e-06, 'episode': 6992, 'epoch': 0.86}



                                                                         
{'eps': 0, 'objective/kl': 67.36225128173828, 'objective/entropy': 53.701480865478516, 'objective/non_score_reward': -3.368112564086914, 'objective/rlhf_reward': -6.102786540985107, 'objective/scores': -2.7346739768981934, 'policy/approxkl_avg': 0.007443631067872047, 'policy/clipfrac_avg': 0.05424527823925018, 'loss/policy_avg': -0.031220275908708572, 'loss/value_avg': 0.3840201795101166, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0863325595855713, 'val/ratio': 0.9976973533630371, 'val/ratio_var': 3.0498447358695557e-06, 'val/num_eos_tokens': 0, 'lr': 7.177853993140618e-06, 'episode': 6996, 'epoch': 0.86}



                                                                         
{'eps': 0, 'objective/kl': 87.24745178222656, 'objective/entropy': 64.55787658691406, 'objective/non_score_reward': -4.362372398376465, 'objective/rlhf_reward': -7.575273513793945, 'objective/scores': -3.2129011154174805, 'policy/approxkl_avg': 0.005497250705957413, 'policy/clipfrac_avg': 0.04952830448746681, 'loss/policy_avg': -0.03286966681480408, 'loss/value_avg': 0.5376623868942261, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4042956829071045, 'val/ratio': 1.0038046836853027, 'val/ratio_var': 1.2069875083398074e-05, 'val/num_eos_tokens': 0, 'lr': 7.153356197942186e-06, 'episode': 7000, 'epoch': 0.86}



                                                                         
{'eps': 0, 'objective/kl': 74.2739028930664, 'objective/entropy': 52.03766632080078, 'objective/non_score_reward': -3.7136952877044678, 'objective/rlhf_reward': -5.969252586364746, 'objective/scores': -2.2555572986602783, 'policy/approxkl_avg': 0.004962516017258167, 'policy/clipfrac_avg': 0.04363207519054413, 'loss/policy_avg': -0.02584249898791313, 'loss/value_avg': 0.3872430920600891, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0156729221343994, 'val/ratio': 0.9986116886138916, 'val/ratio_var': 1.0001277814808418e-06, 'val/num_eos_tokens': 0, 'lr': 7.128858402743753e-06, 'episode': 7004, 'epoch': 0.86}



                                                                         
{'eps': 0, 'objective/kl': 82.61480712890625, 'objective/entropy': 68.44747924804688, 'objective/non_score_reward': -4.130740165710449, 'objective/rlhf_reward': -6.500938892364502, 'objective/scores': -2.3701987266540527, 'policy/approxkl_avg': 0.002786149736493826, 'policy/clipfrac_avg': 0.033018868416547775, 'loss/policy_avg': -0.024984151124954224, 'loss/value_avg': 0.5823915004730225, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3579882383346558, 'val/ratio': 0.9961481094360352, 'val/ratio_var': 1.014414920064155e-05, 'val/num_eos_tokens': 0, 'lr': 7.1043606075453216e-06, 'episode': 7008, 'epoch': 0.86}



                                                                         
{'eps': 0, 'objective/kl': 77.31236267089844, 'objective/entropy': 73.29640197753906, 'objective/non_score_reward': -3.8656184673309326, 'objective/rlhf_reward': -5.936976909637451, 'objective/scores': -2.0713584423065186, 'policy/approxkl_avg': 0.011297560296952724, 'policy/clipfrac_avg': 0.04599056392908096, 'loss/policy_avg': -0.028024395927786827, 'loss/value_avg': 0.4163341522216797, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4811241626739502, 'val/ratio': 1.0172648429870605, 'val/ratio_var': 0.0003150797274429351, 'val/num_eos_tokens': 0, 'lr': 7.07986281234689e-06, 'episode': 7012, 'epoch': 0.86}



                                                                         
{'eps': 0, 'objective/kl': 81.20016479492188, 'objective/entropy': 52.54074478149414, 'objective/non_score_reward': -4.0600080490112305, 'objective/rlhf_reward': -5.8159074783325195, 'objective/scores': -1.7558996677398682, 'policy/approxkl_avg': 0.00532858120277524, 'policy/clipfrac_avg': 0.04363207519054413, 'loss/policy_avg': -0.025260040536522865, 'loss/value_avg': 0.37125056982040405, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1382845640182495, 'val/ratio': 0.9944843053817749, 'val/ratio_var': 1.8903452655649744e-05, 'val/num_eos_tokens': 0, 'lr': 7.055365017148458e-06, 'episode': 7016, 'epoch': 0.86}



                                                                         
{'eps': 0, 'objective/kl': 88.85706329345703, 'objective/entropy': 72.06796264648438, 'objective/non_score_reward': -4.4428534507751465, 'objective/rlhf_reward': -6.9580535888671875, 'objective/scores': -2.515199899673462, 'policy/approxkl_avg': 0.008560105226933956, 'policy/clipfrac_avg': 0.05778301879763603, 'loss/policy_avg': -0.029337935149669647, 'loss/value_avg': 0.4886209964752197, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3831336498260498, 'val/ratio': 0.9827789068222046, 'val/ratio_var': 0.00020998786203563213, 'val/num_eos_tokens': 0, 'lr': 7.030867221950024e-06, 'episode': 7020, 'epoch': 0.86}



                                                                         
{'eps': 0, 'objective/kl': 74.01224517822266, 'objective/entropy': 73.08950805664062, 'objective/non_score_reward': -3.7006120681762695, 'objective/rlhf_reward': -6.6491827964782715, 'objective/scores': -2.948570728302002, 'policy/approxkl_avg': 0.004649036098271608, 'policy/clipfrac_avg': 0.04599056765437126, 'loss/policy_avg': -0.02762635052204132, 'loss/value_avg': 0.4698619246482849, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4373542070388794, 'val/ratio': 1.0017322301864624, 'val/ratio_var': 3.916652076441096e-06, 'val/num_eos_tokens': 0, 'lr': 7.006369426751593e-06, 'episode': 7024, 'epoch': 0.86}



                                                                         
{'eps': 0, 'objective/kl': 67.3798828125, 'objective/entropy': 68.06570434570312, 'objective/non_score_reward': -3.3689937591552734, 'objective/rlhf_reward': -5.994762897491455, 'objective/scores': -2.6257691383361816, 'policy/approxkl_avg': 0.002920885570347309, 'policy/clipfrac_avg': 0.041273586452007294, 'loss/policy_avg': -0.02702268958091736, 'loss/value_avg': 0.4028371572494507, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2844734191894531, 'val/ratio': 0.9904249906539917, 'val/ratio_var': 8.849159348756075e-05, 'val/num_eos_tokens': 0, 'lr': 6.981871631553161e-06, 'episode': 7028, 'epoch': 0.86}



                                                                         
{'eps': 0, 'objective/kl': 68.56924438476562, 'objective/entropy': 57.47065734863281, 'objective/non_score_reward': -3.428462028503418, 'objective/rlhf_reward': -7.134311676025391, 'objective/scores': -3.7058496475219727, 'policy/approxkl_avg': 0.0022710319608449936, 'policy/clipfrac_avg': 0.03183962404727936, 'loss/policy_avg': -0.02257356606423855, 'loss/value_avg': 0.37774422764778137, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1157610416412354, 'val/ratio': 0.9908246397972107, 'val/ratio_var': 7.454268052242696e-05, 'val/num_eos_tokens': 0, 'lr': 6.957373836354729e-06, 'episode': 7032, 'epoch': 0.86}



                                                                         
{'eps': 0, 'objective/kl': 69.12730407714844, 'objective/entropy': 62.30282211303711, 'objective/non_score_reward': -3.4563651084899902, 'objective/rlhf_reward': -6.456963539123535, 'objective/scores': -3.000598430633545, 'policy/approxkl_avg': 0.005068755708634853, 'policy/clipfrac_avg': 0.04245283082127571, 'loss/policy_avg': -0.029658932238817215, 'loss/value_avg': 0.4251822829246521, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0859227180480957, 'val/ratio': 0.9893521070480347, 'val/ratio_var': 9.177569154417142e-05, 'val/num_eos_tokens': 0, 'lr': 6.932876041156295e-06, 'episode': 7036, 'epoch': 0.86}



                                                                         
 86%|████████▌ | 1760/2041 [2:32:48<24:16,  5.18s/it]

{'eps': 0, 'objective/kl': 71.35497283935547, 'objective/entropy': 59.114139556884766, 'objective/non_score_reward': -3.5677490234375, 'objective/rlhf_reward': -5.951716423034668, 'objective/scores': -2.383967399597168, 'policy/approxkl_avg': 0.00443117693066597, 'policy/clipfrac_avg': 0.04599056765437126, 'loss/policy_avg': -0.023681093007326126, 'loss/value_avg': 0.5145844221115112, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1638031005859375, 'val/ratio': 0.9927983283996582, 'val/ratio_var': 4.2955452954629436e-05, 'val/num_eos_tokens': 0, 'lr': 6.908378245957864e-06, 'episode': 7040, 'epoch': 0.86}



                                                                         
 86%|████████▋ | 1761/2041 [2:32:54<24:16,  5.20s/it]

{'eps': 0, 'objective/kl': 86.41691589355469, 'objective/entropy': 79.89427185058594, 'objective/non_score_reward': -4.320845603942871, 'objective/rlhf_reward': -6.333122730255127, 'objective/scores': -2.012277126312256, 'policy/approxkl_avg': 0.002488605445250869, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.025366339832544327, 'loss/value_avg': 0.39861562848091125, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.623579740524292, 'val/ratio': 0.9929965734481812, 'val/ratio_var': 3.5952409234596416e-05, 'val/num_eos_tokens': 0, 'lr': 6.883880450759432e-06, 'episode': 7044, 'epoch': 0.86}



                                                                         
 86%|████████▋ | 1762/2041 [2:32:59<24:06,  5.18s/it]

{'eps': 0, 'objective/kl': 86.65327453613281, 'objective/entropy': 66.96173095703125, 'objective/non_score_reward': -4.332663536071777, 'objective/rlhf_reward': -7.764585971832275, 'objective/scores': -3.431922435760498, 'policy/approxkl_avg': 0.022687511518597603, 'policy/clipfrac_avg': 0.041273582726716995, 'loss/policy_avg': -0.029625477269291878, 'loss/value_avg': 0.7546877861022949, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2882684469223022, 'val/ratio': 0.991525411605835, 'val/ratio_var': 3.5946344723924994e-05, 'val/num_eos_tokens': 0, 'lr': 6.859382655561e-06, 'episode': 7048, 'epoch': 0.86}



                                                                         
 86%|████████▋ | 1763/2041 [2:33:04<23:57,  5.17s/it]

{'eps': 0, 'objective/kl': 74.9046630859375, 'objective/entropy': 63.32154846191406, 'objective/non_score_reward': -3.7452330589294434, 'objective/rlhf_reward': -5.939374923706055, 'objective/scores': -2.1941418647766113, 'policy/approxkl_avg': 0.006974162068217993, 'policy/clipfrac_avg': 0.04245282709598541, 'loss/policy_avg': -0.027518033981323242, 'loss/value_avg': 0.562347412109375, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.313167929649353, 'val/ratio': 0.984737753868103, 'val/ratio_var': 0.00018238746270071715, 'val/num_eos_tokens': 0, 'lr': 6.834884860362567e-06, 'episode': 7052, 'epoch': 0.86}



                                                                         
 86%|████████▋ | 1764/2041 [2:33:09<23:58,  5.19s/it]

{'eps': 0, 'objective/kl': 88.98629760742188, 'objective/entropy': 87.74613189697266, 'objective/non_score_reward': -4.449314594268799, 'objective/rlhf_reward': -6.389529228210449, 'objective/scores': -1.94021475315094, 'policy/approxkl_avg': 0.006065274588763714, 'policy/clipfrac_avg': 0.0554245263338089, 'loss/policy_avg': -0.030672185122966766, 'loss/value_avg': 0.5340452194213867, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4814294576644897, 'val/ratio': 1.004166603088379, 'val/ratio_var': 2.6662737582228146e-05, 'val/num_eos_tokens': 0, 'lr': 6.8103870651641355e-06, 'episode': 7056, 'epoch': 0.86}



                                                                         
 86%|████████▋ | 1765/2041 [2:33:14<23:47,  5.17s/it]

{'eps': 0, 'objective/kl': 85.55073547363281, 'objective/entropy': 69.21823120117188, 'objective/non_score_reward': -4.277536869049072, 'objective/rlhf_reward': -6.4275922775268555, 'objective/scores': -2.150055408477783, 'policy/approxkl_avg': 0.005917021539062262, 'policy/clipfrac_avg': 0.04716981202363968, 'loss/policy_avg': -0.030260536819696426, 'loss/value_avg': 0.8830344080924988, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3191494941711426, 'val/ratio': 0.9903315305709839, 'val/ratio_var': 6.383770232787356e-05, 'val/num_eos_tokens': 0, 'lr': 6.785889269965704e-06, 'episode': 7060, 'epoch': 0.86}



                                                                         
 87%|████████▋ | 1766/2041 [2:33:19<23:37,  5.15s/it]

{'eps': 0, 'objective/kl': 62.318077087402344, 'objective/entropy': 55.70460510253906, 'objective/non_score_reward': -3.115903854370117, 'objective/rlhf_reward': -4.651243686676025, 'objective/scores': -1.5353397130966187, 'policy/approxkl_avg': 0.006893530488014221, 'policy/clipfrac_avg': 0.0554245300590992, 'loss/policy_avg': -0.031192004680633545, 'loss/value_avg': 0.503106415271759, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0845630168914795, 'val/ratio': 0.9996452331542969, 'val/ratio_var': 1.8740483653800766e-07, 'val/num_eos_tokens': 0, 'lr': 6.761391474767272e-06, 'episode': 7064, 'epoch': 0.87}



                                                                         
 87%|████████▋ | 1767/2041 [2:33:25<23:35,  5.17s/it]

{'eps': 0, 'objective/kl': 65.98255920410156, 'objective/entropy': 60.658714294433594, 'objective/non_score_reward': -3.2991280555725098, 'objective/rlhf_reward': -6.241260528564453, 'objective/scores': -2.9421324729919434, 'policy/approxkl_avg': 0.002384110353887081, 'policy/clipfrac_avg': 0.021226413547992706, 'loss/policy_avg': -0.020712582394480705, 'loss/value_avg': 0.3695001006126404, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.211256980895996, 'val/ratio': 0.9978426098823547, 'val/ratio_var': 2.6926948066829937e-06, 'val/num_eos_tokens': 0, 'lr': 6.73689367956884e-06, 'episode': 7068, 'epoch': 0.87}



                                                                         
 87%|████████▋ | 1768/2041 [2:33:30<23:27,  5.16s/it]

{'eps': 0, 'objective/kl': 82.10578155517578, 'objective/entropy': 79.41552734375, 'objective/non_score_reward': -4.105288982391357, 'objective/rlhf_reward': -6.170585632324219, 'objective/scores': -2.0652968883514404, 'policy/approxkl_avg': 0.006920022424310446, 'policy/clipfrac_avg': 0.06721697747707367, 'loss/policy_avg': -0.034118641167879105, 'loss/value_avg': 0.5895577073097229, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4749878644943237, 'val/ratio': 0.9949409365653992, 'val/ratio_var': 1.5015689314168412e-05, 'val/num_eos_tokens': 0, 'lr': 6.712395884370407e-06, 'episode': 7072, 'epoch': 0.87}



                                                                         
 87%|████████▋ | 1769/2041 [2:33:35<23:20,  5.15s/it]

{'eps': 0, 'objective/kl': 69.36186218261719, 'objective/entropy': 40.30292510986328, 'objective/non_score_reward': -3.4680933952331543, 'objective/rlhf_reward': -6.395777702331543, 'objective/scores': -2.9276845455169678, 'policy/approxkl_avg': 0.0044670719653368, 'policy/clipfrac_avg': 0.03891509398818016, 'loss/policy_avg': -0.02602245658636093, 'loss/value_avg': 0.4074668288230896, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9050078392028809, 'val/ratio': 0.9939334392547607, 'val/ratio_var': 3.3633652492426336e-05, 'val/num_eos_tokens': 0, 'lr': 6.687898089171975e-06, 'episode': 7076, 'epoch': 0.87}



                                                                         
 87%|████████▋ | 1770/2041 [2:33:40<23:21,  5.17s/it]

{'eps': 0, 'objective/kl': 71.63174438476562, 'objective/entropy': 65.62136840820312, 'objective/non_score_reward': -3.581587791442871, 'objective/rlhf_reward': -5.291441440582275, 'objective/scores': -1.7098535299301147, 'policy/approxkl_avg': 0.007260075770318508, 'policy/clipfrac_avg': 0.07193396985530853, 'loss/policy_avg': -0.03625166788697243, 'loss/value_avg': 0.4070555567741394, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2476904392242432, 'val/ratio': 1.0058469772338867, 'val/ratio_var': 4.063239612150937e-05, 'val/num_eos_tokens': 0, 'lr': 6.663400293973543e-06, 'episode': 7080, 'epoch': 0.87}



                                                                         
 87%|████████▋ | 1771/2041 [2:33:45<23:10,  5.15s/it]

{'eps': 0, 'objective/kl': 76.77946472167969, 'objective/entropy': 92.64483642578125, 'objective/non_score_reward': -3.8389735221862793, 'objective/rlhf_reward': -6.032017707824707, 'objective/scores': -2.1930439472198486, 'policy/approxkl_avg': 0.003927519544959068, 'policy/clipfrac_avg': 0.04599056392908096, 'loss/policy_avg': -0.03063134476542473, 'loss/value_avg': 0.3587523102760315, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5154907703399658, 'val/ratio': 1.0014965534210205, 'val/ratio_var': 6.861922884127125e-06, 'val/num_eos_tokens': 0, 'lr': 6.638902498775111e-06, 'episode': 7084, 'epoch': 0.87}



                                                                         
 87%|████████▋ | 1772/2041 [2:33:50<23:05,  5.15s/it]

{'eps': 0, 'objective/kl': 87.96322631835938, 'objective/entropy': 95.33560180664062, 'objective/non_score_reward': -4.398161888122559, 'objective/rlhf_reward': -7.313680648803711, 'objective/scores': -2.9155189990997314, 'policy/approxkl_avg': 0.11270427703857422, 'policy/clipfrac_avg': 0.053066037595272064, 'loss/policy_avg': -0.03428742289543152, 'loss/value_avg': 0.5032706260681152, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.686229944229126, 'val/ratio': 0.9995958209037781, 'val/ratio_var': 7.3762566898949444e-06, 'val/num_eos_tokens': 0, 'lr': 6.614404703576678e-06, 'episode': 7088, 'epoch': 0.87}



                                                                         
 87%|████████▋ | 1773/2041 [2:33:55<23:04,  5.16s/it]

{'eps': 0, 'objective/kl': 81.63034057617188, 'objective/entropy': 105.1711654663086, 'objective/non_score_reward': -4.081517219543457, 'objective/rlhf_reward': -6.742886066436768, 'objective/scores': -2.6613688468933105, 'policy/approxkl_avg': 0.005551179870963097, 'policy/clipfrac_avg': 0.04363207519054413, 'loss/policy_avg': -0.030240369960665703, 'loss/value_avg': 0.6830577254295349, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.811513900756836, 'val/ratio': 0.9895893335342407, 'val/ratio_var': 8.206158963730559e-05, 'val/num_eos_tokens': 0, 'lr': 6.589906908378246e-06, 'episode': 7092, 'epoch': 0.87}



                                                                         
 87%|████████▋ | 1774/2041 [2:34:01<23:02,  5.18s/it]

{'eps': 0, 'objective/kl': 86.31236267089844, 'objective/entropy': 72.66612243652344, 'objective/non_score_reward': -4.315618515014648, 'objective/rlhf_reward': -6.932450294494629, 'objective/scores': -2.6168315410614014, 'policy/approxkl_avg': 0.007113198284059763, 'policy/clipfrac_avg': 0.04363207519054413, 'loss/policy_avg': -0.027512511238455772, 'loss/value_avg': 0.6054840683937073, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3519740104675293, 'val/ratio': 0.9876290559768677, 'val/ratio_var': 0.00010557272617006674, 'val/num_eos_tokens': 0, 'lr': 6.565409113179814e-06, 'episode': 7096, 'epoch': 0.87}



                                                                         
 87%|████████▋ | 1775/2041 [2:34:06<22:46,  5.14s/it]

{'eps': 0, 'objective/kl': 77.22801208496094, 'objective/entropy': 81.91462707519531, 'objective/non_score_reward': -3.861400604248047, 'objective/rlhf_reward': -5.7186970710754395, 'objective/scores': -1.8572964668273926, 'policy/approxkl_avg': 0.004087534267455339, 'policy/clipfrac_avg': 0.04363207519054413, 'loss/policy_avg': -0.028682148084044456, 'loss/value_avg': 0.42131251096725464, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5877678394317627, 'val/ratio': 0.9952574968338013, 'val/ratio_var': 1.30706293930416e-05, 'val/num_eos_tokens': 0, 'lr': 6.540911317981382e-06, 'episode': 7100, 'epoch': 0.87}



                                                                         
 87%|████████▋ | 1776/2041 [2:34:11<22:47,  5.16s/it]

{'eps': 0, 'objective/kl': 56.22226333618164, 'objective/entropy': 51.35940170288086, 'objective/non_score_reward': -2.8111133575439453, 'objective/rlhf_reward': -5.434156894683838, 'objective/scores': -2.6230435371398926, 'policy/approxkl_avg': 0.0027394157368689775, 'policy/clipfrac_avg': 0.02004716917872429, 'loss/policy_avg': -0.013790919445455074, 'loss/value_avg': 0.3974534869194031, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9748343229293823, 'val/ratio': 0.9895942211151123, 'val/ratio_var': 9.46524742175825e-05, 'val/num_eos_tokens': 0, 'lr': 6.5164135227829495e-06, 'episode': 7104, 'epoch': 0.87}



                                                                         
 87%|████████▋ | 1777/2041 [2:34:16<22:36,  5.14s/it]

{'eps': 0, 'objective/kl': 70.29216003417969, 'objective/entropy': 81.4540786743164, 'objective/non_score_reward': -3.514608383178711, 'objective/rlhf_reward': -6.7885236740112305, 'objective/scores': -3.2739150524139404, 'policy/approxkl_avg': 0.004348739515990019, 'policy/clipfrac_avg': 0.03891509771347046, 'loss/policy_avg': -0.02784663811326027, 'loss/value_avg': 0.6028186678886414, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4809895753860474, 'val/ratio': 0.9907337427139282, 'val/ratio_var': 7.161812391132116e-05, 'val/num_eos_tokens': 0, 'lr': 6.4919157275845176e-06, 'episode': 7108, 'epoch': 0.87}



                                                                         
 87%|████████▋ | 1778/2041 [2:34:21<22:35,  5.16s/it]

{'eps': 0, 'objective/kl': 70.16583251953125, 'objective/entropy': 46.100975036621094, 'objective/non_score_reward': -3.508291482925415, 'objective/rlhf_reward': -5.172675132751465, 'objective/scores': -1.6643837690353394, 'policy/approxkl_avg': 0.00293493689969182, 'policy/clipfrac_avg': 0.033018868416547775, 'loss/policy_avg': -0.021846625953912735, 'loss/value_avg': 0.4770698547363281, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9961577653884888, 'val/ratio': 0.9909388422966003, 'val/ratio_var': 6.984464562265202e-05, 'val/num_eos_tokens': 0, 'lr': 6.467417932386086e-06, 'episode': 7112, 'epoch': 0.87}



                                                                         
 87%|████████▋ | 1779/2041 [2:34:26<22:37,  5.18s/it]

{'eps': 0, 'objective/kl': 77.83856201171875, 'objective/entropy': 81.22247314453125, 'objective/non_score_reward': -3.891928195953369, 'objective/rlhf_reward': -6.370765209197998, 'objective/scores': -2.478837013244629, 'policy/approxkl_avg': 0.0038515347987413406, 'policy/clipfrac_avg': 0.04009433835744858, 'loss/policy_avg': -0.029289821162819862, 'loss/value_avg': 0.5539724230766296, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5331223011016846, 'val/ratio': 0.9833381175994873, 'val/ratio_var': 0.00023780101037118584, 'val/num_eos_tokens': 0, 'lr': 6.442920137187654e-06, 'episode': 7116, 'epoch': 0.87}



                                                                         
 87%|████████▋ | 1780/2041 [2:34:32<22:25,  5.15s/it]

{'eps': 0, 'objective/kl': 80.8924560546875, 'objective/entropy': 64.39669799804688, 'objective/non_score_reward': -4.044622421264648, 'objective/rlhf_reward': -7.166003227233887, 'objective/scores': -3.1213808059692383, 'policy/approxkl_avg': 0.004998127464205027, 'policy/clipfrac_avg': 0.04363207519054413, 'loss/policy_avg': -0.02679821103811264, 'loss/value_avg': 0.6183255314826965, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2234513759613037, 'val/ratio': 0.9932308197021484, 'val/ratio_var': 3.356397428433411e-05, 'val/num_eos_tokens': 0, 'lr': 6.418422341989221e-06, 'episode': 7120, 'epoch': 0.87}



                                                                         
 87%|████████▋ | 1781/2041 [2:34:37<22:23,  5.17s/it]

{'eps': 0, 'objective/kl': 71.17508697509766, 'objective/entropy': 56.749061584472656, 'objective/non_score_reward': -3.5587544441223145, 'objective/rlhf_reward': -5.153703689575195, 'objective/scores': -1.59494948387146, 'policy/approxkl_avg': 0.011198881082236767, 'policy/clipfrac_avg': 0.029481131583452225, 'loss/policy_avg': -0.02212047390639782, 'loss/value_avg': 0.5258349776268005, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1539459228515625, 'val/ratio': 0.9966485500335693, 'val/ratio_var': 5.500622592080617e-06, 'val/num_eos_tokens': 0, 'lr': 6.393924546790789e-06, 'episode': 7124, 'epoch': 0.87}



                                                                         
 87%|████████▋ | 1782/2041 [2:34:42<22:12,  5.15s/it]

{'eps': 0, 'objective/kl': 71.76775360107422, 'objective/entropy': 66.00439453125, 'objective/non_score_reward': -3.5883874893188477, 'objective/rlhf_reward': -5.966723442077637, 'objective/scores': -2.378335952758789, 'policy/approxkl_avg': 0.004499378614127636, 'policy/clipfrac_avg': 0.04245282709598541, 'loss/policy_avg': -0.030091892927885056, 'loss/value_avg': 0.37434151768684387, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.150926113128662, 'val/ratio': 0.9930670261383057, 'val/ratio_var': 3.659074354800396e-05, 'val/num_eos_tokens': 0, 'lr': 6.369426751592357e-06, 'episode': 7128, 'epoch': 0.87}



                                                                         
 87%|████████▋ | 1783/2041 [2:34:47<22:12,  5.16s/it]

{'eps': 0, 'objective/kl': 83.03173828125, 'objective/entropy': 62.73799514770508, 'objective/non_score_reward': -4.151587009429932, 'objective/rlhf_reward': -5.648764133453369, 'objective/scores': -1.497177004814148, 'policy/approxkl_avg': 0.005875994451344013, 'policy/clipfrac_avg': 0.04599056765437126, 'loss/policy_avg': -0.028266215696930885, 'loss/value_avg': 0.5745137929916382, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.107392430305481, 'val/ratio': 0.9823681116104126, 'val/ratio_var': 0.0002448501472827047, 'val/num_eos_tokens': 0, 'lr': 6.344928956393925e-06, 'episode': 7132, 'epoch': 0.87}



                                                                         
 87%|████████▋ | 1784/2041 [2:34:52<22:11,  5.18s/it]

{'eps': 0, 'objective/kl': 77.8389663696289, 'objective/entropy': 69.30180358886719, 'objective/non_score_reward': -3.8919482231140137, 'objective/rlhf_reward': -5.722816467285156, 'objective/scores': -1.8308683633804321, 'policy/approxkl_avg': 0.004020827356725931, 'policy/clipfrac_avg': 0.03066037781536579, 'loss/policy_avg': -0.02301696315407753, 'loss/value_avg': 0.555933952331543, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2191720008850098, 'val/ratio': 0.989226222038269, 'val/ratio_var': 8.95258563105017e-05, 'val/num_eos_tokens': 0, 'lr': 6.320431161195493e-06, 'episode': 7136, 'epoch': 0.87}



                                                                         
 87%|████████▋ | 1785/2041 [2:34:57<22:06,  5.18s/it]

{'eps': 0, 'objective/kl': 84.3685073852539, 'objective/entropy': 69.58175659179688, 'objective/non_score_reward': -4.218425750732422, 'objective/rlhf_reward': -6.65478515625, 'objective/scores': -2.436359405517578, 'policy/approxkl_avg': 0.21043500304222107, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.026159042492508888, 'loss/value_avg': 0.5294893980026245, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2300479412078857, 'val/ratio': 0.9889653921127319, 'val/ratio_var': 7.269696652656421e-05, 'val/num_eos_tokens': 0, 'lr': 6.29593336599706e-06, 'episode': 7140, 'epoch': 0.87}



                                                                         
 88%|████████▊ | 1786/2041 [2:35:03<22:04,  5.20s/it]

{'eps': 0, 'objective/kl': 71.53097534179688, 'objective/entropy': 54.27353286743164, 'objective/non_score_reward': -3.5765490531921387, 'objective/rlhf_reward': -5.001643657684326, 'objective/scores': -1.4250946044921875, 'policy/approxkl_avg': 0.002803608775138855, 'policy/clipfrac_avg': 0.030660375952720642, 'loss/policy_avg': -0.02242422103881836, 'loss/value_avg': 0.47666358947753906, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9800169467926025, 'val/ratio': 0.9919266700744629, 'val/ratio_var': 5.340705320122652e-05, 'val/num_eos_tokens': 0, 'lr': 6.271435570798628e-06, 'episode': 7144, 'epoch': 0.88}



                                                                         
 88%|████████▊ | 1787/2041 [2:35:08<21:49,  5.16s/it]

{'eps': 0, 'objective/kl': 64.24502563476562, 'objective/entropy': 68.12646484375, 'objective/non_score_reward': -3.2122511863708496, 'objective/rlhf_reward': -5.940716743469238, 'objective/scores': -2.7284653186798096, 'policy/approxkl_avg': 0.0035215469542890787, 'policy/clipfrac_avg': 0.03773584961891174, 'loss/policy_avg': -0.024324806407094002, 'loss/value_avg': 0.5329292416572571, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3625719547271729, 'val/ratio': 0.990887463092804, 'val/ratio_var': 6.867416232125834e-05, 'val/num_eos_tokens': 0, 'lr': 6.246937775600196e-06, 'episode': 7148, 'epoch': 0.88}



                                                                         
 88%|████████▊ | 1788/2041 [2:35:13<21:43,  5.15s/it]

{'eps': 0, 'objective/kl': 61.70362091064453, 'objective/entropy': 55.69799041748047, 'objective/non_score_reward': -3.08518123626709, 'objective/rlhf_reward': -6.024035930633545, 'objective/scores': -2.938854694366455, 'policy/approxkl_avg': 0.0018181424820795655, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.023385770618915558, 'loss/value_avg': 0.3862189054489136, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0819681882858276, 'val/ratio': 0.9945166707038879, 'val/ratio_var': 2.5052189812413417e-05, 'val/num_eos_tokens': 0, 'lr': 6.222439980401764e-06, 'episode': 7152, 'epoch': 0.88}



                                                                         
 88%|████████▊ | 1789/2041 [2:35:18<21:31,  5.12s/it]

{'eps': 0, 'objective/kl': 71.32749938964844, 'objective/entropy': 63.69207763671875, 'objective/non_score_reward': -3.5663747787475586, 'objective/rlhf_reward': -5.750638484954834, 'objective/scores': -2.1842637062072754, 'policy/approxkl_avg': 0.0039003014098852873, 'policy/clipfrac_avg': 0.02712264098227024, 'loss/policy_avg': -0.022080205380916595, 'loss/value_avg': 0.437470018863678, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1234424114227295, 'val/ratio': 0.9915704727172852, 'val/ratio_var': 6.361569830914959e-05, 'val/num_eos_tokens': 0, 'lr': 6.197942185203332e-06, 'episode': 7156, 'epoch': 0.88}



                                                                         
 88%|████████▊ | 1790/2041 [2:35:23<21:24,  5.12s/it]

{'eps': 0, 'objective/kl': 71.82351684570312, 'objective/entropy': 65.2080307006836, 'objective/non_score_reward': -3.5911760330200195, 'objective/rlhf_reward': -6.326147556304932, 'objective/scores': -2.734971523284912, 'policy/approxkl_avg': 0.003714859252795577, 'policy/clipfrac_avg': 0.044811323285102844, 'loss/policy_avg': -0.028155013918876648, 'loss/value_avg': 0.5310969948768616, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.143669843673706, 'val/ratio': 0.9897326827049255, 'val/ratio_var': 8.435829658992589e-05, 'val/num_eos_tokens': 0, 'lr': 6.1734443900049e-06, 'episode': 7160, 'epoch': 0.88}



                                                                         
 88%|████████▊ | 1791/2041 [2:35:28<21:16,  5.10s/it]

{'eps': 0, 'objective/kl': 66.00639343261719, 'objective/entropy': 67.35356140136719, 'objective/non_score_reward': -3.3003196716308594, 'objective/rlhf_reward': -5.920956611633301, 'objective/scores': -2.6206369400024414, 'policy/approxkl_avg': 0.014739558100700378, 'policy/clipfrac_avg': 0.04009433835744858, 'loss/policy_avg': -0.02610674686729908, 'loss/value_avg': 0.5237978100776672, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2038284540176392, 'val/ratio': 0.9859688878059387, 'val/ratio_var': 0.00013371650129556656, 'val/num_eos_tokens': 0, 'lr': 6.148946594806468e-06, 'episode': 7164, 'epoch': 0.88}



                                                                         
 88%|████████▊ | 1792/2041 [2:35:33<21:19,  5.14s/it]

{'eps': 0, 'objective/kl': 62.298126220703125, 'objective/entropy': 47.89241409301758, 'objective/non_score_reward': -3.1149067878723145, 'objective/rlhf_reward': -5.119042873382568, 'objective/scores': -2.004136085510254, 'policy/approxkl_avg': 0.0032541612163186073, 'policy/clipfrac_avg': 0.04009433835744858, 'loss/policy_avg': -0.023457283154129982, 'loss/value_avg': 0.6519656777381897, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0009007453918457, 'val/ratio': 0.9980385303497314, 'val/ratio_var': 4.985200121154776e-06, 'val/num_eos_tokens': 0, 'lr': 6.124448799608036e-06, 'episode': 7168, 'epoch': 0.88}



                                                                         
 88%|████████▊ | 1793/2041 [2:35:39<21:22,  5.17s/it]

{'eps': 0, 'objective/kl': 66.20999908447266, 'objective/entropy': 42.64598846435547, 'objective/non_score_reward': -3.310500144958496, 'objective/rlhf_reward': -5.228771209716797, 'objective/scores': -1.9182708263397217, 'policy/approxkl_avg': 0.0027143852785229683, 'policy/clipfrac_avg': 0.03655660152435303, 'loss/policy_avg': -0.024780016392469406, 'loss/value_avg': 0.48416581749916077, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8350421190261841, 'val/ratio': 0.9985339641571045, 'val/ratio_var': 3.159128709739889e-06, 'val/num_eos_tokens': 0, 'lr': 6.099951004409604e-06, 'episode': 7172, 'epoch': 0.88}



                                                                         
 88%|████████▊ | 1794/2041 [2:35:44<21:14,  5.16s/it]

{'eps': 0, 'objective/kl': 78.43632507324219, 'objective/entropy': 67.24198150634766, 'objective/non_score_reward': -3.921816349029541, 'objective/rlhf_reward': -5.768375873565674, 'objective/scores': -1.8465595245361328, 'policy/approxkl_avg': 0.014759549871087074, 'policy/clipfrac_avg': 0.03773584961891174, 'loss/policy_avg': -0.023745331913232803, 'loss/value_avg': 0.5285482406616211, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2565065622329712, 'val/ratio': 0.9978179335594177, 'val/ratio_var': 3.5669484077516245e-06, 'val/num_eos_tokens': 0, 'lr': 6.075453209211171e-06, 'episode': 7176, 'epoch': 0.88}



                                                                         
 88%|████████▊ | 1795/2041 [2:35:49<21:10,  5.17s/it]

{'eps': 0, 'objective/kl': 85.37373352050781, 'objective/entropy': 74.66815185546875, 'objective/non_score_reward': -4.268686771392822, 'objective/rlhf_reward': -6.818888187408447, 'objective/scores': -2.550201416015625, 'policy/approxkl_avg': 0.00684104161337018, 'policy/clipfrac_avg': 0.05424528568983078, 'loss/policy_avg': -0.030379189178347588, 'loss/value_avg': 0.5921352505683899, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.289318323135376, 'val/ratio': 0.9992703199386597, 'val/ratio_var': 1.215621978190029e-06, 'val/num_eos_tokens': 0, 'lr': 6.050955414012739e-06, 'episode': 7180, 'epoch': 0.88}



                                                                         
 88%|████████▊ | 1796/2041 [2:35:54<21:03,  5.16s/it]

{'eps': 0, 'objective/kl': 63.03785705566406, 'objective/entropy': 71.34080505371094, 'objective/non_score_reward': -3.151893138885498, 'objective/rlhf_reward': -6.17181921005249, 'objective/scores': -3.019926071166992, 'policy/approxkl_avg': 0.0048080929554998875, 'policy/clipfrac_avg': 0.04009433835744858, 'loss/policy_avg': -0.027340607717633247, 'loss/value_avg': 0.41711854934692383, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2588536739349365, 'val/ratio': 0.9974226355552673, 'val/ratio_var': 5.810393759020371e-06, 'val/num_eos_tokens': 0, 'lr': 6.026457618814307e-06, 'episode': 7184, 'epoch': 0.88}



                                                                         
 88%|████████▊ | 1797/2041 [2:35:59<20:58,  5.16s/it]

{'eps': 0, 'objective/kl': 75.10206604003906, 'objective/entropy': 42.8945426940918, 'objective/non_score_reward': -3.75510311126709, 'objective/rlhf_reward': -6.32554817199707, 'objective/scores': -2.5704450607299805, 'policy/approxkl_avg': 0.002550840377807617, 'policy/clipfrac_avg': 0.025943394750356674, 'loss/policy_avg': -0.01952541247010231, 'loss/value_avg': 0.46061205863952637, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.947640597820282, 'val/ratio': 0.9906755685806274, 'val/ratio_var': 7.605246355524287e-05, 'val/num_eos_tokens': 0, 'lr': 6.001959823615875e-06, 'episode': 7188, 'epoch': 0.88}



                                                                         
 88%|████████▊ | 1798/2041 [2:36:04<20:59,  5.18s/it]

{'eps': 0, 'objective/kl': 82.61334228515625, 'objective/entropy': 63.390316009521484, 'objective/non_score_reward': -4.130667686462402, 'objective/rlhf_reward': -5.459698677062988, 'objective/scores': -1.329030990600586, 'policy/approxkl_avg': 0.004594971891492605, 'policy/clipfrac_avg': 0.037735845893621445, 'loss/policy_avg': -0.024030879139900208, 'loss/value_avg': 0.6155619621276855, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1439907550811768, 'val/ratio': 0.9859532713890076, 'val/ratio_var': 0.00017343777290079743, 'val/num_eos_tokens': 0, 'lr': 5.977462028417443e-06, 'episode': 7192, 'epoch': 0.88}



                                                                         
 88%|████████▊ | 1799/2041 [2:36:10<20:54,  5.19s/it]

{'eps': 0, 'objective/kl': 79.97743225097656, 'objective/entropy': 65.62715911865234, 'objective/non_score_reward': -3.9988715648651123, 'objective/rlhf_reward': -6.459741592407227, 'objective/scores': -2.4608702659606934, 'policy/approxkl_avg': 0.0034280987456440926, 'policy/clipfrac_avg': 0.041273586452007294, 'loss/policy_avg': -0.02825649082660675, 'loss/value_avg': 0.5004684329032898, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2204337120056152, 'val/ratio': 0.9834703207015991, 'val/ratio_var': 0.00024383484560530633, 'val/num_eos_tokens': 0, 'lr': 5.95296423321901e-06, 'episode': 7196, 'epoch': 0.88}



                                                                         
 88%|████████▊ | 1800/2041 [2:36:15<20:46,  5.17s/it]

{'eps': 0, 'objective/kl': 77.3201904296875, 'objective/entropy': 89.97988891601562, 'objective/non_score_reward': -3.866009473800659, 'objective/rlhf_reward': -6.111784934997559, 'objective/scores': -2.2457756996154785, 'policy/approxkl_avg': 0.005254472140222788, 'policy/clipfrac_avg': 0.02712264284491539, 'loss/policy_avg': -0.023853007704019547, 'loss/value_avg': 0.5891621708869934, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4440536499023438, 'val/ratio': 0.9859644770622253, 'val/ratio_var': 0.00016727979527786374, 'val/num_eos_tokens': 0, 'lr': 5.928466438020578e-06, 'episode': 7200, 'epoch': 0.88}



                                                                         
 88%|████████▊ | 1801/2041 [2:36:20<21:03,  5.26s/it]

{'eps': 0, 'objective/kl': 66.24937438964844, 'objective/entropy': 52.068511962890625, 'objective/non_score_reward': -3.3124685287475586, 'objective/rlhf_reward': -5.748650550842285, 'objective/scores': -2.4361820220947266, 'policy/approxkl_avg': 0.009186381474137306, 'policy/clipfrac_avg': 0.03183962404727936, 'loss/policy_avg': -0.019845137372612953, 'loss/value_avg': 0.5252434611320496, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1410425901412964, 'val/ratio': 0.9894404411315918, 'val/ratio_var': 8.279734902316704e-05, 'val/num_eos_tokens': 0, 'lr': 5.903968642822146e-06, 'episode': 7204, 'epoch': 0.88}



                                                                         
 88%|████████▊ | 1802/2041 [2:36:25<20:48,  5.22s/it]

{'eps': 0, 'objective/kl': 66.52149963378906, 'objective/entropy': 69.70994567871094, 'objective/non_score_reward': -3.326075315475464, 'objective/rlhf_reward': -6.652477264404297, 'objective/scores': -3.326401710510254, 'policy/approxkl_avg': 0.0036360183730721474, 'policy/clipfrac_avg': 0.04009433835744858, 'loss/policy_avg': -0.024635106325149536, 'loss/value_avg': 0.5112735033035278, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3124017715454102, 'val/ratio': 0.9949315786361694, 'val/ratio_var': 1.8832783098332584e-05, 'val/num_eos_tokens': 0, 'lr': 5.879470847623714e-06, 'episode': 7208, 'epoch': 0.88}



                                                                         
 88%|████████▊ | 1803/2041 [2:36:30<20:37,  5.20s/it]

{'eps': 0, 'objective/kl': 66.73637390136719, 'objective/entropy': 61.131980895996094, 'objective/non_score_reward': -3.3368189334869385, 'objective/rlhf_reward': -4.702216148376465, 'objective/scores': -1.3653974533081055, 'policy/approxkl_avg': 0.0020584771409630775, 'policy/clipfrac_avg': 0.024764150381088257, 'loss/policy_avg': -0.02072317712008953, 'loss/value_avg': 0.40831026434898376, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1020551919937134, 'val/ratio': 0.9899519085884094, 'val/ratio_var': 9.175831655738875e-05, 'val/num_eos_tokens': 0, 'lr': 5.854973052425282e-06, 'episode': 7212, 'epoch': 0.88}



                                                                         
 88%|████████▊ | 1804/2041 [2:36:36<20:23,  5.16s/it]

{'eps': 0, 'objective/kl': 65.16081237792969, 'objective/entropy': 50.3883056640625, 'objective/non_score_reward': -3.2580409049987793, 'objective/rlhf_reward': -5.437221527099609, 'objective/scores': -2.17918062210083, 'policy/approxkl_avg': 0.0018797415541484952, 'policy/clipfrac_avg': 0.02712264284491539, 'loss/policy_avg': -0.016990147531032562, 'loss/value_avg': 0.6019130945205688, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9210968613624573, 'val/ratio': 0.9955843687057495, 'val/ratio_var': 1.7116457456722856e-05, 'val/num_eos_tokens': 0, 'lr': 5.83047525722685e-06, 'episode': 7216, 'epoch': 0.88}



                                                                         
 88%|████████▊ | 1805/2041 [2:36:41<20:19,  5.17s/it]

{'eps': 0, 'objective/kl': 81.610107421875, 'objective/entropy': 58.83445739746094, 'objective/non_score_reward': -4.08050537109375, 'objective/rlhf_reward': -6.445725440979004, 'objective/scores': -2.365219831466675, 'policy/approxkl_avg': 0.004182829055935144, 'policy/clipfrac_avg': 0.05070754513144493, 'loss/policy_avg': -0.026124726980924606, 'loss/value_avg': 0.4926970601081848, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2792167663574219, 'val/ratio': 0.9971261024475098, 'val/ratio_var': 8.115715900203213e-06, 'val/num_eos_tokens': 0, 'lr': 5.805977462028418e-06, 'episode': 7220, 'epoch': 0.88}



                                                                         
 88%|████████▊ | 1806/2041 [2:36:46<20:09,  5.15s/it]

{'eps': 0, 'objective/kl': 71.02996826171875, 'objective/entropy': 60.30563735961914, 'objective/non_score_reward': -3.5514986515045166, 'objective/rlhf_reward': -6.239659309387207, 'objective/scores': -2.6881604194641113, 'policy/approxkl_avg': 0.003485865658149123, 'policy/clipfrac_avg': 0.03183962404727936, 'loss/policy_avg': -0.023109670728445053, 'loss/value_avg': 0.4825214445590973, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1793220043182373, 'val/ratio': 0.9986951351165771, 'val/ratio_var': 1.066566824192705e-06, 'val/num_eos_tokens': 0, 'lr': 5.781479666829986e-06, 'episode': 7224, 'epoch': 0.88}



                                                                         
 89%|████████▊ | 1807/2041 [2:36:51<20:07,  5.16s/it]

{'eps': 0, 'objective/kl': 84.71389770507812, 'objective/entropy': 75.96880340576172, 'objective/non_score_reward': -4.235694885253906, 'objective/rlhf_reward': -6.168129920959473, 'objective/scores': -1.9324347972869873, 'policy/approxkl_avg': 0.003652901854366064, 'policy/clipfrac_avg': 0.049528300762176514, 'loss/policy_avg': -0.026936890557408333, 'loss/value_avg': 0.5654020309448242, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2213069200515747, 'val/ratio': 0.998330295085907, 'val/ratio_var': 2.157059498131275e-06, 'val/num_eos_tokens': 0, 'lr': 5.756981871631553e-06, 'episode': 7228, 'epoch': 0.89}



                                                                         
 89%|████████▊ | 1808/2041 [2:36:56<20:06,  5.18s/it]

{'eps': 0, 'objective/kl': 77.07090759277344, 'objective/entropy': 53.284793853759766, 'objective/non_score_reward': -3.8535451889038086, 'objective/rlhf_reward': -6.077064514160156, 'objective/scores': -2.2235195636749268, 'policy/approxkl_avg': 0.005409941542893648, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.026612702757120132, 'loss/value_avg': 0.5121065378189087, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2777504920959473, 'val/ratio': 0.996320903301239, 'val/ratio_var': 8.55249982123496e-06, 'val/num_eos_tokens': 0, 'lr': 5.732484076433121e-06, 'episode': 7232, 'epoch': 0.89}



                                                                         
 89%|████████▊ | 1809/2041 [2:37:01<20:02,  5.18s/it]

{'eps': 0, 'objective/kl': 77.48997497558594, 'objective/entropy': 66.48817443847656, 'objective/non_score_reward': -3.8744986057281494, 'objective/rlhf_reward': -5.728337287902832, 'objective/scores': -1.8538389205932617, 'policy/approxkl_avg': 0.005467695649713278, 'policy/clipfrac_avg': 0.041273586452007294, 'loss/policy_avg': -0.02579631842672825, 'loss/value_avg': 0.4584093987941742, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2851016521453857, 'val/ratio': 0.9917641282081604, 'val/ratio_var': 5.472141128848307e-05, 'val/num_eos_tokens': 0, 'lr': 5.707986281234689e-06, 'episode': 7236, 'epoch': 0.89}



                                                                         
 89%|████████▊ | 1810/2041 [2:37:07<19:53,  5.17s/it]

{'eps': 0, 'objective/kl': 74.09715270996094, 'objective/entropy': 61.666839599609375, 'objective/non_score_reward': -3.70485782623291, 'objective/rlhf_reward': -5.555537223815918, 'objective/scores': -1.8506793975830078, 'policy/approxkl_avg': 0.0047691273503005505, 'policy/clipfrac_avg': 0.03183962404727936, 'loss/policy_avg': -0.022954873740673065, 'loss/value_avg': 0.5429979562759399, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2189717292785645, 'val/ratio': 0.9939041137695312, 'val/ratio_var': 2.7709202186088078e-05, 'val/num_eos_tokens': 0, 'lr': 5.683488486036257e-06, 'episode': 7240, 'epoch': 0.89}



                                                                         
 89%|████████▊ | 1811/2041 [2:37:12<19:44,  5.15s/it]

{'eps': 0, 'objective/kl': 82.76480102539062, 'objective/entropy': 67.62874603271484, 'objective/non_score_reward': -4.138240337371826, 'objective/rlhf_reward': -5.911529064178467, 'objective/scores': -1.7732888460159302, 'policy/approxkl_avg': 0.002718336181715131, 'policy/clipfrac_avg': 0.030660375952720642, 'loss/policy_avg': -0.022552412003278732, 'loss/value_avg': 0.4540102481842041, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2312300205230713, 'val/ratio': 0.9945195317268372, 'val/ratio_var': 2.8960343115613796e-05, 'val/num_eos_tokens': 0, 'lr': 5.658990690837824e-06, 'episode': 7244, 'epoch': 0.89}



                                                                         
 89%|████████▉ | 1812/2041 [2:37:17<19:40,  5.15s/it]

{'eps': 0, 'objective/kl': 66.58319091796875, 'objective/entropy': 60.44390869140625, 'objective/non_score_reward': -3.3291594982147217, 'objective/rlhf_reward': -5.538225173950195, 'objective/scores': -2.2090656757354736, 'policy/approxkl_avg': 0.003351117018610239, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.02380334958434105, 'loss/value_avg': 0.3645535707473755, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1841230392456055, 'val/ratio': 0.9912415742874146, 'val/ratio_var': 6.938698788871989e-05, 'val/num_eos_tokens': 0, 'lr': 5.634492895639393e-06, 'episode': 7248, 'epoch': 0.89}



                                                                         
 89%|████████▉ | 1813/2041 [2:37:22<19:34,  5.15s/it]

{'eps': 0, 'objective/kl': 84.84225463867188, 'objective/entropy': 69.04757690429688, 'objective/non_score_reward': -4.24211311340332, 'objective/rlhf_reward': -6.111547946929932, 'objective/scores': -1.8694348335266113, 'policy/approxkl_avg': 0.007804613560438156, 'policy/clipfrac_avg': 0.04716981202363968, 'loss/policy_avg': -0.02948240377008915, 'loss/value_avg': 0.7318705320358276, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2041698694229126, 'val/ratio': 0.9956401586532593, 'val/ratio_var': 1.0036034836957697e-05, 'val/num_eos_tokens': 0, 'lr': 5.60999510044096e-06, 'episode': 7252, 'epoch': 0.89}



                                                                         
 89%|████████▉ | 1814/2041 [2:37:27<19:29,  5.15s/it]

{'eps': 0, 'objective/kl': 63.549190521240234, 'objective/entropy': 48.76942443847656, 'objective/non_score_reward': -3.177459716796875, 'objective/rlhf_reward': -5.864147186279297, 'objective/scores': -2.686687707901001, 'policy/approxkl_avg': 0.002583687426522374, 'policy/clipfrac_avg': 0.024764150381088257, 'loss/policy_avg': -0.019152121618390083, 'loss/value_avg': 0.6259143352508545, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9900903105735779, 'val/ratio': 0.9956632852554321, 'val/ratio_var': 1.6636329746688716e-05, 'val/num_eos_tokens': 0, 'lr': 5.585497305242528e-06, 'episode': 7256, 'epoch': 0.89}



                                                                         
 89%|████████▉ | 1815/2041 [2:37:32<19:27,  5.17s/it]

{'eps': 0, 'objective/kl': 71.22502136230469, 'objective/entropy': 67.05750274658203, 'objective/non_score_reward': -3.561251163482666, 'objective/rlhf_reward': -5.9521989822387695, 'objective/scores': -2.3909475803375244, 'policy/approxkl_avg': 0.006021407898515463, 'policy/clipfrac_avg': 0.041273586452007294, 'loss/policy_avg': -0.024600498378276825, 'loss/value_avg': 0.612830400466919, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2050721645355225, 'val/ratio': 0.9891901612281799, 'val/ratio_var': 7.868010288802907e-05, 'val/num_eos_tokens': 0, 'lr': 5.5609995100440964e-06, 'episode': 7260, 'epoch': 0.89}



                                                                         
 89%|████████▉ | 1816/2041 [2:37:37<19:20,  5.16s/it]

{'eps': 0, 'objective/kl': 64.00198364257812, 'objective/entropy': 56.58416748046875, 'objective/non_score_reward': -3.200099229812622, 'objective/rlhf_reward': -6.057426452636719, 'objective/scores': -2.8573272228240967, 'policy/approxkl_avg': 0.002250795252621174, 'policy/clipfrac_avg': 0.021226413547992706, 'loss/policy_avg': -0.02169949561357498, 'loss/value_avg': 0.4373636841773987, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1468626260757446, 'val/ratio': 0.996167778968811, 'val/ratio_var': 1.4681769243907183e-05, 'val/num_eos_tokens': 0, 'lr': 5.5365017148456645e-06, 'episode': 7264, 'epoch': 0.89}



                                                                         
 89%|████████▉ | 1817/2041 [2:37:43<19:16,  5.16s/it]

{'eps': 0, 'objective/kl': 71.65907287597656, 'objective/entropy': 64.8066635131836, 'objective/non_score_reward': -3.582953453063965, 'objective/rlhf_reward': -5.839223384857178, 'objective/scores': -2.256269931793213, 'policy/approxkl_avg': 0.0027524735778570175, 'policy/clipfrac_avg': 0.03655660152435303, 'loss/policy_avg': -0.025648951530456543, 'loss/value_avg': 0.5411984920501709, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3372441530227661, 'val/ratio': 0.9952384829521179, 'val/ratio_var': 1.7781996575649828e-05, 'val/num_eos_tokens': 0, 'lr': 5.5120039196472325e-06, 'episode': 7268, 'epoch': 0.89}



                                                                         
 89%|████████▉ | 1818/2041 [2:37:48<19:06,  5.14s/it]

{'eps': 0, 'objective/kl': 68.1939926147461, 'objective/entropy': 54.79658508300781, 'objective/non_score_reward': -3.4096999168395996, 'objective/rlhf_reward': -6.267105579376221, 'objective/scores': -2.857405662536621, 'policy/approxkl_avg': 0.0034021965693682432, 'policy/clipfrac_avg': 0.025943396613001823, 'loss/policy_avg': -0.02159382589161396, 'loss/value_avg': 0.5118244886398315, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.067756175994873, 'val/ratio': 0.9886444211006165, 'val/ratio_var': 0.00011561671271920204, 'val/num_eos_tokens': 0, 'lr': 5.4875061244488e-06, 'episode': 7272, 'epoch': 0.89}



                                                                         
 89%|████████▉ | 1819/2041 [2:37:53<19:06,  5.17s/it]

{'eps': 0, 'objective/kl': 73.66925811767578, 'objective/entropy': 62.719730377197266, 'objective/non_score_reward': -3.6834630966186523, 'objective/rlhf_reward': -5.382654190063477, 'objective/scores': -1.6991910934448242, 'policy/approxkl_avg': 0.003969120793044567, 'policy/clipfrac_avg': 0.033018868416547775, 'loss/policy_avg': -0.024113990366458893, 'loss/value_avg': 0.43981730937957764, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2929902076721191, 'val/ratio': 0.9931468963623047, 'val/ratio_var': 3.485521301627159e-05, 'val/num_eos_tokens': 0, 'lr': 5.463008329250368e-06, 'episode': 7276, 'epoch': 0.89}



                                                                         
 89%|████████▉ | 1820/2041 [2:37:58<19:03,  5.17s/it]

{'eps': 0, 'objective/kl': 63.53655242919922, 'objective/entropy': 67.4393539428711, 'objective/non_score_reward': -3.176827907562256, 'objective/rlhf_reward': -5.002378940582275, 'objective/scores': -1.8255510330200195, 'policy/approxkl_avg': 0.009342310950160027, 'policy/clipfrac_avg': 0.04245283082127571, 'loss/policy_avg': -0.024901703000068665, 'loss/value_avg': 0.40824568271636963, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1591145992279053, 'val/ratio': 0.9915443658828735, 'val/ratio_var': 4.7835757868597284e-05, 'val/num_eos_tokens': 0, 'lr': 5.438510534051936e-06, 'episode': 7280, 'epoch': 0.89}



                                                                         
 89%|████████▉ | 1821/2041 [2:38:03<19:04,  5.20s/it]

{'eps': 0, 'objective/kl': 70.19656372070312, 'objective/entropy': 72.9930648803711, 'objective/non_score_reward': -3.5098280906677246, 'objective/rlhf_reward': -5.629373550415039, 'objective/scores': -2.1195452213287354, 'policy/approxkl_avg': 0.012586090713739395, 'policy/clipfrac_avg': 0.03537736088037491, 'loss/policy_avg': -0.024555087089538574, 'loss/value_avg': 0.49004924297332764, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3106129169464111, 'val/ratio': 0.9845700860023499, 'val/ratio_var': 0.00016320669965352863, 'val/num_eos_tokens': 0, 'lr': 5.414012738853504e-06, 'episode': 7284, 'epoch': 0.89}



                                                                         
 89%|████████▉ | 1822/2041 [2:38:09<18:53,  5.18s/it]

{'eps': 0, 'objective/kl': 73.18961334228516, 'objective/entropy': 67.31094360351562, 'objective/non_score_reward': -3.659480571746826, 'objective/rlhf_reward': -6.293759346008301, 'objective/scores': -2.6342790126800537, 'policy/approxkl_avg': 0.00493452837690711, 'policy/clipfrac_avg': 0.028301887214183807, 'loss/policy_avg': -0.021036595106124878, 'loss/value_avg': 0.594385027885437, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1752197742462158, 'val/ratio': 0.9914190769195557, 'val/ratio_var': 5.480716936290264e-05, 'val/num_eos_tokens': 0, 'lr': 5.389514943655071e-06, 'episode': 7288, 'epoch': 0.89}



                                                                         
 89%|████████▉ | 1823/2041 [2:38:14<18:43,  5.15s/it]

{'eps': 0, 'objective/kl': 67.90142822265625, 'objective/entropy': 41.41258239746094, 'objective/non_score_reward': -3.395071029663086, 'objective/rlhf_reward': -5.289450168609619, 'objective/scores': -1.8943792581558228, 'policy/approxkl_avg': 0.0034084157086908817, 'policy/clipfrac_avg': 0.037735845893621445, 'loss/policy_avg': -0.02269737422466278, 'loss/value_avg': 0.42972999811172485, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8700666427612305, 'val/ratio': 0.9939936399459839, 'val/ratio_var': 3.438938074395992e-05, 'val/num_eos_tokens': 0, 'lr': 5.365017148456639e-06, 'episode': 7292, 'epoch': 0.89}



                                                                         
 89%|████████▉ | 1824/2041 [2:38:19<18:36,  5.14s/it]

{'eps': 0, 'objective/kl': 71.02578735351562, 'objective/entropy': 52.09148025512695, 'objective/non_score_reward': -3.5512890815734863, 'objective/rlhf_reward': -5.731853485107422, 'objective/scores': -2.1805644035339355, 'policy/approxkl_avg': 0.006202678196132183, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.01928645931184292, 'loss/value_avg': 0.4988251328468323, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0289747714996338, 'val/ratio': 1.0088810920715332, 'val/ratio_var': 0.0001070380094461143, 'val/num_eos_tokens': 0, 'lr': 5.340519353258207e-06, 'episode': 7296, 'epoch': 0.89}



                                                                         
 89%|████████▉ | 1825/2041 [2:38:24<18:32,  5.15s/it]

{'eps': 0, 'objective/kl': 67.53617858886719, 'objective/entropy': 58.46257781982422, 'objective/non_score_reward': -3.3768091201782227, 'objective/rlhf_reward': -6.061316013336182, 'objective/scores': -2.684506893157959, 'policy/approxkl_avg': 0.028933601453900337, 'policy/clipfrac_avg': 0.041273586452007294, 'loss/policy_avg': -0.02353023737668991, 'loss/value_avg': 0.3563247621059418, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.018108606338501, 'val/ratio': 0.9833382964134216, 'val/ratio_var': 0.00018341380928177387, 'val/num_eos_tokens': 0, 'lr': 5.316021558059775e-06, 'episode': 7300, 'epoch': 0.89}



                                                                         
 89%|████████▉ | 1826/2041 [2:38:29<18:34,  5.18s/it]

{'eps': 0, 'objective/kl': 68.76427459716797, 'objective/entropy': 40.77411651611328, 'objective/non_score_reward': -3.438214063644409, 'objective/rlhf_reward': -5.667540550231934, 'objective/scores': -2.2293267250061035, 'policy/approxkl_avg': 0.004403069615364075, 'policy/clipfrac_avg': 0.03537735715508461, 'loss/policy_avg': -0.019876059144735336, 'loss/value_avg': 0.36392128467559814, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.828256368637085, 'val/ratio': 0.9917910099029541, 'val/ratio_var': 4.6814584493404254e-05, 'val/num_eos_tokens': 0, 'lr': 5.291523762861342e-06, 'episode': 7304, 'epoch': 0.89}



                                                                         
 90%|████████▉ | 1827/2041 [2:38:34<18:25,  5.16s/it]

{'eps': 0, 'objective/kl': 68.70701599121094, 'objective/entropy': 54.80522537231445, 'objective/non_score_reward': -3.4353506565093994, 'objective/rlhf_reward': -5.9581403732299805, 'objective/scores': -2.522789716720581, 'policy/approxkl_avg': 0.001859098207205534, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.016506273299455643, 'loss/value_avg': 0.38157325983047485, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0160621404647827, 'val/ratio': 0.9923335909843445, 'val/ratio_var': 5.5211519793374464e-05, 'val/num_eos_tokens': 0, 'lr': 5.2670259676629104e-06, 'episode': 7308, 'epoch': 0.9}



                                                                         
 90%|████████▉ | 1828/2041 [2:38:39<18:16,  5.15s/it]

{'eps': 0, 'objective/kl': 65.25563049316406, 'objective/entropy': 48.43694305419922, 'objective/non_score_reward': -3.262781858444214, 'objective/rlhf_reward': -5.483888626098633, 'objective/scores': -2.221106767654419, 'policy/approxkl_avg': 0.0036742431111633778, 'policy/clipfrac_avg': 0.021226413547992706, 'loss/policy_avg': -0.017232682555913925, 'loss/value_avg': 0.5264027118682861, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0272585153579712, 'val/ratio': 0.9909332990646362, 'val/ratio_var': 6.386199675034732e-05, 'val/num_eos_tokens': 0, 'lr': 5.2425281724644785e-06, 'episode': 7312, 'epoch': 0.9}



                                                                         
 90%|████████▉ | 1829/2041 [2:38:45<18:11,  5.15s/it]

{'eps': 0, 'objective/kl': 79.9369888305664, 'objective/entropy': 52.595115661621094, 'objective/non_score_reward': -3.996849536895752, 'objective/rlhf_reward': -5.908656597137451, 'objective/scores': -1.9118070602416992, 'policy/approxkl_avg': 0.002820478053763509, 'policy/clipfrac_avg': 0.03183962032198906, 'loss/policy_avg': -0.021598970517516136, 'loss/value_avg': 0.439919650554657, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1095939874649048, 'val/ratio': 0.9951918721199036, 'val/ratio_var': 1.8703954992815852e-05, 'val/num_eos_tokens': 0, 'lr': 5.2180303772660465e-06, 'episode': 7316, 'epoch': 0.9}



                                                                         
 90%|████████▉ | 1830/2041 [2:38:50<18:04,  5.14s/it]

{'eps': 0, 'objective/kl': 76.55416107177734, 'objective/entropy': 79.3048095703125, 'objective/non_score_reward': -3.8277082443237305, 'objective/rlhf_reward': -5.426592826843262, 'objective/scores': -1.5988844633102417, 'policy/approxkl_avg': 0.005168381612747908, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.0214524045586586, 'loss/value_avg': 0.40752148628234863, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.288081169128418, 'val/ratio': 0.9956884980201721, 'val/ratio_var': 1.1159955647599418e-05, 'val/num_eos_tokens': 0, 'lr': 5.1935325820676146e-06, 'episode': 7320, 'epoch': 0.9}



                                                                         
 90%|████████▉ | 1831/2041 [2:38:55<17:58,  5.14s/it]

{'eps': 0, 'objective/kl': 71.6672592163086, 'objective/entropy': 59.9411735534668, 'objective/non_score_reward': -3.5833630561828613, 'objective/rlhf_reward': -5.622058391571045, 'objective/scores': -2.0386953353881836, 'policy/approxkl_avg': 0.0025228362064808607, 'policy/clipfrac_avg': 0.024764150381088257, 'loss/policy_avg': -0.023093093186616898, 'loss/value_avg': 0.35957616567611694, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0270166397094727, 'val/ratio': 0.9980877041816711, 'val/ratio_var': 2.4404334908467717e-06, 'val/num_eos_tokens': 0, 'lr': 5.169034786869182e-06, 'episode': 7324, 'epoch': 0.9}



                                                                         
 90%|████████▉ | 1832/2041 [2:39:00<17:58,  5.16s/it]

{'eps': 0, 'objective/kl': 78.4194107055664, 'objective/entropy': 72.2212905883789, 'objective/non_score_reward': -3.9209704399108887, 'objective/rlhf_reward': -5.848233222961426, 'objective/scores': -1.927262783050537, 'policy/approxkl_avg': 0.014749592170119286, 'policy/clipfrac_avg': 0.04009433835744858, 'loss/policy_avg': -0.02514282800257206, 'loss/value_avg': 0.6930663585662842, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4698941707611084, 'val/ratio': 0.986953854560852, 'val/ratio_var': 0.00011978575639659539, 'val/num_eos_tokens': 0, 'lr': 5.14453699167075e-06, 'episode': 7328, 'epoch': 0.9}



                                                                         
 90%|████████▉ | 1833/2041 [2:39:05<17:52,  5.16s/it]

{'eps': 0, 'objective/kl': 74.90707397460938, 'objective/entropy': 49.14936447143555, 'objective/non_score_reward': -3.7453534603118896, 'objective/rlhf_reward': -5.166505813598633, 'objective/scores': -1.4211525917053223, 'policy/approxkl_avg': 0.0023083770647644997, 'policy/clipfrac_avg': 0.030660375952720642, 'loss/policy_avg': -0.023752862587571144, 'loss/value_avg': 0.32019704580307007, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9021687507629395, 'val/ratio': 0.9941712021827698, 'val/ratio_var': 2.7168258384335786e-05, 'val/num_eos_tokens': 0, 'lr': 5.120039196472318e-06, 'episode': 7332, 'epoch': 0.9}



                                                                         
 90%|████████▉ | 1834/2041 [2:39:10<17:40,  5.13s/it]

{'eps': 0, 'objective/kl': 75.44210815429688, 'objective/entropy': 54.1601676940918, 'objective/non_score_reward': -3.7721056938171387, 'objective/rlhf_reward': -4.596041679382324, 'objective/scores': -0.8239358067512512, 'policy/approxkl_avg': 0.005757966078817844, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.018092207610607147, 'loss/value_avg': 0.5491064786911011, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2486191987991333, 'val/ratio': 0.9908061027526855, 'val/ratio_var': 7.187268056441098e-05, 'val/num_eos_tokens': 0, 'lr': 5.095541401273886e-06, 'episode': 7336, 'epoch': 0.9}



                                                                         
 90%|████████▉ | 1835/2041 [2:39:15<17:32,  5.11s/it]

{'eps': 0, 'objective/kl': 64.76066589355469, 'objective/entropy': 67.88008880615234, 'objective/non_score_reward': -3.2380330562591553, 'objective/rlhf_reward': -5.186078071594238, 'objective/scores': -1.948045015335083, 'policy/approxkl_avg': 0.0021205462981015444, 'policy/clipfrac_avg': 0.025943394750356674, 'loss/policy_avg': -0.021757330745458603, 'loss/value_avg': 0.4711785316467285, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1573834419250488, 'val/ratio': 1.0045286417007446, 'val/ratio_var': 2.1621817722916603e-05, 'val/num_eos_tokens': 0, 'lr': 5.071043606075453e-06, 'episode': 7340, 'epoch': 0.9}



                                                                         
 90%|████████▉ | 1836/2041 [2:39:20<17:27,  5.11s/it]

{'eps': 0, 'objective/kl': 82.20303344726562, 'objective/entropy': 66.02709197998047, 'objective/non_score_reward': -4.110151767730713, 'objective/rlhf_reward': -6.814481735229492, 'objective/scores': -2.7043299674987793, 'policy/approxkl_avg': 0.001827817875891924, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.018358489498496056, 'loss/value_avg': 0.6204289793968201, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2552409172058105, 'val/ratio': 0.9929713010787964, 'val/ratio_var': 4.205058576189913e-05, 'val/num_eos_tokens': 0, 'lr': 5.046545810877021e-06, 'episode': 7344, 'epoch': 0.9}



                                                                         
 90%|█████████ | 1837/2041 [2:39:26<17:27,  5.14s/it]

{'eps': 0, 'objective/kl': 69.73052978515625, 'objective/entropy': 45.86115264892578, 'objective/non_score_reward': -3.4865264892578125, 'objective/rlhf_reward': -5.873893737792969, 'objective/scores': -2.3873672485351562, 'policy/approxkl_avg': 0.016847914084792137, 'policy/clipfrac_avg': 0.049528300762176514, 'loss/policy_avg': -0.026311280205845833, 'loss/value_avg': 0.6914995908737183, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9469242095947266, 'val/ratio': 0.9784365892410278, 'val/ratio_var': 0.0003310644533485174, 'val/num_eos_tokens': 0, 'lr': 5.022048015678589e-06, 'episode': 7348, 'epoch': 0.9}



                                                                         
 90%|█████████ | 1838/2041 [2:39:34<20:38,  6.10s/it]

{'eps': 0, 'objective/kl': 69.54843139648438, 'objective/entropy': 46.3276252746582, 'objective/non_score_reward': -3.477421998977661, 'objective/rlhf_reward': -5.48515510559082, 'objective/scores': -2.007733106613159, 'policy/approxkl_avg': 0.0034781168214976788, 'policy/clipfrac_avg': 0.016509434208273888, 'loss/policy_avg': -0.01794234663248062, 'loss/value_avg': 0.44103020429611206, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.964709997177124, 'val/ratio': 0.9882151484489441, 'val/ratio_var': 0.000123315243399702, 'val/num_eos_tokens': 0, 'lr': 4.997550220480157e-06, 'episode': 7352, 'epoch': 0.9}



                                                                         
 90%|█████████ | 1839/2041 [2:39:39<19:32,  5.80s/it]

{'eps': 0, 'objective/kl': 64.55010986328125, 'objective/entropy': 51.984519958496094, 'objective/non_score_reward': -3.227505683898926, 'objective/rlhf_reward': -5.545869827270508, 'objective/scores': -2.318364143371582, 'policy/approxkl_avg': 0.005258608143776655, 'policy/clipfrac_avg': 0.02712264284491539, 'loss/policy_avg': -0.01851234957575798, 'loss/value_avg': 0.39926186203956604, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0625029802322388, 'val/ratio': 0.9919360280036926, 'val/ratio_var': 4.6134930016705766e-05, 'val/num_eos_tokens': 0, 'lr': 4.973052425281724e-06, 'episode': 7356, 'epoch': 0.9}



                                                                         
 90%|█████████ | 1840/2041 [2:39:44<18:44,  5.60s/it]

{'eps': 0, 'objective/kl': 65.02979278564453, 'objective/entropy': 61.9721565246582, 'objective/non_score_reward': -3.2514896392822266, 'objective/rlhf_reward': -5.617374420166016, 'objective/scores': -2.365884780883789, 'policy/approxkl_avg': 0.008757026866078377, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.02200690656900406, 'loss/value_avg': 0.5736989974975586, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0762351751327515, 'val/ratio': 0.9877935647964478, 'val/ratio_var': 9.621575009077787e-05, 'val/num_eos_tokens': 0, 'lr': 4.9485546300832925e-06, 'episode': 7360, 'epoch': 0.9}



                                                                         
 90%|█████████ | 1841/2041 [2:39:49<18:12,  5.46s/it]

{'eps': 0, 'objective/kl': 68.15234375, 'objective/entropy': 80.97587585449219, 'objective/non_score_reward': -3.4076170921325684, 'objective/rlhf_reward': -5.414290904998779, 'objective/scores': -2.006673812866211, 'policy/approxkl_avg': 0.001936521613970399, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.021757064387202263, 'loss/value_avg': 0.3395739197731018, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4400101900100708, 'val/ratio': 0.9948527812957764, 'val/ratio_var': 2.479232171026524e-05, 'val/num_eos_tokens': 0, 'lr': 4.9240568348848605e-06, 'episode': 7364, 'epoch': 0.9}



                                                                         
 90%|█████████ | 1842/2041 [2:39:55<17:52,  5.39s/it]

{'eps': 0, 'objective/kl': 64.00556945800781, 'objective/entropy': 50.06722640991211, 'objective/non_score_reward': -3.2002787590026855, 'objective/rlhf_reward': -4.763332366943359, 'objective/scores': -1.563053846359253, 'policy/approxkl_avg': 0.0022922256030142307, 'policy/clipfrac_avg': 0.02004716917872429, 'loss/policy_avg': -0.01852620765566826, 'loss/value_avg': 0.46145689487457275, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0916075706481934, 'val/ratio': 0.9922212958335876, 'val/ratio_var': 5.01140566484537e-05, 'val/num_eos_tokens': 0, 'lr': 4.8995590396864285e-06, 'episode': 7368, 'epoch': 0.9}



                                                                         
 90%|█████████ | 1843/2041 [2:40:00<17:35,  5.33s/it]

{'eps': 0, 'objective/kl': 73.73077392578125, 'objective/entropy': 62.004844665527344, 'objective/non_score_reward': -3.6865384578704834, 'objective/rlhf_reward': -4.6843461990356445, 'objective/scores': -0.9978076815605164, 'policy/approxkl_avg': 0.0035965137649327517, 'policy/clipfrac_avg': 0.02712264098227024, 'loss/policy_avg': -0.023285595700144768, 'loss/value_avg': 0.6171079874038696, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.270637035369873, 'val/ratio': 0.9924807548522949, 'val/ratio_var': 4.205315781291574e-05, 'val/num_eos_tokens': 0, 'lr': 4.875061244487996e-06, 'episode': 7372, 'epoch': 0.9}



                                                                         
 90%|█████████ | 1844/2041 [2:40:05<17:25,  5.31s/it]

{'eps': 0, 'objective/kl': 79.27880859375, 'objective/entropy': 66.36638641357422, 'objective/non_score_reward': -3.963940143585205, 'objective/rlhf_reward': -5.667480945587158, 'objective/scores': -1.7035406827926636, 'policy/approxkl_avg': 0.0051600257866084576, 'policy/clipfrac_avg': 0.03183962404727936, 'loss/policy_avg': -0.02270134724676609, 'loss/value_avg': 0.7904955148696899, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.336587905883789, 'val/ratio': 0.9907002449035645, 'val/ratio_var': 6.36488912277855e-05, 'val/num_eos_tokens': 0, 'lr': 4.850563449289565e-06, 'episode': 7376, 'epoch': 0.9}



                                                                         
 90%|█████████ | 1845/2041 [2:40:10<17:07,  5.24s/it]

{'eps': 0, 'objective/kl': 79.15457153320312, 'objective/entropy': 78.158203125, 'objective/non_score_reward': -3.957728385925293, 'objective/rlhf_reward': -5.727748394012451, 'objective/scores': -1.7700198888778687, 'policy/approxkl_avg': 0.0032261598389595747, 'policy/clipfrac_avg': 0.03183962404727936, 'loss/policy_avg': -0.023792896419763565, 'loss/value_avg': 0.5373930931091309, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4274036884307861, 'val/ratio': 0.992570161819458, 'val/ratio_var': 4.4800894102081656e-05, 'val/num_eos_tokens': 0, 'lr': 4.826065654091132e-06, 'episode': 7380, 'epoch': 0.9}



                                                                         
 90%|█████████ | 1846/2041 [2:40:15<16:50,  5.18s/it]

{'eps': 0, 'objective/kl': 76.79438781738281, 'objective/entropy': 63.22144317626953, 'objective/non_score_reward': -3.839719772338867, 'objective/rlhf_reward': -5.910165786743164, 'objective/scores': -2.070446014404297, 'policy/approxkl_avg': 0.004333212971687317, 'policy/clipfrac_avg': 0.033018868416547775, 'loss/policy_avg': -0.024612832814455032, 'loss/value_avg': 0.606988787651062, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.164125680923462, 'val/ratio': 0.9916268587112427, 'val/ratio_var': 4.732887100544758e-05, 'val/num_eos_tokens': 0, 'lr': 4.8015678588927e-06, 'episode': 7384, 'epoch': 0.9}



                                                                         
 90%|█████████ | 1847/2041 [2:40:20<16:40,  5.16s/it]

{'eps': 0, 'objective/kl': 69.84089660644531, 'objective/entropy': 59.91244125366211, 'objective/non_score_reward': -3.4920451641082764, 'objective/rlhf_reward': -5.985686302185059, 'objective/scores': -2.493640899658203, 'policy/approxkl_avg': 0.002138844458386302, 'policy/clipfrac_avg': 0.024764150381088257, 'loss/policy_avg': -0.021083829924464226, 'loss/value_avg': 0.6066827774047852, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2556836605072021, 'val/ratio': 0.990727424621582, 'val/ratio_var': 7.671735511394218e-05, 'val/num_eos_tokens': 0, 'lr': 4.777070063694268e-06, 'episode': 7388, 'epoch': 0.91}



                                                                         
 91%|█████████ | 1848/2041 [2:40:25<16:31,  5.14s/it]

{'eps': 0, 'objective/kl': 75.10888671875, 'objective/entropy': 53.838890075683594, 'objective/non_score_reward': -3.7554445266723633, 'objective/rlhf_reward': -5.273506164550781, 'objective/scores': -1.5180613994598389, 'policy/approxkl_avg': 0.0017810900462791324, 'policy/clipfrac_avg': 0.02004716917872429, 'loss/policy_avg': -0.01960689388215542, 'loss/value_avg': 0.46618837118148804, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0773414373397827, 'val/ratio': 0.9927058219909668, 'val/ratio_var': 4.649817492463626e-05, 'val/num_eos_tokens': 0, 'lr': 4.752572268495836e-06, 'episode': 7392, 'epoch': 0.91}



                                                                         
 91%|█████████ | 1849/2041 [2:40:30<16:26,  5.14s/it]

{'eps': 0, 'objective/kl': 87.87020874023438, 'objective/entropy': 68.4683609008789, 'objective/non_score_reward': -4.393510341644287, 'objective/rlhf_reward': -6.270103454589844, 'objective/scores': -1.8765933513641357, 'policy/approxkl_avg': 0.0037803766317665577, 'policy/clipfrac_avg': 0.028301887214183807, 'loss/policy_avg': -0.020921604707837105, 'loss/value_avg': 0.6246470212936401, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3104612827301025, 'val/ratio': 0.9913167357444763, 'val/ratio_var': 5.5110383982537314e-05, 'val/num_eos_tokens': 0, 'lr': 4.728074473297404e-06, 'episode': 7396, 'epoch': 0.91}



                                                                         
 91%|█████████ | 1850/2041 [2:40:36<16:27,  5.17s/it]

{'eps': 0, 'objective/kl': 70.93304443359375, 'objective/entropy': 66.25291442871094, 'objective/non_score_reward': -3.546652317047119, 'objective/rlhf_reward': -5.866194248199463, 'objective/scores': -2.3195419311523438, 'policy/approxkl_avg': 0.0018625161610543728, 'policy/clipfrac_avg': 0.017688680440187454, 'loss/policy_avg': -0.01950075849890709, 'loss/value_avg': 0.5624319911003113, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1889069080352783, 'val/ratio': 0.9920772314071655, 'val/ratio_var': 5.5420168791897595e-05, 'val/num_eos_tokens': 0, 'lr': 4.703576678098971e-06, 'episode': 7400, 'epoch': 0.91}



                                                                         
 91%|█████████ | 1851/2041 [2:40:41<16:24,  5.18s/it]

{'eps': 0, 'objective/kl': 72.50505065917969, 'objective/entropy': 63.16341781616211, 'objective/non_score_reward': -3.6252524852752686, 'objective/rlhf_reward': -5.374990940093994, 'objective/scores': -1.7497385740280151, 'policy/approxkl_avg': 0.0028488130774348974, 'policy/clipfrac_avg': 0.021226415410637856, 'loss/policy_avg': -0.017660649493336678, 'loss/value_avg': 0.6314269304275513, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1780130863189697, 'val/ratio': 0.9937283396720886, 'val/ratio_var': 3.274258415331133e-05, 'val/num_eos_tokens': 0, 'lr': 4.679078882900539e-06, 'episode': 7404, 'epoch': 0.91}



                                                                         
 91%|█████████ | 1852/2041 [2:40:46<16:14,  5.15s/it]

{'eps': 0, 'objective/kl': 77.53628540039062, 'objective/entropy': 67.94168090820312, 'objective/non_score_reward': -3.876814365386963, 'objective/rlhf_reward': -5.34586763381958, 'objective/scores': -1.4690533876419067, 'policy/approxkl_avg': 0.007307713385671377, 'policy/clipfrac_avg': 0.03773584961891174, 'loss/policy_avg': -0.02402583323419094, 'loss/value_avg': 0.5640735030174255, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2206425666809082, 'val/ratio': 0.9897578954696655, 'val/ratio_var': 7.807646034052595e-05, 'val/num_eos_tokens': 0, 'lr': 4.654581087702107e-06, 'episode': 7408, 'epoch': 0.91}



                                                                         
 91%|█████████ | 1853/2041 [2:40:51<16:11,  5.17s/it]

{'eps': 0, 'objective/kl': 77.70654296875, 'objective/entropy': 62.742431640625, 'objective/non_score_reward': -3.885326862335205, 'objective/rlhf_reward': -6.14032506942749, 'objective/scores': -2.254998207092285, 'policy/approxkl_avg': 0.0030821296386420727, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.019362686201930046, 'loss/value_avg': 0.5342935919761658, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2137434482574463, 'val/ratio': 0.9917027354240417, 'val/ratio_var': 5.4194311815081164e-05, 'val/num_eos_tokens': 0, 'lr': 4.630083292503675e-06, 'episode': 7412, 'epoch': 0.91}



                                                                         
 91%|█████████ | 1854/2041 [2:40:56<16:04,  5.16s/it]

{'eps': 0, 'objective/kl': 72.21832275390625, 'objective/entropy': 81.90921020507812, 'objective/non_score_reward': -3.6109161376953125, 'objective/rlhf_reward': -5.981127738952637, 'objective/scores': -2.370211362838745, 'policy/approxkl_avg': 0.003345847362652421, 'policy/clipfrac_avg': 0.033018868416547775, 'loss/policy_avg': -0.024173367768526077, 'loss/value_avg': 0.5753371715545654, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.4399067163467407, 'val/ratio': 0.9896049499511719, 'val/ratio_var': 9.013250382849947e-05, 'val/num_eos_tokens': 0, 'lr': 4.6055854973052425e-06, 'episode': 7416, 'epoch': 0.91}



                                                                         
 91%|█████████ | 1855/2041 [2:41:02<15:59,  5.16s/it]

{'eps': 0, 'objective/kl': 91.90662384033203, 'objective/entropy': 67.16437530517578, 'objective/non_score_reward': -4.595331192016602, 'objective/rlhf_reward': -6.579745292663574, 'objective/scores': -1.9844141006469727, 'policy/approxkl_avg': 0.0038740152958780527, 'policy/clipfrac_avg': 0.028301887214183807, 'loss/policy_avg': -0.02236153744161129, 'loss/value_avg': 0.7112436294555664, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.300439476966858, 'val/ratio': 0.9907881021499634, 'val/ratio_var': 6.353814387694001e-05, 'val/num_eos_tokens': 0, 'lr': 4.5810877021068106e-06, 'episode': 7420, 'epoch': 0.91}



                                                                         
 91%|█████████ | 1856/2041 [2:41:07<15:51,  5.14s/it]

{'eps': 0, 'objective/kl': 61.54814910888672, 'objective/entropy': 57.943599700927734, 'objective/non_score_reward': -3.0774073600769043, 'objective/rlhf_reward': -4.3151350021362305, 'objective/scores': -1.237727403640747, 'policy/approxkl_avg': 0.0018942034803330898, 'policy/clipfrac_avg': 0.02004716917872429, 'loss/policy_avg': -0.018829990178346634, 'loss/value_avg': 0.5947514176368713, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1847995519638062, 'val/ratio': 0.9956992864608765, 'val/ratio_var': 1.7248203221242875e-05, 'val/num_eos_tokens': 0, 'lr': 4.556589906908379e-06, 'episode': 7424, 'epoch': 0.91}



                                                                         
 91%|█████████ | 1857/2041 [2:41:12<15:44,  5.14s/it]

{'eps': 0, 'objective/kl': 61.87168884277344, 'objective/entropy': 50.00400924682617, 'objective/non_score_reward': -3.0935845375061035, 'objective/rlhf_reward': -5.702910423278809, 'objective/scores': -2.609325885772705, 'policy/approxkl_avg': 0.0036330504808574915, 'policy/clipfrac_avg': 0.021226415410637856, 'loss/policy_avg': -0.020238079130649567, 'loss/value_avg': 0.5240119099617004, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.132702112197876, 'val/ratio': 0.9884103536605835, 'val/ratio_var': 0.00010703354928409681, 'val/num_eos_tokens': 0, 'lr': 4.532092111709947e-06, 'episode': 7428, 'epoch': 0.91}



                                                                         
 91%|█████████ | 1858/2041 [2:41:17<15:41,  5.15s/it]

{'eps': 0, 'objective/kl': 62.418663024902344, 'objective/entropy': 45.8231201171875, 'objective/non_score_reward': -3.1209330558776855, 'objective/rlhf_reward': -5.323154449462891, 'objective/scores': -2.202221632003784, 'policy/approxkl_avg': 0.0009014220559038222, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.012564541772007942, 'loss/value_avg': 0.5556970238685608, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9186466932296753, 'val/ratio': 0.9947330951690674, 'val/ratio_var': 2.6373307264293544e-05, 'val/num_eos_tokens': 0, 'lr': 4.507594316511514e-06, 'episode': 7432, 'epoch': 0.91}



                                                                         
 91%|█████████ | 1859/2041 [2:41:22<15:36,  5.15s/it]

{'eps': 0, 'objective/kl': 80.03323364257812, 'objective/entropy': 59.111228942871094, 'objective/non_score_reward': -4.001661777496338, 'objective/rlhf_reward': -7.202889442443848, 'objective/scores': -3.2012274265289307, 'policy/approxkl_avg': 0.002527887001633644, 'policy/clipfrac_avg': 0.025943394750356674, 'loss/policy_avg': -0.020541924983263016, 'loss/value_avg': 0.6199184060096741, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2173011302947998, 'val/ratio': 0.9902631044387817, 'val/ratio_var': 7.970802107593045e-05, 'val/num_eos_tokens': 0, 'lr': 4.483096521313082e-06, 'episode': 7436, 'epoch': 0.91}



                                                                         
 91%|█████████ | 1860/2041 [2:41:27<15:29,  5.14s/it]

{'eps': 0, 'objective/kl': 71.50439453125, 'objective/entropy': 55.135215759277344, 'objective/non_score_reward': -3.5752196311950684, 'objective/rlhf_reward': -5.834534645080566, 'objective/scores': -2.259314775466919, 'policy/approxkl_avg': 0.004950630944222212, 'policy/clipfrac_avg': 0.029481131583452225, 'loss/policy_avg': -0.026036201044917107, 'loss/value_avg': 0.5406119227409363, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0926088094711304, 'val/ratio': 0.9889711141586304, 'val/ratio_var': 8.943904686020687e-05, 'val/num_eos_tokens': 0, 'lr': 4.45859872611465e-06, 'episode': 7440, 'epoch': 0.91}



                                                                         
 91%|█████████ | 1861/2041 [2:41:32<15:20,  5.11s/it]

{'eps': 0, 'objective/kl': 69.78695678710938, 'objective/entropy': 61.74216842651367, 'objective/non_score_reward': -3.4893479347229004, 'objective/rlhf_reward': -5.5011982917785645, 'objective/scores': -2.011850357055664, 'policy/approxkl_avg': 0.004680868238210678, 'policy/clipfrac_avg': 0.03183962404727936, 'loss/policy_avg': -0.02377147413790226, 'loss/value_avg': 0.42071807384490967, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2181769609451294, 'val/ratio': 0.9941389560699463, 'val/ratio_var': 2.2030539184925146e-05, 'val/num_eos_tokens': 0, 'lr': 4.434100930916218e-06, 'episode': 7444, 'epoch': 0.91}



                                                                         
 91%|█████████ | 1862/2041 [2:41:37<15:22,  5.15s/it]

{'eps': 0, 'objective/kl': 71.65899658203125, 'objective/entropy': 66.72124481201172, 'objective/non_score_reward': -3.5829498767852783, 'objective/rlhf_reward': -5.521650314331055, 'objective/scores': -1.938700556755066, 'policy/approxkl_avg': 0.0011595410760492086, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.02002807892858982, 'loss/value_avg': 0.4411385655403137, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.242313265800476, 'val/ratio': 0.9938084483146667, 'val/ratio_var': 3.6147615901427343e-05, 'val/num_eos_tokens': 0, 'lr': 4.409603135717785e-06, 'episode': 7448, 'epoch': 0.91}



                                                                         
 91%|█████████▏| 1863/2041 [2:41:43<15:20,  5.17s/it]

{'eps': 0, 'objective/kl': 72.16497802734375, 'objective/entropy': 64.72537231445312, 'objective/non_score_reward': -3.608248710632324, 'objective/rlhf_reward': -5.094008445739746, 'objective/scores': -1.485759973526001, 'policy/approxkl_avg': 0.0016674452926963568, 'policy/clipfrac_avg': 0.02004716917872429, 'loss/policy_avg': -0.021302133798599243, 'loss/value_avg': 0.4066661596298218, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2031440734863281, 'val/ratio': 0.9997533559799194, 'val/ratio_var': 8.020084152349227e-08, 'val/num_eos_tokens': 0, 'lr': 4.385105340519353e-06, 'episode': 7452, 'epoch': 0.91}



                                                                         
 91%|█████████▏| 1864/2041 [2:41:48<15:13,  5.16s/it]

{'eps': 0, 'objective/kl': 83.41838073730469, 'objective/entropy': 54.86674499511719, 'objective/non_score_reward': -4.170918941497803, 'objective/rlhf_reward': -6.5048747062683105, 'objective/scores': -2.333955764770508, 'policy/approxkl_avg': 0.003518444951623678, 'policy/clipfrac_avg': 0.025943394750356674, 'loss/policy_avg': -0.02022046223282814, 'loss/value_avg': 0.9017229080200195, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0887824296951294, 'val/ratio': 0.9920628070831299, 'val/ratio_var': 4.62137431895826e-05, 'val/num_eos_tokens': 0, 'lr': 4.360607545320921e-06, 'episode': 7456, 'epoch': 0.91}



                                                                         
 91%|█████████▏| 1865/2041 [2:41:53<15:08,  5.16s/it]

{'eps': 0, 'objective/kl': 69.72744750976562, 'objective/entropy': 47.429710388183594, 'objective/non_score_reward': -3.486372470855713, 'objective/rlhf_reward': -6.010279655456543, 'objective/scores': -2.523907423019409, 'policy/approxkl_avg': 0.0022533570881932974, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.01871424913406372, 'loss/value_avg': 0.4763941168785095, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8889682292938232, 'val/ratio': 0.9906891584396362, 'val/ratio_var': 7.169922901084647e-05, 'val/num_eos_tokens': 0, 'lr': 4.336109750122489e-06, 'episode': 7460, 'epoch': 0.91}



                                                                         
 91%|█████████▏| 1866/2041 [2:41:58<14:56,  5.13s/it]

{'eps': 0, 'objective/kl': 70.81460571289062, 'objective/entropy': 63.18988037109375, 'objective/non_score_reward': -3.5407304763793945, 'objective/rlhf_reward': -6.12522554397583, 'objective/scores': -2.5844950675964355, 'policy/approxkl_avg': 0.002290126169100404, 'policy/clipfrac_avg': 0.028301887214183807, 'loss/policy_avg': -0.020293932408094406, 'loss/value_avg': 0.6281205415725708, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1692068576812744, 'val/ratio': 0.9928688406944275, 'val/ratio_var': 4.8100348067237064e-05, 'val/num_eos_tokens': 0, 'lr': 4.311611954924057e-06, 'episode': 7464, 'epoch': 0.91}



                                                                         
 91%|█████████▏| 1867/2041 [2:42:03<14:51,  5.13s/it]

{'eps': 0, 'objective/kl': 64.07502746582031, 'objective/entropy': 66.3912582397461, 'objective/non_score_reward': -3.203751802444458, 'objective/rlhf_reward': -5.163553237915039, 'objective/scores': -1.959801197052002, 'policy/approxkl_avg': 0.0014911734033375978, 'policy/clipfrac_avg': 0.015330188907682896, 'loss/policy_avg': -0.01624045893549919, 'loss/value_avg': 0.2844797968864441, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0776222944259644, 'val/ratio': 0.9941965341567993, 'val/ratio_var': 2.9801527489325963e-05, 'val/num_eos_tokens': 0, 'lr': 4.2871141597256245e-06, 'episode': 7468, 'epoch': 0.91}



                                                                         
 92%|█████████▏| 1868/2041 [2:42:08<14:46,  5.13s/it]

{'eps': 0, 'objective/kl': 66.13304138183594, 'objective/entropy': 45.87892150878906, 'objective/non_score_reward': -3.3066518306732178, 'objective/rlhf_reward': -5.917817115783691, 'objective/scores': -2.6111650466918945, 'policy/approxkl_avg': 0.00353845558129251, 'policy/clipfrac_avg': 0.02712263911962509, 'loss/policy_avg': -0.02347962185740471, 'loss/value_avg': 0.5654397010803223, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0083202123641968, 'val/ratio': 0.9948071837425232, 'val/ratio_var': 1.9620558305177838e-05, 'val/num_eos_tokens': 0, 'lr': 4.262616364527193e-06, 'episode': 7472, 'epoch': 0.92}



                                                                         
 92%|█████████▏| 1869/2041 [2:42:13<14:40,  5.12s/it]

{'eps': 0, 'objective/kl': 84.9007568359375, 'objective/entropy': 78.13285827636719, 'objective/non_score_reward': -4.245038032531738, 'objective/rlhf_reward': -6.069767951965332, 'objective/scores': -1.8247299194335938, 'policy/approxkl_avg': 0.0026742287445813417, 'policy/clipfrac_avg': 0.024764150381088257, 'loss/policy_avg': -0.025743816047906876, 'loss/value_avg': 0.8403062224388123, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.5310494899749756, 'val/ratio': 0.999340832233429, 'val/ratio_var': 3.6111018175688514e-07, 'val/num_eos_tokens': 0, 'lr': 4.238118569328761e-06, 'episode': 7476, 'epoch': 0.92}



                                                                         
 92%|█████████▏| 1870/2041 [2:42:19<14:35,  5.12s/it]

{'eps': 0, 'objective/kl': 71.97599029541016, 'objective/entropy': 70.35200500488281, 'objective/non_score_reward': -3.598799705505371, 'objective/rlhf_reward': -6.044046878814697, 'objective/scores': -2.445247173309326, 'policy/approxkl_avg': 0.0014366770628839731, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.015548723749816418, 'loss/value_avg': 0.6645388603210449, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2245686054229736, 'val/ratio': 0.9947312474250793, 'val/ratio_var': 2.3342050553765148e-05, 'val/num_eos_tokens': 0, 'lr': 4.213620774130329e-06, 'episode': 7480, 'epoch': 0.92}



                                                                         
 92%|█████████▏| 1871/2041 [2:42:24<14:31,  5.13s/it]

{'eps': 0, 'objective/kl': 62.21548843383789, 'objective/entropy': 42.417823791503906, 'objective/non_score_reward': -3.110774517059326, 'objective/rlhf_reward': -5.382974624633789, 'objective/scores': -2.272200345993042, 'policy/approxkl_avg': 0.0010712167713791132, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.012708881869912148, 'loss/value_avg': 0.39680910110473633, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8743895292282104, 'val/ratio': 0.9937691688537598, 'val/ratio_var': 3.7810063076904044e-05, 'val/num_eos_tokens': 0, 'lr': 4.189122978931896e-06, 'episode': 7484, 'epoch': 0.92}



                                                                         
 92%|█████████▏| 1872/2041 [2:42:29<14:28,  5.14s/it]

{'eps': 0, 'objective/kl': 66.65406799316406, 'objective/entropy': 48.55487823486328, 'objective/non_score_reward': -3.3327033519744873, 'objective/rlhf_reward': -5.942300796508789, 'objective/scores': -2.6095974445343018, 'policy/approxkl_avg': 0.0020325530786067247, 'policy/clipfrac_avg': 0.021226413547992706, 'loss/policy_avg': -0.015140353702008724, 'loss/value_avg': 0.522210955619812, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8272747993469238, 'val/ratio': 0.9905346632003784, 'val/ratio_var': 7.492344593629241e-05, 'val/num_eos_tokens': 0, 'lr': 4.164625183733464e-06, 'episode': 7488, 'epoch': 0.92}



                                                                         
 92%|█████████▏| 1873/2041 [2:42:34<14:29,  5.17s/it]

{'eps': 0, 'objective/kl': 75.7921142578125, 'objective/entropy': 63.25975036621094, 'objective/non_score_reward': -3.7896060943603516, 'objective/rlhf_reward': -6.408876419067383, 'objective/scores': -2.6192703247070312, 'policy/approxkl_avg': 0.006233884021639824, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.019912729039788246, 'loss/value_avg': 0.7339345812797546, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.267709493637085, 'val/ratio': 1.0024726390838623, 'val/ratio_var': 6.534824478876544e-06, 'val/num_eos_tokens': 0, 'lr': 4.140127388535032e-06, 'episode': 7492, 'epoch': 0.92}



                                                                         
 92%|█████████▏| 1874/2041 [2:42:39<14:17,  5.13s/it]

{'eps': 0, 'objective/kl': 76.58708190917969, 'objective/entropy': 65.20755767822266, 'objective/non_score_reward': -3.8293538093566895, 'objective/rlhf_reward': -5.754819869995117, 'objective/scores': -1.9254658222198486, 'policy/approxkl_avg': 0.003199018072336912, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.02131044678390026, 'loss/value_avg': 0.47991663217544556, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1666619777679443, 'val/ratio': 0.9884191155433655, 'val/ratio_var': 0.00011564927262952551, 'val/num_eos_tokens': 0, 'lr': 4.1156295933366e-06, 'episode': 7496, 'epoch': 0.92}



                                                                         
 92%|█████████▏| 1875/2041 [2:42:44<14:11,  5.13s/it]

{'eps': 0, 'objective/kl': 81.36604309082031, 'objective/entropy': 60.25325012207031, 'objective/non_score_reward': -4.068302631378174, 'objective/rlhf_reward': -6.390223026275635, 'objective/scores': -2.321920394897461, 'policy/approxkl_avg': 0.0025675026699900627, 'policy/clipfrac_avg': 0.016509434208273888, 'loss/policy_avg': -0.0183543860912323, 'loss/value_avg': 0.6027785539627075, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2179566621780396, 'val/ratio': 1.0024969577789307, 'val/ratio_var': 7.754665602988098e-06, 'val/num_eos_tokens': 0, 'lr': 4.091131798138167e-06, 'episode': 7500, 'epoch': 0.92}



                                                                         
 92%|█████████▏| 1876/2041 [2:42:49<14:09,  5.15s/it]

{'eps': 0, 'objective/kl': 85.0054931640625, 'objective/entropy': 61.690025329589844, 'objective/non_score_reward': -4.250274658203125, 'objective/rlhf_reward': -6.794891357421875, 'objective/scores': -2.54461669921875, 'policy/approxkl_avg': 0.0026333974674344063, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.02053140476346016, 'loss/value_avg': 0.6547550559043884, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2505078315734863, 'val/ratio': 0.9970338344573975, 'val/ratio_var': 5.815182248625206e-06, 'val/num_eos_tokens': 0, 'lr': 4.066634002939736e-06, 'episode': 7504, 'epoch': 0.92}



                                                                         
 92%|█████████▏| 1877/2041 [2:42:55<14:02,  5.14s/it]

{'eps': 0, 'objective/kl': 65.4298095703125, 'objective/entropy': 53.480812072753906, 'objective/non_score_reward': -3.2714903354644775, 'objective/rlhf_reward': -4.618802070617676, 'objective/scores': -1.3473118543624878, 'policy/approxkl_avg': 0.0012710249284282327, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.01726446859538555, 'loss/value_avg': 0.4221729338169098, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.035651445388794, 'val/ratio': 0.9956343173980713, 'val/ratio_var': 1.7787539036362432e-05, 'val/num_eos_tokens': 0, 'lr': 4.042136207741303e-06, 'episode': 7508, 'epoch': 0.92}



                                                                         
 92%|█████████▏| 1878/2041 [2:43:00<13:57,  5.14s/it]

{'eps': 0, 'objective/kl': 70.76065063476562, 'objective/entropy': 58.38621520996094, 'objective/non_score_reward': -3.5380325317382812, 'objective/rlhf_reward': -4.589864730834961, 'objective/scores': -1.0518323183059692, 'policy/approxkl_avg': 0.0018151220865547657, 'policy/clipfrac_avg': 0.02004716917872429, 'loss/policy_avg': -0.01793062314391136, 'loss/value_avg': 0.6600009202957153, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1340415477752686, 'val/ratio': 0.9944137930870056, 'val/ratio_var': 3.3039261325029656e-05, 'val/num_eos_tokens': 0, 'lr': 4.017638412542871e-06, 'episode': 7512, 'epoch': 0.92}



                                                                         
 92%|█████████▏| 1879/2041 [2:43:05<13:55,  5.16s/it]

{'eps': 0, 'objective/kl': 68.73966979980469, 'objective/entropy': 73.0337905883789, 'objective/non_score_reward': -3.436983823776245, 'objective/rlhf_reward': -6.355805397033691, 'objective/scores': -2.918821334838867, 'policy/approxkl_avg': 0.0015620035119354725, 'policy/clipfrac_avg': 0.017688680440187454, 'loss/policy_avg': -0.019889753311872482, 'loss/value_avg': 0.4567868113517761, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.298840045928955, 'val/ratio': 0.9980806112289429, 'val/ratio_var': 4.5779083848174196e-06, 'val/num_eos_tokens': 0, 'lr': 3.993140617344439e-06, 'episode': 7516, 'epoch': 0.92}



                                                                         
 92%|█████████▏| 1880/2041 [2:43:10<13:46,  5.13s/it]

{'eps': 0, 'objective/kl': 71.61390686035156, 'objective/entropy': 60.12010192871094, 'objective/non_score_reward': -3.580695867538452, 'objective/rlhf_reward': -6.303868293762207, 'objective/scores': -2.723172664642334, 'policy/approxkl_avg': 0.004439346957951784, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.020073169842362404, 'loss/value_avg': 0.5394243001937866, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0684237480163574, 'val/ratio': 0.9949084520339966, 'val/ratio_var': 1.8789603927871212e-05, 'val/num_eos_tokens': 0, 'lr': 3.968642822146007e-06, 'episode': 7520, 'epoch': 0.92}



                                                                         
 92%|█████████▏| 1881/2041 [2:43:15<13:44,  5.15s/it]

{'eps': 0, 'objective/kl': 84.98895263671875, 'objective/entropy': 63.65934371948242, 'objective/non_score_reward': -4.249447822570801, 'objective/rlhf_reward': -6.387378215789795, 'objective/scores': -2.137930393218994, 'policy/approxkl_avg': 0.0020442954264581203, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.016868093982338905, 'loss/value_avg': 0.4649852216243744, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2696869373321533, 'val/ratio': 0.9928597211837769, 'val/ratio_var': 4.61530544271227e-05, 'val/num_eos_tokens': 0, 'lr': 3.9441450269475755e-06, 'episode': 7524, 'epoch': 0.92}



                                                                         
 92%|█████████▏| 1882/2041 [2:43:20<13:42,  5.17s/it]

{'eps': 0, 'objective/kl': 74.06985473632812, 'objective/entropy': 63.817344665527344, 'objective/non_score_reward': -3.7034928798675537, 'objective/rlhf_reward': -5.494551658630371, 'objective/scores': -1.7910585403442383, 'policy/approxkl_avg': 0.002014830010011792, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.014309841208159924, 'loss/value_avg': 0.4802953004837036, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.200972080230713, 'val/ratio': 0.9941274523735046, 'val/ratio_var': 2.9199954951764084e-05, 'val/num_eos_tokens': 0, 'lr': 3.919647231749143e-06, 'episode': 7528, 'epoch': 0.92}



                                                                         
 92%|█████████▏| 1883/2041 [2:43:26<13:38,  5.18s/it]

{'eps': 0, 'objective/kl': 70.85244750976562, 'objective/entropy': 85.9132080078125, 'objective/non_score_reward': -3.5426220893859863, 'objective/rlhf_reward': -5.229671955108643, 'objective/scores': -1.6870497465133667, 'policy/approxkl_avg': 0.008309571072459221, 'policy/clipfrac_avg': 0.03183962404727936, 'loss/policy_avg': -0.02886049449443817, 'loss/value_avg': 0.4915178120136261, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.422347068786621, 'val/ratio': 0.9956845045089722, 'val/ratio_var': 1.0205586477241013e-05, 'val/num_eos_tokens': 0, 'lr': 3.895149436550711e-06, 'episode': 7532, 'epoch': 0.92}



                                                                         
 92%|█████████▏| 1884/2041 [2:43:31<13:30,  5.16s/it]

{'eps': 0, 'objective/kl': 75.8259048461914, 'objective/entropy': 54.66175842285156, 'objective/non_score_reward': -3.7912955284118652, 'objective/rlhf_reward': -5.408461570739746, 'objective/scores': -1.6171660423278809, 'policy/approxkl_avg': 0.006246573757380247, 'policy/clipfrac_avg': 0.040094342082738876, 'loss/policy_avg': -0.023083792999386787, 'loss/value_avg': 0.4518829584121704, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.040410041809082, 'val/ratio': 0.9963757991790771, 'val/ratio_var': 7.773903234919999e-06, 'val/num_eos_tokens': 0, 'lr': 3.870651641352279e-06, 'episode': 7536, 'epoch': 0.92}



                                                                         
 92%|█████████▏| 1885/2041 [2:43:36<13:22,  5.14s/it]

{'eps': 0, 'objective/kl': 69.4356460571289, 'objective/entropy': 44.84682846069336, 'objective/non_score_reward': -3.4717824459075928, 'objective/rlhf_reward': -6.41610050201416, 'objective/scores': -2.9443182945251465, 'policy/approxkl_avg': 0.0008165376493707299, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.0116163594648242, 'loss/value_avg': 0.4507545232772827, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9563086032867432, 'val/ratio': 0.9919841885566711, 'val/ratio_var': 6.814189691795036e-05, 'val/num_eos_tokens': 0, 'lr': 3.846153846153847e-06, 'episode': 7540, 'epoch': 0.92}



                                                                         
 92%|█████████▏| 1886/2041 [2:43:41<13:19,  5.16s/it]

{'eps': 0, 'objective/kl': 67.16775512695312, 'objective/entropy': 51.692569732666016, 'objective/non_score_reward': -3.3583879470825195, 'objective/rlhf_reward': -6.089779853820801, 'objective/scores': -2.7313919067382812, 'policy/approxkl_avg': 0.006809852086007595, 'policy/clipfrac_avg': 0.020047171041369438, 'loss/policy_avg': -0.01765303872525692, 'loss/value_avg': 0.4579089283943176, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0080757141113281, 'val/ratio': 1.0090199708938599, 'val/ratio_var': 7.521026418544352e-05, 'val/num_eos_tokens': 0, 'lr': 3.821656050955414e-06, 'episode': 7544, 'epoch': 0.92}



                                                                         
 92%|█████████▏| 1887/2041 [2:43:46<13:14,  5.16s/it]

{'eps': 0, 'objective/kl': 83.04182434082031, 'objective/entropy': 59.42464065551758, 'objective/non_score_reward': -4.152091026306152, 'objective/rlhf_reward': -5.679690837860107, 'objective/scores': -1.527599811553955, 'policy/approxkl_avg': 0.0014890520833432674, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.014815657399594784, 'loss/value_avg': 0.516144335269928, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.096043586730957, 'val/ratio': 0.9946781396865845, 'val/ratio_var': 2.486725861672312e-05, 'val/num_eos_tokens': 0, 'lr': 3.797158255756982e-06, 'episode': 7548, 'epoch': 0.92}



                                                                         
 93%|█████████▎| 1888/2041 [2:43:51<13:12,  5.18s/it]

{'eps': 0, 'objective/kl': 78.43063354492188, 'objective/entropy': 66.12234497070312, 'objective/non_score_reward': -3.9215316772460938, 'objective/rlhf_reward': -5.447571277618408, 'objective/scores': -1.526039481163025, 'policy/approxkl_avg': 0.002328169997781515, 'policy/clipfrac_avg': 0.021226413547992706, 'loss/policy_avg': -0.019403595477342606, 'loss/value_avg': 0.5824805498123169, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1886985301971436, 'val/ratio': 1.0000724792480469, 'val/ratio_var': 9.908190889973412e-08, 'val/num_eos_tokens': 0, 'lr': 3.7726604605585497e-06, 'episode': 7552, 'epoch': 0.93}



                                                                         
 93%|█████████▎| 1889/2041 [2:43:57<13:05,  5.17s/it]

{'eps': 0, 'objective/kl': 70.9437255859375, 'objective/entropy': 50.47108459472656, 'objective/non_score_reward': -3.5471861362457275, 'objective/rlhf_reward': -6.370604515075684, 'objective/scores': -2.823418617248535, 'policy/approxkl_avg': 0.007376250810921192, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.015181020833551884, 'loss/value_avg': 0.6858075857162476, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9817895293235779, 'val/ratio': 0.9913259744644165, 'val/ratio_var': 5.718911052099429e-05, 'val/num_eos_tokens': 0, 'lr': 3.7481626653601177e-06, 'episode': 7556, 'epoch': 0.93}



                                                                         
 93%|█████████▎| 1890/2041 [2:44:02<13:03,  5.19s/it]

{'eps': 0, 'objective/kl': 65.22355651855469, 'objective/entropy': 52.06753158569336, 'objective/non_score_reward': -3.2611777782440186, 'objective/rlhf_reward': -5.285782814025879, 'objective/scores': -2.0246050357818604, 'policy/approxkl_avg': 0.0015604525106027722, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.017563024535775185, 'loss/value_avg': 0.3884260952472687, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9504901170730591, 'val/ratio': 0.9917373657226562, 'val/ratio_var': 6.208875129232183e-05, 'val/num_eos_tokens': 0, 'lr': 3.7236648701616853e-06, 'episode': 7560, 'epoch': 0.93}



                                                                         
 93%|█████████▎| 1891/2041 [2:44:07<12:54,  5.16s/it]

{'eps': 0, 'objective/kl': 73.64678192138672, 'objective/entropy': 66.9239501953125, 'objective/non_score_reward': -3.6823391914367676, 'objective/rlhf_reward': -5.461638927459717, 'objective/scores': -1.7792997360229492, 'policy/approxkl_avg': 0.001799469580873847, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.01675773598253727, 'loss/value_avg': 0.5350165963172913, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2857201099395752, 'val/ratio': 0.9916638731956482, 'val/ratio_var': 6.100438258727081e-05, 'val/num_eos_tokens': 0, 'lr': 3.6991670749632534e-06, 'episode': 7564, 'epoch': 0.93}



                                                                         
 93%|█████████▎| 1892/2041 [2:44:12<12:45,  5.14s/it]

{'eps': 0, 'objective/kl': 74.4547119140625, 'objective/entropy': 55.28786849975586, 'objective/non_score_reward': -3.72273588180542, 'objective/rlhf_reward': -5.145477294921875, 'objective/scores': -1.4227416515350342, 'policy/approxkl_avg': 0.0015214525628834963, 'policy/clipfrac_avg': 0.016509434208273888, 'loss/policy_avg': -0.0164157897233963, 'loss/value_avg': 0.4559653699398041, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0362112522125244, 'val/ratio': 0.9954270720481873, 'val/ratio_var': 1.9572762539610267e-05, 'val/num_eos_tokens': 0, 'lr': 3.674669279764821e-06, 'episode': 7568, 'epoch': 0.93}



                                                                         
 93%|█████████▎| 1893/2041 [2:44:17<12:44,  5.16s/it]

{'eps': 0, 'objective/kl': 80.4022216796875, 'objective/entropy': 62.42510986328125, 'objective/non_score_reward': -4.020111083984375, 'objective/rlhf_reward': -6.525445938110352, 'objective/scores': -2.5053348541259766, 'policy/approxkl_avg': 0.005316305905580521, 'policy/clipfrac_avg': 0.025943396613001823, 'loss/policy_avg': -0.020517461001873016, 'loss/value_avg': 0.5818710327148438, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1748816967010498, 'val/ratio': 0.9921546578407288, 'val/ratio_var': 4.229061960359104e-05, 'val/num_eos_tokens': 0, 'lr': 3.6501714845663894e-06, 'episode': 7572, 'epoch': 0.93}



                                                                         
 93%|█████████▎| 1894/2041 [2:44:22<12:40,  5.18s/it]

{'eps': 0, 'objective/kl': 91.27275085449219, 'objective/entropy': 52.00110626220703, 'objective/non_score_reward': -4.563637733459473, 'objective/rlhf_reward': -5.554947853088379, 'objective/scores': -0.9913098812103271, 'policy/approxkl_avg': 0.0031817820854485035, 'policy/clipfrac_avg': 0.02712264284491539, 'loss/policy_avg': -0.02053414098918438, 'loss/value_avg': 0.5557538270950317, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.986083447933197, 'val/ratio': 0.9908832311630249, 'val/ratio_var': 6.674462201772258e-05, 'val/num_eos_tokens': 0, 'lr': 3.6256736893679566e-06, 'episode': 7576, 'epoch': 0.93}



                                                                         
 93%|█████████▎| 1895/2041 [2:44:28<12:39,  5.20s/it]

{'eps': 0, 'objective/kl': 60.82110595703125, 'objective/entropy': 77.8539810180664, 'objective/non_score_reward': -3.041055202484131, 'objective/rlhf_reward': -5.659328937530518, 'objective/scores': -2.6182737350463867, 'policy/approxkl_avg': 0.0014869233127683401, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.01488510798662901, 'loss/value_avg': 0.3485107719898224, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.3121964931488037, 'val/ratio': 0.9953808784484863, 'val/ratio_var': 1.7412827219231986e-05, 'val/num_eos_tokens': 0, 'lr': 3.601175894169525e-06, 'episode': 7580, 'epoch': 0.93}



                                                                         
 93%|█████████▎| 1896/2041 [2:44:33<12:29,  5.17s/it]

{'eps': 0, 'objective/kl': 78.72278594970703, 'objective/entropy': 70.73180389404297, 'objective/non_score_reward': -3.9361393451690674, 'objective/rlhf_reward': -5.788438320159912, 'objective/scores': -1.8522990942001343, 'policy/approxkl_avg': 0.002347563859075308, 'policy/clipfrac_avg': 0.02004716917872429, 'loss/policy_avg': -0.01920463517308235, 'loss/value_avg': 0.5399792194366455, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2202181816101074, 'val/ratio': 0.9921285510063171, 'val/ratio_var': 5.148767013452016e-05, 'val/num_eos_tokens': 0, 'lr': 3.576678098971093e-06, 'episode': 7584, 'epoch': 0.93}



                                                                         
 93%|█████████▎| 1897/2041 [2:44:38<12:29,  5.20s/it]

{'eps': 0, 'objective/kl': 69.60948944091797, 'objective/entropy': 85.10818481445312, 'objective/non_score_reward': -3.4804747104644775, 'objective/rlhf_reward': -5.964972496032715, 'objective/scores': -2.4844977855682373, 'policy/approxkl_avg': 0.0015158619498834014, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.017654281109571457, 'loss/value_avg': 0.5559698939323425, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.518250823020935, 'val/ratio': 0.9980202913284302, 'val/ratio_var': 2.8853046387666836e-06, 'val/num_eos_tokens': 0, 'lr': 3.5521803037726608e-06, 'episode': 7588, 'epoch': 0.93}



                                                                         
 93%|█████████▎| 1898/2041 [2:44:43<12:22,  5.19s/it]

{'eps': 0, 'objective/kl': 65.79515838623047, 'objective/entropy': 54.502471923828125, 'objective/non_score_reward': -3.2897582054138184, 'objective/rlhf_reward': -5.260533332824707, 'objective/scores': -1.9707752466201782, 'policy/approxkl_avg': 0.005666290409862995, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.016593392938375473, 'loss/value_avg': 0.4549998939037323, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9598333239555359, 'val/ratio': 0.9918668866157532, 'val/ratio_var': 4.6115266741253436e-05, 'val/num_eos_tokens': 0, 'lr': 3.527682508574229e-06, 'episode': 7592, 'epoch': 0.93}



                                                                         
 93%|█████████▎| 1899/2041 [2:44:48<12:18,  5.20s/it]

{'eps': 0, 'objective/kl': 80.91740417480469, 'objective/entropy': 69.63258361816406, 'objective/non_score_reward': -4.045870780944824, 'objective/rlhf_reward': -5.322760581970215, 'objective/scores': -1.276889681816101, 'policy/approxkl_avg': 0.0007784833433106542, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.01678883656859398, 'loss/value_avg': 0.5961468815803528, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.354445219039917, 'val/ratio': 0.9950430989265442, 'val/ratio_var': 2.3389855414279737e-05, 'val/num_eos_tokens': 0, 'lr': 3.5031847133757964e-06, 'episode': 7596, 'epoch': 0.93}



                                                                         
 93%|█████████▎| 1900/2041 [2:44:54<12:12,  5.19s/it]

{'eps': 0, 'objective/kl': 69.0307846069336, 'objective/entropy': 52.851783752441406, 'objective/non_score_reward': -3.4515390396118164, 'objective/rlhf_reward': -5.251348972320557, 'objective/scores': -1.7998098134994507, 'policy/approxkl_avg': 0.0014288699021562934, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.014484314247965813, 'loss/value_avg': 0.3967006802558899, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9990953803062439, 'val/ratio': 0.9953446388244629, 'val/ratio_var': 1.9279937987448648e-05, 'val/num_eos_tokens': 0, 'lr': 3.4786869181773645e-06, 'episode': 7600, 'epoch': 0.93}



                                                                         
 93%|█████████▎| 1901/2041 [2:44:59<12:07,  5.19s/it]

{'eps': 0, 'objective/kl': 69.99817657470703, 'objective/entropy': 51.50182342529297, 'objective/non_score_reward': -3.499908685684204, 'objective/rlhf_reward': -6.16447639465332, 'objective/scores': -2.6645679473876953, 'policy/approxkl_avg': 0.007382730022072792, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.013975849375128746, 'loss/value_avg': 0.5002334713935852, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0130139589309692, 'val/ratio': 0.9948077201843262, 'val/ratio_var': 1.6598594811512157e-05, 'val/num_eos_tokens': 0, 'lr': 3.454189122978932e-06, 'episode': 7604, 'epoch': 0.93}



                                                                         
 93%|█████████▎| 1902/2041 [2:45:04<12:00,  5.18s/it]

{'eps': 0, 'objective/kl': 79.12898254394531, 'objective/entropy': 63.156402587890625, 'objective/non_score_reward': -3.956449031829834, 'objective/rlhf_reward': -5.959738731384277, 'objective/scores': -2.0032899379730225, 'policy/approxkl_avg': 0.004042977932840586, 'policy/clipfrac_avg': 0.03419811278581619, 'loss/policy_avg': -0.020348429679870605, 'loss/value_avg': 0.4549097418785095, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0250376462936401, 'val/ratio': 0.9960330724716187, 'val/ratio_var': 1.0980050319631118e-05, 'val/num_eos_tokens': 0, 'lr': 3.4296913277805e-06, 'episode': 7608, 'epoch': 0.93}



                                                                         
 93%|█████████▎| 1903/2041 [2:45:09<11:55,  5.19s/it]

{'eps': 0, 'objective/kl': 74.0239486694336, 'objective/entropy': 61.82233428955078, 'objective/non_score_reward': -3.701197624206543, 'objective/rlhf_reward': -5.391523361206055, 'objective/scores': -1.6903257369995117, 'policy/approxkl_avg': 0.0021100020967423916, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.0146221574395895, 'loss/value_avg': 0.4478878974914551, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1367710828781128, 'val/ratio': 0.9919312596321106, 'val/ratio_var': 5.448174852062948e-05, 'val/num_eos_tokens': 0, 'lr': 3.4051935325820678e-06, 'episode': 7612, 'epoch': 0.93}



                                                                         
 93%|█████████▎| 1904/2041 [2:45:14<11:48,  5.17s/it]

{'eps': 0, 'objective/kl': 72.506591796875, 'objective/entropy': 71.79772186279297, 'objective/non_score_reward': -3.6253299713134766, 'objective/rlhf_reward': -5.544497489929199, 'objective/scores': -1.9191672801971436, 'policy/approxkl_avg': 0.001915168366394937, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.01791367679834366, 'loss/value_avg': 0.3827255070209503, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2012805938720703, 'val/ratio': 0.9911983013153076, 'val/ratio_var': 6.687719724141061e-05, 'val/num_eos_tokens': 0, 'lr': 3.380695737383636e-06, 'episode': 7616, 'epoch': 0.93}



                                                                         
 93%|█████████▎| 1905/2041 [2:45:19<11:40,  5.15s/it]

{'eps': 0, 'objective/kl': 69.61769104003906, 'objective/entropy': 53.446754455566406, 'objective/non_score_reward': -3.480884313583374, 'objective/rlhf_reward': -5.207029342651367, 'objective/scores': -1.7261452674865723, 'policy/approxkl_avg': 0.002756427740678191, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.015509233810007572, 'loss/value_avg': 0.3917108178138733, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1128817796707153, 'val/ratio': 0.9956890344619751, 'val/ratio_var': 1.301059910474578e-05, 'val/num_eos_tokens': 0, 'lr': 3.3561979421852034e-06, 'episode': 7620, 'epoch': 0.93}



                                                                         
 93%|█████████▎| 1906/2041 [2:45:25<11:38,  5.18s/it]

{'eps': 0, 'objective/kl': 70.80914306640625, 'objective/entropy': 54.868228912353516, 'objective/non_score_reward': -3.5404574871063232, 'objective/rlhf_reward': -6.666295051574707, 'objective/scores': -3.125837802886963, 'policy/approxkl_avg': 0.0026736357249319553, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.017815500497817993, 'loss/value_avg': 0.5733992457389832, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.019921064376831, 'val/ratio': 0.9963884949684143, 'val/ratio_var': 8.144455932779238e-06, 'val/num_eos_tokens': 0, 'lr': 3.3317001469867715e-06, 'episode': 7624, 'epoch': 0.93}



                                                                         
 93%|█████████▎| 1907/2041 [2:45:30<11:34,  5.18s/it]

{'eps': 0, 'objective/kl': 70.72535705566406, 'objective/entropy': 63.26783752441406, 'objective/non_score_reward': -3.5362679958343506, 'objective/rlhf_reward': -5.44706916809082, 'objective/scores': -1.9108014106750488, 'policy/approxkl_avg': 0.0018408560426905751, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.018198588863015175, 'loss/value_avg': 0.5644335746765137, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0422756671905518, 'val/ratio': 0.9932985305786133, 'val/ratio_var': 3.981409827247262e-05, 'val/num_eos_tokens': 0, 'lr': 3.307202351788339e-06, 'episode': 7628, 'epoch': 0.93}



                                                                         
 93%|█████████▎| 1908/2041 [2:45:35<11:30,  5.19s/it]

{'eps': 0, 'objective/kl': 78.10135650634766, 'objective/entropy': 71.60523986816406, 'objective/non_score_reward': -3.9050681591033936, 'objective/rlhf_reward': -6.002856254577637, 'objective/scores': -2.097788095474243, 'policy/approxkl_avg': 0.0019247315358370543, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.015654394403100014, 'loss/value_avg': 0.5354397892951965, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1852353811264038, 'val/ratio': 0.99826979637146, 'val/ratio_var': 3.405920097065973e-06, 'val/num_eos_tokens': 0, 'lr': 3.282704556589907e-06, 'episode': 7632, 'epoch': 0.93}



                                                                         
 94%|█████████▎| 1909/2041 [2:45:40<11:28,  5.21s/it]

{'eps': 0, 'objective/kl': 77.85083770751953, 'objective/entropy': 49.18144226074219, 'objective/non_score_reward': -3.8925418853759766, 'objective/rlhf_reward': -5.464529514312744, 'objective/scores': -1.571987509727478, 'policy/approxkl_avg': 0.004229809157550335, 'policy/clipfrac_avg': 0.02358490601181984, 'loss/policy_avg': -0.0197809599339962, 'loss/value_avg': 0.5570781826972961, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.951346755027771, 'val/ratio': 0.9893169403076172, 'val/ratio_var': 8.647199865663424e-05, 'val/num_eos_tokens': 0, 'lr': 3.2582067613914748e-06, 'episode': 7636, 'epoch': 0.94}



                                                                         
 94%|█████████▎| 1910/2041 [2:45:45<11:22,  5.21s/it]

{'eps': 0, 'objective/kl': 68.48143768310547, 'objective/entropy': 41.752098083496094, 'objective/non_score_reward': -3.424072027206421, 'objective/rlhf_reward': -5.616421699523926, 'objective/scores': -2.192349910736084, 'policy/approxkl_avg': 0.0010821487521752715, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.010388895869255066, 'loss/value_avg': 0.5112835168838501, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8710812926292419, 'val/ratio': 0.9931536912918091, 'val/ratio_var': 4.559861190500669e-05, 'val/num_eos_tokens': 0, 'lr': 3.233708966193043e-06, 'episode': 7640, 'epoch': 0.94}



                                                                         
 94%|█████████▎| 1911/2041 [2:45:51<11:18,  5.22s/it]

{'eps': 0, 'objective/kl': 72.28836059570312, 'objective/entropy': 68.50647735595703, 'objective/non_score_reward': -3.6144182682037354, 'objective/rlhf_reward': -5.906013488769531, 'objective/scores': -2.291595220565796, 'policy/approxkl_avg': 0.0014239810407161713, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.01581621915102005, 'loss/value_avg': 0.4668194055557251, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2201056480407715, 'val/ratio': 0.9922361373901367, 'val/ratio_var': 5.4548498155782e-05, 'val/num_eos_tokens': 0, 'lr': 3.2092111709946104e-06, 'episode': 7644, 'epoch': 0.94}



                                                                         
 94%|█████████▎| 1912/2041 [2:45:56<11:12,  5.22s/it]

{'eps': 0, 'objective/kl': 71.19389343261719, 'objective/entropy': 64.67850494384766, 'objective/non_score_reward': -3.559694766998291, 'objective/rlhf_reward': -6.125083923339844, 'objective/scores': -2.565389394760132, 'policy/approxkl_avg': 0.0012465202016755939, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.015019511803984642, 'loss/value_avg': 0.4190545678138733, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2554304599761963, 'val/ratio': 0.9948303699493408, 'val/ratio_var': 2.189355291193351e-05, 'val/num_eos_tokens': 0, 'lr': 3.1847133757961785e-06, 'episode': 7648, 'epoch': 0.94}



                                                                         
 94%|█████████▎| 1913/2041 [2:46:01<11:05,  5.20s/it]

{'eps': 0, 'objective/kl': 66.96597290039062, 'objective/entropy': 42.19188690185547, 'objective/non_score_reward': -3.3482987880706787, 'objective/rlhf_reward': -6.19618558883667, 'objective/scores': -2.847886800765991, 'policy/approxkl_avg': 0.0029361103661358356, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.01525555644184351, 'loss/value_avg': 0.6757806539535522, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7739648222923279, 'val/ratio': 0.990989625453949, 'val/ratio_var': 6.702378595946357e-05, 'val/num_eos_tokens': 0, 'lr': 3.1602155805977465e-06, 'episode': 7652, 'epoch': 0.94}



                                                                         
 94%|█████████▍| 1914/2041 [2:46:06<11:00,  5.20s/it]

{'eps': 0, 'objective/kl': 78.25089263916016, 'objective/entropy': 58.306846618652344, 'objective/non_score_reward': -3.9125447273254395, 'objective/rlhf_reward': -5.001938819885254, 'objective/scores': -1.0893940925598145, 'policy/approxkl_avg': 0.0033887375611811876, 'policy/clipfrac_avg': 0.022405661642551422, 'loss/policy_avg': -0.020118068903684616, 'loss/value_avg': 0.4946437180042267, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2111581563949585, 'val/ratio': 0.9995689392089844, 'val/ratio_var': 1.0729439736678614e-06, 'val/num_eos_tokens': 0, 'lr': 3.135717785399314e-06, 'episode': 7656, 'epoch': 0.94}



                                                                         
 94%|█████████▍| 1915/2041 [2:46:11<10:50,  5.16s/it]

{'eps': 0, 'objective/kl': 75.48625183105469, 'objective/entropy': 51.17792510986328, 'objective/non_score_reward': -3.774312973022461, 'objective/rlhf_reward': -5.1884355545043945, 'objective/scores': -1.4141227006912231, 'policy/approxkl_avg': 0.0017146856989711523, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.015096550807356834, 'loss/value_avg': 0.48125147819519043, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1630398035049438, 'val/ratio': 0.9912383556365967, 'val/ratio_var': 6.579898035852239e-05, 'val/num_eos_tokens': 0, 'lr': 3.111219990200882e-06, 'episode': 7660, 'epoch': 0.94}



                                                                         
{'eps': 0, 'objective/kl': 71.62495422363281, 'objective/entropy': 52.69181442260742, 'objective/non_score_reward': -3.5812480449676514, 'objective/rlhf_reward': -5.4396653175354, 'objective/scores': -1.858417272567749, 'policy/approxkl_avg': 0.0007696293178014457, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.014510930515825748, 'loss/value_avg': 0.7971701622009277, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1101343631744385, 'val/ratio': 0.9994029998779297, 'val/ratio_var': 3.634434051491553e-07, 'val/num_eos_tokens': 0, 'lr': 3.08672219500245e-06, 'episode': 7664, 'epoch': 0.94}



                                                                         
{'eps': 0, 'objective/kl': 70.30165100097656, 'objective/entropy': 27.527801513671875, 'objective/non_score_reward': -3.515082836151123, 'objective/rlhf_reward': -6.108280181884766, 'objective/scores': -2.5931973457336426, 'policy/approxkl_avg': 0.0009682439849711955, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.009100816212594509, 'loss/value_avg': 0.3674679398536682, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6199281811714172, 'val/ratio': 0.9956691265106201, 'val/ratio_var': 1.853121466410812e-05, 'val/num_eos_tokens': 0, 'lr': 3.062224399804018e-06, 'episode': 7668, 'epoch': 0.94}



                                                                         
{'eps': 0, 'objective/kl': 72.1170654296875, 'objective/entropy': 53.064903259277344, 'objective/non_score_reward': -3.6058530807495117, 'objective/rlhf_reward': -6.046940326690674, 'objective/scores': -2.441087245941162, 'policy/approxkl_avg': 0.0023535764776170254, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.014747647568583488, 'loss/value_avg': 0.7052170038223267, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0058245658874512, 'val/ratio': 0.9940670728683472, 'val/ratio_var': 3.060948438360356e-05, 'val/num_eos_tokens': 0, 'lr': 3.0377266046055855e-06, 'episode': 7672, 'epoch': 0.94}



                                                                         
{'eps': 0, 'objective/kl': 69.3055419921875, 'objective/entropy': 41.47772979736328, 'objective/non_score_reward': -3.4652771949768066, 'objective/rlhf_reward': -4.743042945861816, 'objective/scores': -1.2777655124664307, 'policy/approxkl_avg': 0.0008526269812136889, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.010837284848093987, 'loss/value_avg': 0.46502381563186646, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7665920257568359, 'val/ratio': 0.9951542615890503, 'val/ratio_var': 2.12026006920496e-05, 'val/num_eos_tokens': 0, 'lr': 3.0132288094071535e-06, 'episode': 7676, 'epoch': 0.94}



                                                                         
{'eps': 0, 'objective/kl': 72.76992797851562, 'objective/entropy': 59.945655822753906, 'objective/non_score_reward': -3.6384966373443604, 'objective/rlhf_reward': -5.343736171722412, 'objective/scores': -1.7052394151687622, 'policy/approxkl_avg': 0.0014820874202996492, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.01526482030749321, 'loss/value_avg': 0.3977835178375244, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1500765085220337, 'val/ratio': 0.9954035878181458, 'val/ratio_var': 1.73872886080062e-05, 'val/num_eos_tokens': 0, 'lr': 2.9887310142087215e-06, 'episode': 7680, 'epoch': 0.94}



                                                                         
{'eps': 0, 'objective/kl': 74.35650634765625, 'objective/entropy': 43.31504821777344, 'objective/non_score_reward': -3.7178256511688232, 'objective/rlhf_reward': -6.0577287673950195, 'objective/scores': -2.3399033546447754, 'policy/approxkl_avg': 0.002272812882438302, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.013791359029710293, 'loss/value_avg': 0.6315140724182129, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7316163182258606, 'val/ratio': 0.994489312171936, 'val/ratio_var': 2.4452097932226025e-05, 'val/num_eos_tokens': 0, 'lr': 2.964233219010289e-06, 'episode': 7684, 'epoch': 0.94}



                                                                         
{'eps': 0, 'objective/kl': 71.1932144165039, 'objective/entropy': 59.27278137207031, 'objective/non_score_reward': -3.5596609115600586, 'objective/rlhf_reward': -5.890618324279785, 'objective/scores': -2.3309574127197266, 'policy/approxkl_avg': 0.000789948389865458, 'policy/clipfrac_avg': 0.010613207705318928, 'loss/policy_avg': -0.01306617259979248, 'loss/value_avg': 0.5270730257034302, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0784580707550049, 'val/ratio': 0.9958016872406006, 'val/ratio_var': 1.6368010619771667e-05, 'val/num_eos_tokens': 0, 'lr': 2.939735423811857e-06, 'episode': 7688, 'epoch': 0.94}



                                                                         
{'eps': 0, 'objective/kl': 75.33206939697266, 'objective/entropy': 47.707149505615234, 'objective/non_score_reward': -3.766603946685791, 'objective/rlhf_reward': -5.543684959411621, 'objective/scores': -1.77708101272583, 'policy/approxkl_avg': 0.0012621819041669369, 'policy/clipfrac_avg': 0.008254717104136944, 'loss/policy_avg': -0.010364308021962643, 'loss/value_avg': 0.43342411518096924, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8667013645172119, 'val/ratio': 0.9971567988395691, 'val/ratio_var': 7.942350748635363e-06, 'val/num_eos_tokens': 0, 'lr': 2.915237628613425e-06, 'episode': 7692, 'epoch': 0.94}



                                                                         
{'eps': 0, 'objective/kl': 70.823486328125, 'objective/entropy': 74.3390884399414, 'objective/non_score_reward': -3.5411744117736816, 'objective/rlhf_reward': -5.494116306304932, 'objective/scores': -1.9529420137405396, 'policy/approxkl_avg': 0.0017907265573740005, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.018279358744621277, 'loss/value_avg': 0.6809821724891663, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2643729448318481, 'val/ratio': 0.9935134649276733, 'val/ratio_var': 3.5739205486606807e-05, 'val/num_eos_tokens': 0, 'lr': 2.890739833414993e-06, 'episode': 7696, 'epoch': 0.94}



                                                                         
{'eps': 0, 'objective/kl': 70.80914306640625, 'objective/entropy': 42.063941955566406, 'objective/non_score_reward': -3.540457248687744, 'objective/rlhf_reward': -4.880904197692871, 'objective/scores': -1.340447187423706, 'policy/approxkl_avg': 0.0003314741770736873, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.008131441660225391, 'loss/value_avg': 0.3374537527561188, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8829272985458374, 'val/ratio': 0.9976305365562439, 'val/ratio_var': 5.776246325694956e-06, 'val/num_eos_tokens': 0, 'lr': 2.8662420382165605e-06, 'episode': 7700, 'epoch': 0.94}



                                                                         
{'eps': 0, 'objective/kl': 71.74172973632812, 'objective/entropy': 60.41804504394531, 'objective/non_score_reward': -3.5870866775512695, 'objective/rlhf_reward': -6.102257251739502, 'objective/scores': -2.5151705741882324, 'policy/approxkl_avg': 0.002227485878393054, 'policy/clipfrac_avg': 0.02004716917872429, 'loss/policy_avg': -0.016089674085378647, 'loss/value_avg': 0.4699697196483612, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0094759464263916, 'val/ratio': 1.0038446187973022, 'val/ratio_var': 1.4444543921854347e-05, 'val/num_eos_tokens': 0, 'lr': 2.8417442430181285e-06, 'episode': 7704, 'epoch': 0.94}



                                                                         
{'eps': 0, 'objective/kl': 62.430686950683594, 'objective/entropy': 35.0996208190918, 'objective/non_score_reward': -3.1215343475341797, 'objective/rlhf_reward': -5.04940128326416, 'objective/scores': -1.9278671741485596, 'policy/approxkl_avg': 0.0009189381962642074, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.010807742364704609, 'loss/value_avg': 0.4493727385997772, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7796667814254761, 'val/ratio': 0.9946304559707642, 'val/ratio_var': 2.8639000447583385e-05, 'val/num_eos_tokens': 0, 'lr': 2.8172464478196966e-06, 'episode': 7708, 'epoch': 0.94}



                                                                         
{'eps': 0, 'objective/kl': 73.35621643066406, 'objective/entropy': 42.393611907958984, 'objective/non_score_reward': -3.6678109169006348, 'objective/rlhf_reward': -5.815530776977539, 'objective/scores': -2.147719621658325, 'policy/approxkl_avg': 0.0005964942392893136, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.010407952591776848, 'loss/value_avg': 0.5523188710212708, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9284673929214478, 'val/ratio': 0.9965814352035522, 'val/ratio_var': 1.1546158930286765e-05, 'val/num_eos_tokens': 0, 'lr': 2.792748652621264e-06, 'episode': 7712, 'epoch': 0.94}



                                                                         
{'eps': 0, 'objective/kl': 72.96047973632812, 'objective/entropy': 62.155609130859375, 'objective/non_score_reward': -3.648024320602417, 'objective/rlhf_reward': -5.817509651184082, 'objective/scores': -2.169485330581665, 'policy/approxkl_avg': 0.0016910543199628592, 'policy/clipfrac_avg': 0.01886792480945587, 'loss/policy_avg': -0.018750309944152832, 'loss/value_avg': 0.5228846073150635, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1767908334732056, 'val/ratio': 0.9915342330932617, 'val/ratio_var': 7.084431854309514e-05, 'val/num_eos_tokens': 0, 'lr': 2.7682508574228322e-06, 'episode': 7716, 'epoch': 0.95}



                                                                         
{'eps': 0, 'objective/kl': 67.271484375, 'objective/entropy': 43.23981857299805, 'objective/non_score_reward': -3.3635740280151367, 'objective/rlhf_reward': -5.465056419372559, 'objective/scores': -2.101482629776001, 'policy/approxkl_avg': 0.0014440658269450068, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.011326814070343971, 'loss/value_avg': 0.36272406578063965, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9063724279403687, 'val/ratio': 0.9940871000289917, 'val/ratio_var': 2.7583742848946713e-05, 'val/num_eos_tokens': 0, 'lr': 2.7437530622244e-06, 'episode': 7720, 'epoch': 0.95}



                                                                         
{'eps': 0, 'objective/kl': 69.08294677734375, 'objective/entropy': 48.907798767089844, 'objective/non_score_reward': -3.4541473388671875, 'objective/rlhf_reward': -5.467194557189941, 'objective/scores': -2.013047456741333, 'policy/approxkl_avg': 0.0003751428739633411, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.007569110952317715, 'loss/value_avg': 0.40955525636672974, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9047887325286865, 'val/ratio': 0.9961526393890381, 'val/ratio_var': 1.4745418411621358e-05, 'val/num_eos_tokens': 0, 'lr': 2.719255267025968e-06, 'episode': 7724, 'epoch': 0.95}



                                                                         
{'eps': 0, 'objective/kl': 76.10928344726562, 'objective/entropy': 90.46653747558594, 'objective/non_score_reward': -3.805464267730713, 'objective/rlhf_reward': -5.0921525955200195, 'objective/scores': -1.2866885662078857, 'policy/approxkl_avg': 0.0013286308385431767, 'policy/clipfrac_avg': 0.015330188907682896, 'loss/policy_avg': -0.01776351034641266, 'loss/value_avg': 0.38686931133270264, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.7200905084609985, 'val/ratio': 0.9962930679321289, 'val/ratio_var': 1.2987122318008915e-05, 'val/num_eos_tokens': 0, 'lr': 2.6947574718275355e-06, 'episode': 7728, 'epoch': 0.95}



                                                                         
{'eps': 0, 'objective/kl': 67.93195343017578, 'objective/entropy': 42.195884704589844, 'objective/non_score_reward': -3.396597385406494, 'objective/rlhf_reward': -4.984591484069824, 'objective/scores': -1.58799409866333, 'policy/approxkl_avg': 0.0009043728350661695, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.008930610492825508, 'loss/value_avg': 0.5166791081428528, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8961188793182373, 'val/ratio': 0.9958054423332214, 'val/ratio_var': 1.4961556189518888e-05, 'val/num_eos_tokens': 0, 'lr': 2.6702596766291036e-06, 'episode': 7732, 'epoch': 0.95}



                                                                         
{'eps': 0, 'objective/kl': 73.49842834472656, 'objective/entropy': 64.96907043457031, 'objective/non_score_reward': -3.6749215126037598, 'objective/rlhf_reward': -5.282284736633301, 'objective/scores': -1.607363224029541, 'policy/approxkl_avg': 0.0005188962677493691, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.011617944575846195, 'loss/value_avg': 0.379758358001709, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.098534345626831, 'val/ratio': 0.9950171709060669, 'val/ratio_var': 2.4055601897998713e-05, 'val/num_eos_tokens': 0, 'lr': 2.645761881430671e-06, 'episode': 7736, 'epoch': 0.95}



                                                                         
{'eps': 0, 'objective/kl': 76.38514709472656, 'objective/entropy': 48.41582489013672, 'objective/non_score_reward': -3.8192572593688965, 'objective/rlhf_reward': -5.097336292266846, 'objective/scores': -1.2780789136886597, 'policy/approxkl_avg': 0.009730164892971516, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.012432019226253033, 'loss/value_avg': 0.6014459729194641, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9049723148345947, 'val/ratio': 0.9941439628601074, 'val/ratio_var': 2.2193029508343898e-05, 'val/num_eos_tokens': 0, 'lr': 2.6212640862322392e-06, 'episode': 7740, 'epoch': 0.95}



                                                                         
{'eps': 0, 'objective/kl': 68.49992370605469, 'objective/entropy': 42.85792541503906, 'objective/non_score_reward': -3.4249961376190186, 'objective/rlhf_reward': -5.3182692527771, 'objective/scores': -1.8932732343673706, 'policy/approxkl_avg': 0.002500242320820689, 'policy/clipfrac_avg': 0.015330188907682896, 'loss/policy_avg': -0.012080482207238674, 'loss/value_avg': 0.36584219336509705, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8045433759689331, 'val/ratio': 0.9955298900604248, 'val/ratio_var': 1.3644396858580876e-05, 'val/num_eos_tokens': 0, 'lr': 2.5967662910338073e-06, 'episode': 7744, 'epoch': 0.95}



                                                                         
{'eps': 0, 'objective/kl': 77.71722412109375, 'objective/entropy': 53.93552017211914, 'objective/non_score_reward': -3.885861396789551, 'objective/rlhf_reward': -5.412851333618164, 'objective/scores': -1.5269899368286133, 'policy/approxkl_avg': 0.0009680981165729463, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.008709494955837727, 'loss/value_avg': 0.41791656613349915, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.006704568862915, 'val/ratio': 0.9979104995727539, 'val/ratio_var': 3.3779981549741933e-06, 'val/num_eos_tokens': 0, 'lr': 2.572268495835375e-06, 'episode': 7748, 'epoch': 0.95}



                                                                         
{'eps': 0, 'objective/kl': 76.55186462402344, 'objective/entropy': 41.24787139892578, 'objective/non_score_reward': -3.8275935649871826, 'objective/rlhf_reward': -6.351733207702637, 'objective/scores': -2.524139881134033, 'policy/approxkl_avg': 0.0017410240834578872, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.012871419079601765, 'loss/value_avg': 0.5848039388656616, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8796653747558594, 'val/ratio': 0.9921471476554871, 'val/ratio_var': 5.497355596162379e-05, 'val/num_eos_tokens': 0, 'lr': 2.547770700636943e-06, 'episode': 7752, 'epoch': 0.95}



                                                                         
{'eps': 0, 'objective/kl': 84.62689208984375, 'objective/entropy': 46.895389556884766, 'objective/non_score_reward': -4.231344223022461, 'objective/rlhf_reward': -6.78230619430542, 'objective/scores': -2.550961971282959, 'policy/approxkl_avg': 0.001447436516173184, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.015459787100553513, 'loss/value_avg': 0.853110671043396, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8957204818725586, 'val/ratio': 0.9926751852035522, 'val/ratio_var': 4.874233854934573e-05, 'val/num_eos_tokens': 0, 'lr': 2.5232729054385106e-06, 'episode': 7756, 'epoch': 0.95}



                                                                         
{'eps': 0, 'objective/kl': 64.01856231689453, 'objective/entropy': 49.73152160644531, 'objective/non_score_reward': -3.200928211212158, 'objective/rlhf_reward': -4.73864221572876, 'objective/scores': -1.537713885307312, 'policy/approxkl_avg': 0.0022651490289717913, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.010435613803565502, 'loss/value_avg': 0.4095168113708496, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8421698808670044, 'val/ratio': 0.995995819568634, 'val/ratio_var': 1.0749347893579397e-05, 'val/num_eos_tokens': 0, 'lr': 2.4987751102400786e-06, 'episode': 7760, 'epoch': 0.95}



                                                                         
{'eps': 0, 'objective/kl': 79.37614440917969, 'objective/entropy': 45.73344421386719, 'objective/non_score_reward': -3.9688076972961426, 'objective/rlhf_reward': -5.7422356605529785, 'objective/scores': -1.773427963256836, 'policy/approxkl_avg': 0.00015989650273695588, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.005434270948171616, 'loss/value_avg': 0.3802862763404846, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9144219160079956, 'val/ratio': 0.9981136918067932, 'val/ratio_var': 3.972163540311158e-06, 'val/num_eos_tokens': 0, 'lr': 2.4742773150416462e-06, 'episode': 7764, 'epoch': 0.95}



                                                                         
{'eps': 0, 'objective/kl': 72.54335021972656, 'objective/entropy': 49.653438568115234, 'objective/non_score_reward': -3.6271677017211914, 'objective/rlhf_reward': -5.5783185958862305, 'objective/scores': -1.95115065574646, 'policy/approxkl_avg': 0.003024417208507657, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.01082928478717804, 'loss/value_avg': 0.46253520250320435, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8559839725494385, 'val/ratio': 0.9947917461395264, 'val/ratio_var': 1.992724537558388e-05, 'val/num_eos_tokens': 0, 'lr': 2.4497795198432143e-06, 'episode': 7768, 'epoch': 0.95}



                                                                         
{'eps': 0, 'objective/kl': 77.71054077148438, 'objective/entropy': 38.93933868408203, 'objective/non_score_reward': -3.8855271339416504, 'objective/rlhf_reward': -5.590425968170166, 'objective/scores': -1.7048988342285156, 'policy/approxkl_avg': 0.0014302018098533154, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.010323604568839073, 'loss/value_avg': 0.4514266550540924, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8708633184432983, 'val/ratio': 0.9976690411567688, 'val/ratio_var': 4.575695129460655e-06, 'val/num_eos_tokens': 0, 'lr': 2.4252817246447823e-06, 'episode': 7772, 'epoch': 0.95}



                                                                         
{'eps': 0, 'objective/kl': 69.83525085449219, 'objective/entropy': 53.39550018310547, 'objective/non_score_reward': -3.491762638092041, 'objective/rlhf_reward': -5.642336368560791, 'objective/scores': -2.15057373046875, 'policy/approxkl_avg': 0.0008232451509684324, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.014165204018354416, 'loss/value_avg': 0.41366010904312134, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8700777292251587, 'val/ratio': 0.99488365650177, 'val/ratio_var': 2.5370007278979756e-05, 'val/num_eos_tokens': 0, 'lr': 2.40078392944635e-06, 'episode': 7776, 'epoch': 0.95}



                                                                         
{'eps': 0, 'objective/kl': 73.28263854980469, 'objective/entropy': 55.746524810791016, 'objective/non_score_reward': -3.6641321182250977, 'objective/rlhf_reward': -5.690234184265137, 'objective/scores': -2.026102066040039, 'policy/approxkl_avg': 0.0018755581695586443, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.01339997909963131, 'loss/value_avg': 0.5576977133750916, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.100538969039917, 'val/ratio': 0.9940037727355957, 'val/ratio_var': 3.241570811951533e-05, 'val/num_eos_tokens': 0, 'lr': 2.376286134247918e-06, 'episode': 7780, 'epoch': 0.95}



                                                                         
{'eps': 0, 'objective/kl': 81.26918029785156, 'objective/entropy': 41.689109802246094, 'objective/non_score_reward': -4.0634589195251465, 'objective/rlhf_reward': -6.780422687530518, 'objective/scores': -2.716963768005371, 'policy/approxkl_avg': 0.0003458302526269108, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.008597020991146564, 'loss/value_avg': 0.5654609203338623, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8633201122283936, 'val/ratio': 0.9978262186050415, 'val/ratio_var': 4.730042292067083e-06, 'val/num_eos_tokens': 0, 'lr': 2.3517883390494856e-06, 'episode': 7784, 'epoch': 0.95}



                                                                         
{'eps': 0, 'objective/kl': 73.13661193847656, 'objective/entropy': 37.54606246948242, 'objective/non_score_reward': -3.6568307876586914, 'objective/rlhf_reward': -4.855114459991455, 'objective/scores': -1.1982836723327637, 'policy/approxkl_avg': 0.00048539225826971233, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.009311416186392307, 'loss/value_avg': 0.38741475343704224, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8052642345428467, 'val/ratio': 1.0000033378601074, 'val/ratio_var': 4.811918685732053e-08, 'val/num_eos_tokens': 0, 'lr': 2.3272905438510536e-06, 'episode': 7788, 'epoch': 0.95}



                                                                         
{'eps': 0, 'objective/kl': 80.21279907226562, 'objective/entropy': 54.67035675048828, 'objective/non_score_reward': -4.0106401443481445, 'objective/rlhf_reward': -6.2167181968688965, 'objective/scores': -2.206078052520752, 'policy/approxkl_avg': 0.001426980597898364, 'policy/clipfrac_avg': 0.008254717104136944, 'loss/policy_avg': -0.011262054555118084, 'loss/value_avg': 0.6057332754135132, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0544427633285522, 'val/ratio': 0.9942830801010132, 'val/ratio_var': 2.8606862542801537e-05, 'val/num_eos_tokens': 0, 'lr': 2.3027927486526213e-06, 'episode': 7792, 'epoch': 0.95}



                                                                         
{'eps': 0, 'objective/kl': 75.20311737060547, 'objective/entropy': 31.499164581298828, 'objective/non_score_reward': -3.7601559162139893, 'objective/rlhf_reward': -5.718549728393555, 'objective/scores': -1.9583935737609863, 'policy/approxkl_avg': 0.003197146812453866, 'policy/clipfrac_avg': 0.015330188907682896, 'loss/policy_avg': -0.012631548568606377, 'loss/value_avg': 0.4940682649612427, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.710590660572052, 'val/ratio': 0.9946004748344421, 'val/ratio_var': 2.0535129806376062e-05, 'val/num_eos_tokens': 0, 'lr': 2.2782949534541893e-06, 'episode': 7796, 'epoch': 0.96}



                                                                         
{'eps': 0, 'objective/kl': 75.74034118652344, 'objective/entropy': 59.56580352783203, 'objective/non_score_reward': -3.787017345428467, 'objective/rlhf_reward': -4.818406105041504, 'objective/scores': -1.0313888788223267, 'policy/approxkl_avg': 0.0026667434722185135, 'policy/clipfrac_avg': 0.012971698306500912, 'loss/policy_avg': -0.012873757630586624, 'loss/value_avg': 0.5300283432006836, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.081935167312622, 'val/ratio': 0.9970628023147583, 'val/ratio_var': 6.316786766547011e-06, 'val/num_eos_tokens': 0, 'lr': 2.253797158255757e-06, 'episode': 7800, 'epoch': 0.96}



                                                                         
{'eps': 0, 'objective/kl': 76.02796936035156, 'objective/entropy': 62.070804595947266, 'objective/non_score_reward': -3.801398277282715, 'objective/rlhf_reward': -5.651394844055176, 'objective/scores': -1.849996566772461, 'policy/approxkl_avg': 0.0012592619750648737, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.014367947354912758, 'loss/value_avg': 0.4736461639404297, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1421847343444824, 'val/ratio': 0.9930731058120728, 'val/ratio_var': 4.3260475649731234e-05, 'val/num_eos_tokens': 0, 'lr': 2.229299363057325e-06, 'episode': 7804, 'epoch': 0.96}



                                                                         
{'eps': 0, 'objective/kl': 86.88285827636719, 'objective/entropy': 47.619850158691406, 'objective/non_score_reward': -4.344143390655518, 'objective/rlhf_reward': -6.135832786560059, 'objective/scores': -1.7916896343231201, 'policy/approxkl_avg': 0.01490689441561699, 'policy/clipfrac_avg': 0.021226417273283005, 'loss/policy_avg': -0.016534095630049706, 'loss/value_avg': 0.8736189007759094, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.000260591506958, 'val/ratio': 0.9922336935997009, 'val/ratio_var': 3.277876021456905e-05, 'val/num_eos_tokens': 0, 'lr': 2.2048015678588926e-06, 'episode': 7808, 'epoch': 0.96}



                                                                         
{'eps': 0, 'objective/kl': 74.96441650390625, 'objective/entropy': 46.9361572265625, 'objective/non_score_reward': -3.748220682144165, 'objective/rlhf_reward': -5.830073356628418, 'objective/scores': -2.081852912902832, 'policy/approxkl_avg': 0.0006953248521313071, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.006982335355132818, 'loss/value_avg': 0.4418685734272003, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9320777654647827, 'val/ratio': 0.9961951971054077, 'val/ratio_var': 1.233595594385406e-05, 'val/num_eos_tokens': 0, 'lr': 2.1803037726604606e-06, 'episode': 7812, 'epoch': 0.96}



                                                                         
{'eps': 0, 'objective/kl': 83.79133605957031, 'objective/entropy': 66.25581359863281, 'objective/non_score_reward': -4.189566612243652, 'objective/rlhf_reward': -5.982664108276367, 'objective/scores': -1.7930972576141357, 'policy/approxkl_avg': 0.0009665484540164471, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.011866769753396511, 'loss/value_avg': 0.6183918714523315, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2038941383361816, 'val/ratio': 0.9947927594184875, 'val/ratio_var': 2.3689421141170897e-05, 'val/num_eos_tokens': 0, 'lr': 2.1558059774620287e-06, 'episode': 7816, 'epoch': 0.96}



                                                                         
 96%|█████████▌| 1955/2041 [2:49:39<07:30,  5.23s/it]

{'eps': 0, 'objective/kl': 82.23591613769531, 'objective/entropy': 61.93468475341797, 'objective/non_score_reward': -4.111795425415039, 'objective/rlhf_reward': -5.252646446228027, 'objective/scores': -1.1408510208129883, 'policy/approxkl_avg': 0.0020387982949614525, 'policy/clipfrac_avg': 0.021226415410637856, 'loss/policy_avg': -0.014196665026247501, 'loss/value_avg': 0.6878931522369385, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.040649175643921, 'val/ratio': 1.0018519163131714, 'val/ratio_var': 2.1159521566005424e-06, 'val/num_eos_tokens': 0, 'lr': 2.1313081822635963e-06, 'episode': 7820, 'epoch': 0.96}



                                                                         
 96%|█████████▌| 1956/2041 [2:49:44<07:23,  5.22s/it]

{'eps': 0, 'objective/kl': 69.92025756835938, 'objective/entropy': 46.422889709472656, 'objective/non_score_reward': -3.4960126876831055, 'objective/rlhf_reward': -5.268501281738281, 'objective/scores': -1.7724888324737549, 'policy/approxkl_avg': 0.0006060118321329355, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.007335192058235407, 'loss/value_avg': 0.3921567499637604, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8637878894805908, 'val/ratio': 0.9962745904922485, 'val/ratio_var': 1.2979328857909422e-05, 'val/num_eos_tokens': 0, 'lr': 2.1068103870651643e-06, 'episode': 7824, 'epoch': 0.96}



                                                                         
 96%|█████████▌| 1957/2041 [2:49:49<07:16,  5.20s/it]

{'eps': 0, 'objective/kl': 72.40794372558594, 'objective/entropy': 59.30681228637695, 'objective/non_score_reward': -3.6203973293304443, 'objective/rlhf_reward': -5.286391258239746, 'objective/scores': -1.6659936904907227, 'policy/approxkl_avg': 0.0007821962935850024, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.00949025433510542, 'loss/value_avg': 0.34832268953323364, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.063618540763855, 'val/ratio': 1.0006808042526245, 'val/ratio_var': 5.535031277759117e-07, 'val/num_eos_tokens': 0, 'lr': 2.082312591866732e-06, 'episode': 7828, 'epoch': 0.96}



                                                                         
 96%|█████████▌| 1958/2041 [2:49:55<07:10,  5.18s/it]

{'eps': 0, 'objective/kl': 84.38408660888672, 'objective/entropy': 54.007816314697266, 'objective/non_score_reward': -4.219203948974609, 'objective/rlhf_reward': -4.938819885253906, 'objective/scores': -0.7196157574653625, 'policy/approxkl_avg': 0.0016073703300207853, 'policy/clipfrac_avg': 0.014150943607091904, 'loss/policy_avg': -0.012570644728839397, 'loss/value_avg': 0.6042832136154175, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9369310140609741, 'val/ratio': 0.9998281002044678, 'val/ratio_var': 5.597900099019171e-08, 'val/num_eos_tokens': 0, 'lr': 2.0578147966683e-06, 'episode': 7832, 'epoch': 0.96}



                                                                         
 96%|█████████▌| 1959/2041 [2:50:00<07:07,  5.21s/it]

{'eps': 0, 'objective/kl': 69.75642395019531, 'objective/entropy': 33.873985290527344, 'objective/non_score_reward': -3.487821340560913, 'objective/rlhf_reward': -5.955068111419678, 'objective/scores': -2.4672467708587646, 'policy/approxkl_avg': 0.0005022491095587611, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.006167822517454624, 'loss/value_avg': 0.49738356471061707, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8530060648918152, 'val/ratio': 1.000487208366394, 'val/ratio_var': 3.1078394613359706e-07, 'val/num_eos_tokens': 0, 'lr': 2.033317001469868e-06, 'episode': 7836, 'epoch': 0.96}



                                                                         
 96%|█████████▌| 1960/2041 [2:50:05<07:00,  5.19s/it]

{'eps': 0, 'objective/kl': 84.52783203125, 'objective/entropy': 46.88513946533203, 'objective/non_score_reward': -4.226391315460205, 'objective/rlhf_reward': -5.1636433601379395, 'objective/scores': -0.9372519254684448, 'policy/approxkl_avg': 0.0002977212716359645, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.009231407195329666, 'loss/value_avg': 0.40993422269821167, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8973489999771118, 'val/ratio': 0.9970359206199646, 'val/ratio_var': 8.904321475711185e-06, 'val/num_eos_tokens': 0, 'lr': 2.0088192062714357e-06, 'episode': 7840, 'epoch': 0.96}



                                                                         
 96%|█████████▌| 1961/2041 [2:50:10<06:56,  5.21s/it]

{'eps': 0, 'objective/kl': 71.0831527709961, 'objective/entropy': 71.40336608886719, 'objective/non_score_reward': -3.5541577339172363, 'objective/rlhf_reward': -6.116015434265137, 'objective/scores': -2.5618579387664795, 'policy/approxkl_avg': 0.0005051718326285481, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.013658770360052586, 'loss/value_avg': 0.5027354955673218, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2147294282913208, 'val/ratio': 0.9978095293045044, 'val/ratio_var': 4.345522484072717e-06, 'val/num_eos_tokens': 0, 'lr': 1.9843214110730037e-06, 'episode': 7844, 'epoch': 0.96}



                                                                         
 96%|█████████▌| 1962/2041 [2:50:15<06:48,  5.17s/it]

{'eps': 0, 'objective/kl': 69.0877685546875, 'objective/entropy': 52.95707702636719, 'objective/non_score_reward': -3.4543886184692383, 'objective/rlhf_reward': -5.837648391723633, 'objective/scores': -2.3832597732543945, 'policy/approxkl_avg': 0.0006621956126764417, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.012269074097275734, 'loss/value_avg': 0.43687182664871216, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1076040267944336, 'val/ratio': 0.9953669905662537, 'val/ratio_var': 1.9772602172452025e-05, 'val/num_eos_tokens': 0, 'lr': 1.9598236158745713e-06, 'episode': 7848, 'epoch': 0.96}



                                                                         
 96%|█████████▌| 1963/2041 [2:50:21<06:43,  5.17s/it]

{'eps': 0, 'objective/kl': 83.6279296875, 'objective/entropy': 25.79566192626953, 'objective/non_score_reward': -4.181396961212158, 'objective/rlhf_reward': -5.698692798614502, 'objective/scores': -1.5172957181930542, 'policy/approxkl_avg': 0.000954213144723326, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.006441298872232437, 'loss/value_avg': 0.5823459625244141, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6548262238502502, 'val/ratio': 0.9971888661384583, 'val/ratio_var': 6.417502390831942e-06, 'val/num_eos_tokens': 0, 'lr': 1.9353258206761394e-06, 'episode': 7852, 'epoch': 0.96}



                                                                         
 96%|█████████▌| 1964/2041 [2:50:26<06:39,  5.19s/it]

{'eps': 0, 'objective/kl': 77.81965637207031, 'objective/entropy': 37.04108428955078, 'objective/non_score_reward': -3.8909828662872314, 'objective/rlhf_reward': -5.545097351074219, 'objective/scores': -1.6541142463684082, 'policy/approxkl_avg': 0.00026849444839172065, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.005642420146614313, 'loss/value_avg': 0.3679177761077881, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.787569522857666, 'val/ratio': 1.0009212493896484, 'val/ratio_var': 9.60989268605772e-07, 'val/num_eos_tokens': 0, 'lr': 1.910828025477707e-06, 'episode': 7856, 'epoch': 0.96}



                                                                         
 96%|█████████▋| 1965/2041 [2:50:31<06:36,  5.22s/it]

{'eps': 0, 'objective/kl': 70.8751220703125, 'objective/entropy': 33.745826721191406, 'objective/non_score_reward': -3.5437557697296143, 'objective/rlhf_reward': -5.801055908203125, 'objective/scores': -2.2572999000549316, 'policy/approxkl_avg': 0.00019963634258601815, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.0042712208814918995, 'loss/value_avg': 0.4769388437271118, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6838123202323914, 'val/ratio': 0.9980627298355103, 'val/ratio_var': 4.080314283783082e-06, 'val/num_eos_tokens': 0, 'lr': 1.8863302302792748e-06, 'episode': 7860, 'epoch': 0.96}



                                                                         
 96%|█████████▋| 1966/2041 [2:50:36<06:31,  5.21s/it]

{'eps': 0, 'objective/kl': 74.97138977050781, 'objective/entropy': 45.77111053466797, 'objective/non_score_reward': -3.748569965362549, 'objective/rlhf_reward': -5.636338710784912, 'objective/scores': -1.8877688646316528, 'policy/approxkl_avg': 0.0004877423634752631, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.008243518881499767, 'loss/value_avg': 0.3931151032447815, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9018633961677551, 'val/ratio': 0.997713565826416, 'val/ratio_var': 4.6514664973074105e-06, 'val/num_eos_tokens': 0, 'lr': 1.8618324350808427e-06, 'episode': 7864, 'epoch': 0.96}



                                                                         
 96%|█████████▋| 1967/2041 [2:50:41<06:25,  5.21s/it]

{'eps': 0, 'objective/kl': 74.46833801269531, 'objective/entropy': 50.447479248046875, 'objective/non_score_reward': -3.723417043685913, 'objective/rlhf_reward': -5.463277339935303, 'objective/scores': -1.7398602962493896, 'policy/approxkl_avg': 0.00014632169040851295, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.005643513053655624, 'loss/value_avg': 0.41089892387390137, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8978420495986938, 'val/ratio': 0.9989308714866638, 'val/ratio_var': 1.241586460309918e-06, 'val/num_eos_tokens': 0, 'lr': 1.8373346398824105e-06, 'episode': 7868, 'epoch': 0.96}



                                                                         
 96%|█████████▋| 1968/2041 [2:50:47<06:18,  5.18s/it]

{'eps': 0, 'objective/kl': 67.29789733886719, 'objective/entropy': 73.33692169189453, 'objective/non_score_reward': -3.3648948669433594, 'objective/rlhf_reward': -5.065920352935791, 'objective/scores': -1.7010254859924316, 'policy/approxkl_avg': 0.00022326508769765496, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.007238707505166531, 'loss/value_avg': 0.38623178005218506, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1654324531555176, 'val/ratio': 0.9981689453125, 'val/ratio_var': 3.2852085496415384e-06, 'val/num_eos_tokens': 0, 'lr': 1.8128368446839783e-06, 'episode': 7872, 'epoch': 0.96}



                                                                         
 96%|█████████▋| 1969/2041 [2:50:52<06:13,  5.18s/it]

{'eps': 0, 'objective/kl': 71.91739654541016, 'objective/entropy': 50.1339225769043, 'objective/non_score_reward': -3.595869779586792, 'objective/rlhf_reward': -5.755502223968506, 'objective/scores': -2.159632444381714, 'policy/approxkl_avg': 0.0010786661878228188, 'policy/clipfrac_avg': 0.008254717104136944, 'loss/policy_avg': -0.012022498063743114, 'loss/value_avg': 0.4316760301589966, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.937964916229248, 'val/ratio': 0.9948354959487915, 'val/ratio_var': 2.43231188505888e-05, 'val/num_eos_tokens': 0, 'lr': 1.7883390494855466e-06, 'episode': 7876, 'epoch': 0.96}



                                                                         
 97%|█████████▋| 1970/2041 [2:50:57<06:07,  5.17s/it]

{'eps': 0, 'objective/kl': 74.61112976074219, 'objective/entropy': 45.30206298828125, 'objective/non_score_reward': -3.7305564880371094, 'objective/rlhf_reward': -5.423900604248047, 'objective/scores': -1.6933441162109375, 'policy/approxkl_avg': 0.00012671765580307692, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.004736524540930986, 'loss/value_avg': 0.4379976689815521, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8793219923973083, 'val/ratio': 0.9989532232284546, 'val/ratio_var': 8.958792818702932e-07, 'val/num_eos_tokens': 0, 'lr': 1.7638412542871144e-06, 'episode': 7880, 'epoch': 0.97}



                                                                         
 97%|█████████▋| 1971/2041 [2:51:02<06:02,  5.18s/it]

{'eps': 0, 'objective/kl': 74.16422271728516, 'objective/entropy': 55.431976318359375, 'objective/non_score_reward': -3.7082109451293945, 'objective/rlhf_reward': -6.376572608947754, 'objective/scores': -2.6683616638183594, 'policy/approxkl_avg': 0.0013741111615672708, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.00961175374686718, 'loss/value_avg': 0.756164014339447, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0759342908859253, 'val/ratio': 0.9965392351150513, 'val/ratio_var': 8.874486411514226e-06, 'val/num_eos_tokens': 0, 'lr': 1.7393434590886822e-06, 'episode': 7884, 'epoch': 0.97}



                                                                         
 97%|█████████▋| 1972/2041 [2:51:07<05:56,  5.17s/it]

{'eps': 0, 'objective/kl': 77.81869506835938, 'objective/entropy': 41.216400146484375, 'objective/non_score_reward': -3.890934705734253, 'objective/rlhf_reward': -5.277246475219727, 'objective/scores': -1.3863117694854736, 'policy/approxkl_avg': 0.0007475065067410469, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.005902003962546587, 'loss/value_avg': 0.4776083827018738, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.836827278137207, 'val/ratio': 0.9963716268539429, 'val/ratio_var': 1.2314582818362396e-05, 'val/num_eos_tokens': 0, 'lr': 1.71484566389025e-06, 'episode': 7888, 'epoch': 0.97}



                                                                         
 97%|█████████▋| 1973/2041 [2:51:12<05:49,  5.15s/it]

{'eps': 0, 'objective/kl': 83.66363525390625, 'objective/entropy': 36.869049072265625, 'objective/non_score_reward': -4.1831817626953125, 'objective/rlhf_reward': -6.379825115203857, 'objective/scores': -2.196643352508545, 'policy/approxkl_avg': 0.0006845114403404295, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.008976737968623638, 'loss/value_avg': 0.6646486520767212, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7851974368095398, 'val/ratio': 0.9967522621154785, 'val/ratio_var': 9.496820894128177e-06, 'val/num_eos_tokens': 0, 'lr': 1.690347868691818e-06, 'episode': 7892, 'epoch': 0.97}



                                                                         
 97%|█████████▋| 1974/2041 [2:51:17<05:45,  5.15s/it]

{'eps': 0, 'objective/kl': 77.07911682128906, 'objective/entropy': 34.642696380615234, 'objective/non_score_reward': -3.8539562225341797, 'objective/rlhf_reward': -5.540227890014648, 'objective/scores': -1.6862714290618896, 'policy/approxkl_avg': 0.0001541763631394133, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.004860302433371544, 'loss/value_avg': 0.45709431171417236, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.749174952507019, 'val/ratio': 0.9998299479484558, 'val/ratio_var': 1.7157747222995567e-08, 'val/num_eos_tokens': 0, 'lr': 1.6658500734933857e-06, 'episode': 7896, 'epoch': 0.97}



                                                                         
 97%|█████████▋| 1975/2041 [2:51:23<05:37,  5.12s/it]

{'eps': 0, 'objective/kl': 72.48989868164062, 'objective/entropy': 47.33900451660156, 'objective/non_score_reward': -3.624495267868042, 'objective/rlhf_reward': -6.02289342880249, 'objective/scores': -2.3983981609344482, 'policy/approxkl_avg': 0.0002706176310312003, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.007135093677788973, 'loss/value_avg': 0.4546189606189728, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8919152021408081, 'val/ratio': 0.9965785145759583, 'val/ratio_var': 1.1803746019722894e-05, 'val/num_eos_tokens': 0, 'lr': 1.6413522782949536e-06, 'episode': 7900, 'epoch': 0.97}



                                                                         
 97%|█████████▋| 1976/2041 [2:51:28<05:30,  5.09s/it]

{'eps': 0, 'objective/kl': 71.82549285888672, 'objective/entropy': 48.989463806152344, 'objective/non_score_reward': -3.5912747383117676, 'objective/rlhf_reward': -5.271299839019775, 'objective/scores': -1.6800251007080078, 'policy/approxkl_avg': 0.00022647969308309257, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.006740591488778591, 'loss/value_avg': 0.4251894950866699, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9441099166870117, 'val/ratio': 0.998050332069397, 'val/ratio_var': 3.632569359979243e-06, 'val/num_eos_tokens': 0, 'lr': 1.6168544830965214e-06, 'episode': 7904, 'epoch': 0.97}



                                                                         
 97%|█████████▋| 1977/2041 [2:51:33<05:26,  5.10s/it]

{'eps': 0, 'objective/kl': 67.32705688476562, 'objective/entropy': 29.24245834350586, 'objective/non_score_reward': -3.3663525581359863, 'objective/rlhf_reward': -4.58964729309082, 'objective/scores': -1.223294734954834, 'policy/approxkl_avg': 0.00012838574184570462, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.004193789791315794, 'loss/value_avg': 0.3128817081451416, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6813809871673584, 'val/ratio': 0.9994000196456909, 'val/ratio_var': 3.4381002933514537e-07, 'val/num_eos_tokens': 0, 'lr': 1.5923566878980892e-06, 'episode': 7908, 'epoch': 0.97}



                                                                         
 97%|█████████▋| 1978/2041 [2:51:38<05:20,  5.08s/it]

{'eps': 0, 'objective/kl': 65.73899841308594, 'objective/entropy': 39.863609313964844, 'objective/non_score_reward': -3.28695011138916, 'objective/rlhf_reward': -5.403286457061768, 'objective/scores': -2.1163363456726074, 'policy/approxkl_avg': 0.0005195607664063573, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.007050542160868645, 'loss/value_avg': 0.7712222337722778, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7820638418197632, 'val/ratio': 0.9977020025253296, 'val/ratio_var': 4.751001142722089e-06, 'val/num_eos_tokens': 0, 'lr': 1.567858892699657e-06, 'episode': 7912, 'epoch': 0.97}



                                                                         
 97%|█████████▋| 1979/2041 [2:51:43<05:13,  5.05s/it]

{'eps': 0, 'objective/kl': 77.71870422363281, 'objective/entropy': 46.94035339355469, 'objective/non_score_reward': -3.8859353065490723, 'objective/rlhf_reward': -6.266748428344727, 'objective/scores': -2.3808131217956543, 'policy/approxkl_avg': 0.00048402094398625195, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.007168839685618877, 'loss/value_avg': 0.531376302242279, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8357783555984497, 'val/ratio': 1.0000709295272827, 'val/ratio_var': 1.0734215294405658e-07, 'val/num_eos_tokens': 0, 'lr': 1.543361097501225e-06, 'episode': 7916, 'epoch': 0.97}



                                                                         
 97%|█████████▋| 1980/2041 [2:51:48<05:08,  5.06s/it]

{'eps': 0, 'objective/kl': 72.73822021484375, 'objective/entropy': 47.634071350097656, 'objective/non_score_reward': -3.636911153793335, 'objective/rlhf_reward': -5.733490943908691, 'objective/scores': -2.0965800285339355, 'policy/approxkl_avg': 0.0003274301707278937, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.00680613424628973, 'loss/value_avg': 0.5041186809539795, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9329339265823364, 'val/ratio': 0.9956656694412231, 'val/ratio_var': 1.9018518287339248e-05, 'val/num_eos_tokens': 0, 'lr': 1.5188633023027927e-06, 'episode': 7920, 'epoch': 0.97}



                                                                         
 97%|█████████▋| 1981/2041 [2:51:53<05:05,  5.09s/it]

{'eps': 0, 'objective/kl': 72.126953125, 'objective/entropy': 41.02149963378906, 'objective/non_score_reward': -3.6063475608825684, 'objective/rlhf_reward': -6.1250200271606445, 'objective/scores': -2.518672466278076, 'policy/approxkl_avg': 0.0005226752255111933, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.006299189291894436, 'loss/value_avg': 0.5799193382263184, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6919794082641602, 'val/ratio': 0.9975650310516357, 'val/ratio_var': 5.34370019522612e-06, 'val/num_eos_tokens': 0, 'lr': 1.4943655071043608e-06, 'episode': 7924, 'epoch': 0.97}



                                                                         
 97%|█████████▋| 1982/2041 [2:51:58<04:58,  5.07s/it]

{'eps': 0, 'objective/kl': 72.41890716552734, 'objective/entropy': 62.761016845703125, 'objective/non_score_reward': -3.620945453643799, 'objective/rlhf_reward': -5.076658248901367, 'objective/scores': -1.4557126760482788, 'policy/approxkl_avg': 0.00047073469613678753, 'policy/clipfrac_avg': 0.008254717104136944, 'loss/policy_avg': -0.009546618908643723, 'loss/value_avg': 0.37217724323272705, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0841991901397705, 'val/ratio': 0.9984308481216431, 'val/ratio_var': 2.310425543328165e-06, 'val/num_eos_tokens': 0, 'lr': 1.4698677119059286e-06, 'episode': 7928, 'epoch': 0.97}



                                                                         
 97%|█████████▋| 1983/2041 [2:52:03<04:54,  5.07s/it]

{'eps': 0, 'objective/kl': 89.86825561523438, 'objective/entropy': 52.41600799560547, 'objective/non_score_reward': -4.493412971496582, 'objective/rlhf_reward': -6.230943202972412, 'objective/scores': -1.73753023147583, 'policy/approxkl_avg': 0.001165634486824274, 'policy/clipfrac_avg': 0.009433962404727936, 'loss/policy_avg': -0.012184644117951393, 'loss/value_avg': 0.5996905565261841, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0059887170791626, 'val/ratio': 0.9988069534301758, 'val/ratio_var': 2.1097830540384166e-06, 'val/num_eos_tokens': 0, 'lr': 1.4453699167074964e-06, 'episode': 7932, 'epoch': 0.97}



                                                                         
 97%|█████████▋| 1984/2041 [2:52:08<04:48,  5.07s/it]

{'eps': 0, 'objective/kl': 75.71775817871094, 'objective/entropy': 32.179927825927734, 'objective/non_score_reward': -3.7858877182006836, 'objective/rlhf_reward': -5.162524223327637, 'objective/scores': -1.3766365051269531, 'policy/approxkl_avg': 0.0012976168654859066, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.005949062295258045, 'loss/value_avg': 0.5161148309707642, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6859638690948486, 'val/ratio': 0.9966278076171875, 'val/ratio_var': 8.40182929096045e-06, 'val/num_eos_tokens': 0, 'lr': 1.4208721215090643e-06, 'episode': 7936, 'epoch': 0.97}



                                                                         
 97%|█████████▋| 1985/2041 [2:52:13<04:43,  5.07s/it]

{'eps': 0, 'objective/kl': 67.07148742675781, 'objective/entropy': 39.63185119628906, 'objective/non_score_reward': -3.35357403755188, 'objective/rlhf_reward': -4.393538475036621, 'objective/scores': -1.039964199066162, 'policy/approxkl_avg': 0.0005272791022434831, 'policy/clipfrac_avg': 0.007075471803545952, 'loss/policy_avg': -0.008238790556788445, 'loss/value_avg': 0.4438188970088959, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8550430536270142, 'val/ratio': 0.9985954761505127, 'val/ratio_var': 1.850386411206273e-06, 'val/num_eos_tokens': 0, 'lr': 1.396374326310632e-06, 'episode': 7940, 'epoch': 0.97}



                                                                         
 97%|█████████▋| 1986/2041 [2:52:18<04:38,  5.06s/it]

{'eps': 0, 'objective/kl': 63.85596466064453, 'objective/entropy': 36.93379211425781, 'objective/non_score_reward': -3.192798137664795, 'objective/rlhf_reward': -4.701782703399658, 'objective/scores': -1.5089845657348633, 'policy/approxkl_avg': 0.00011051051114918664, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.005336263217031956, 'loss/value_avg': 0.39782601594924927, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7110075950622559, 'val/ratio': 0.9980454444885254, 'val/ratio_var': 3.5395187296671793e-06, 'val/num_eos_tokens': 0, 'lr': 1.3718765311122e-06, 'episode': 7944, 'epoch': 0.97}



                                                                         
 97%|█████████▋| 1987/2041 [2:52:23<04:32,  5.04s/it]

{'eps': 0, 'objective/kl': 77.15980529785156, 'objective/entropy': 39.76350784301758, 'objective/non_score_reward': -3.857990026473999, 'objective/rlhf_reward': -6.123495578765869, 'objective/scores': -2.26550555229187, 'policy/approxkl_avg': 7.53160857129842e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.003108985023573041, 'loss/value_avg': 0.6726380586624146, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8291096687316895, 'val/ratio': 0.998741090297699, 'val/ratio_var': 1.650788362894673e-06, 'val/num_eos_tokens': 0, 'lr': 1.3473787359137678e-06, 'episode': 7948, 'epoch': 0.97}



                                                                         
 97%|█████████▋| 1988/2041 [2:52:28<04:29,  5.08s/it]

{'eps': 0, 'objective/kl': 73.66862487792969, 'objective/entropy': 53.490509033203125, 'objective/non_score_reward': -3.683431386947632, 'objective/rlhf_reward': -5.2089972496032715, 'objective/scores': -1.52556574344635, 'policy/approxkl_avg': 0.0012041551526635885, 'policy/clipfrac_avg': 0.01179245300590992, 'loss/policy_avg': -0.008154693059623241, 'loss/value_avg': 0.6193750500679016, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8668850660324097, 'val/ratio': 0.9965634942054749, 'val/ratio_var': 9.052282621269114e-06, 'val/num_eos_tokens': 0, 'lr': 1.3228809407153356e-06, 'episode': 7952, 'epoch': 0.97}



                                                                         
 97%|█████████▋| 1989/2041 [2:52:33<04:23,  5.06s/it]

{'eps': 0, 'objective/kl': 78.36337280273438, 'objective/entropy': 43.221832275390625, 'objective/non_score_reward': -3.918168306350708, 'objective/rlhf_reward': -5.870867729187012, 'objective/scores': -1.9526991844177246, 'policy/approxkl_avg': 8.696261647855863e-05, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.002823543967679143, 'loss/value_avg': 0.5330685377120972, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8586670160293579, 'val/ratio': 1.0002816915512085, 'val/ratio_var': 4.763144900721272e-08, 'val/num_eos_tokens': 0, 'lr': 1.2983831455169036e-06, 'episode': 7956, 'epoch': 0.97}



                                                                         
 98%|█████████▊| 1990/2041 [2:52:38<04:17,  5.05s/it]

{'eps': 0, 'objective/kl': 73.08611297607422, 'objective/entropy': 55.36656951904297, 'objective/non_score_reward': -3.6543054580688477, 'objective/rlhf_reward': -5.824368476867676, 'objective/scores': -2.170063018798828, 'policy/approxkl_avg': 0.00035073209437541664, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.009644989855587482, 'loss/value_avg': 0.5357506275177002, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8941912651062012, 'val/ratio': 0.9978443384170532, 'val/ratio_var': 4.473538410820765e-06, 'val/num_eos_tokens': 0, 'lr': 1.2738853503184715e-06, 'episode': 7960, 'epoch': 0.98}



                                                                         
 98%|█████████▊| 1991/2041 [2:52:44<04:13,  5.07s/it]

{'eps': 0, 'objective/kl': 68.75375366210938, 'objective/entropy': 52.746551513671875, 'objective/non_score_reward': -3.437687873840332, 'objective/rlhf_reward': -5.95832633972168, 'objective/scores': -2.5206384658813477, 'policy/approxkl_avg': 0.0003506179782561958, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.00923666451126337, 'loss/value_avg': 0.5780375003814697, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9654182195663452, 'val/ratio': 0.9980669021606445, 'val/ratio_var': 3.604210178309586e-06, 'val/num_eos_tokens': 0, 'lr': 1.2493875551200393e-06, 'episode': 7964, 'epoch': 0.98}



                                                                         
 98%|█████████▊| 1992/2041 [2:52:49<04:08,  5.07s/it]

{'eps': 0, 'objective/kl': 81.0240478515625, 'objective/entropy': 51.06901550292969, 'objective/non_score_reward': -4.051202774047852, 'objective/rlhf_reward': -5.318042755126953, 'objective/scores': -1.2668399810791016, 'policy/approxkl_avg': 0.000350169517332688, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.007919521071016788, 'loss/value_avg': 0.5183614492416382, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0299866199493408, 'val/ratio': 0.9994654655456543, 'val/ratio_var': 2.3996506115508964e-07, 'val/num_eos_tokens': 0, 'lr': 1.2248897599216071e-06, 'episode': 7968, 'epoch': 0.98}



                                                                         
 98%|█████████▊| 1993/2041 [2:52:54<04:01,  5.03s/it]

{'eps': 0, 'objective/kl': 78.537109375, 'objective/entropy': 39.73344421386719, 'objective/non_score_reward': -3.9268558025360107, 'objective/rlhf_reward': -5.949764251708984, 'objective/scores': -2.0229082107543945, 'policy/approxkl_avg': 5.598470306722447e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.002126801759004593, 'loss/value_avg': 0.5649391412734985, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.752139687538147, 'val/ratio': 0.9986909627914429, 'val/ratio_var': 1.6761040342316846e-06, 'val/num_eos_tokens': 0, 'lr': 1.200391964723175e-06, 'episode': 7972, 'epoch': 0.98}
 98%|█████████▊| 1994/2041 [2:52:59<03:58,  5.07s/it]

{'eps': 0, 'objective/kl': 73.29306030273438, 'objective/entropy': 65.23606872558594, 'objective/non_score_reward': -3.6646528244018555, 'objective/rlhf_reward': -5.7671098709106445, 'objective/scores': -2.102457046508789, 'policy/approxkl_avg': 0.00027043104637414217, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.010389216244220734, 'loss/value_avg': 0.4864853620529175, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0543162822723389, 'val/ratio': 0.9983808994293213, 'val/ratio_var': 2.3685681753704557e-06, 'val/num_eos_tokens': 0, 'lr': 1.1758941695247428e-06, 'episode': 7976, 'epoch': 0.98}
 98%|█████████▊| 1995/2041 [2:53:04<03:53,  5.07s/it]

{'eps': 0, 'objective/kl': 70.54035186767578, 'objective/entropy': 33.37078857421875, 'objective/non_score_reward': -3.527017593383789, 'objective/rlhf_reward': -5.178847789764404, 'objective/scores': -1.6518301963806152, 'policy/approxkl_avg': 6.620016210945323e-05, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.0022663322743028402, 'loss/value_avg': 0.4093402624130249, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7218071222305298, 'val/ratio': 0.9990858435630798, 'val/ratio_var': 8.139912210936018e-07, 'val/num_eos_tokens': 0, 'lr': 1.1513963743263106e-06, 'episode': 7980, 'epoch': 0.98}
 98%|█████████▊| 1996/2041 [2:53:09<03:47,  5.05s/it]

{'eps': 0, 'objective/kl': 72.96913146972656, 'objective/entropy': 51.82139587402344, 'objective/non_score_reward': -3.6484568119049072, 'objective/rlhf_reward': -6.075906753540039, 'objective/scores': -2.427449941635132, 'policy/approxkl_avg': 0.00019004066416528076, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.005167210008949041, 'loss/value_avg': 0.5576829314231873, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.003253698348999, 'val/ratio': 0.9969505071640015, 'val/ratio_var': 9.216655598720536e-06, 'val/num_eos_tokens': 0, 'lr': 1.1268985791278785e-06, 'episode': 7984, 'epoch': 0.98}
 98%|█████████▊| 1997/2041 [2:53:14<03:42,  5.06s/it]

{'eps': 0, 'objective/kl': 73.83067321777344, 'objective/entropy': 43.91339111328125, 'objective/non_score_reward': -3.6915338039398193, 'objective/rlhf_reward': -5.839438438415527, 'objective/scores': -2.147904396057129, 'policy/approxkl_avg': 0.0006280748057179153, 'policy/clipfrac_avg': 0.004716981202363968, 'loss/policy_avg': -0.00607440248131752, 'loss/value_avg': 0.5342835187911987, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8485981822013855, 'val/ratio': 0.9973958134651184, 'val/ratio_var': 6.160773864394287e-06, 'val/num_eos_tokens': 0, 'lr': 1.1024007839294463e-06, 'episode': 7988, 'epoch': 0.98}
 98%|█████████▊| 1998/2041 [2:53:19<03:37,  5.05s/it]

{'eps': 0, 'objective/kl': 75.2479476928711, 'objective/entropy': 46.60773849487305, 'objective/non_score_reward': -3.762397527694702, 'objective/rlhf_reward': -5.989917755126953, 'objective/scores': -2.22752046585083, 'policy/approxkl_avg': 0.00016211361798923463, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.005389190744608641, 'loss/value_avg': 0.4427396059036255, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.914292573928833, 'val/ratio': 0.9990767240524292, 'val/ratio_var': 8.404869618061639e-07, 'val/num_eos_tokens': 0, 'lr': 1.0779029887310143e-06, 'episode': 7992, 'epoch': 0.98}
 98%|█████████▊| 1999/2041 [2:53:24<03:31,  5.04s/it]

{'eps': 0, 'objective/kl': 79.14717102050781, 'objective/entropy': 35.636619567871094, 'objective/non_score_reward': -3.9573588371276855, 'objective/rlhf_reward': -5.314541339874268, 'objective/scores': -1.3571823835372925, 'policy/approxkl_avg': 4.394997449708171e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.002148719970136881, 'loss/value_avg': 0.35949110984802246, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6880813241004944, 'val/ratio': 1.000187635421753, 'val/ratio_var': 2.442171620486988e-08, 'val/num_eos_tokens': 0, 'lr': 1.0534051935325822e-06, 'episode': 7996, 'epoch': 0.98}
 98%|█████████▊| 2000/2041 [2:53:29<03:26,  5.03s/it]

{'eps': 0, 'objective/kl': 81.55775451660156, 'objective/entropy': 64.94529724121094, 'objective/non_score_reward': -4.077887535095215, 'objective/rlhf_reward': -5.664158344268799, 'objective/scores': -1.586270809173584, 'policy/approxkl_avg': 0.0007507667178288102, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.007726353593170643, 'loss/value_avg': 0.40545958280563354, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0561200380325317, 'val/ratio': 0.9966843128204346, 'val/ratio_var': 9.379825314681511e-06, 'val/num_eos_tokens': 0, 'lr': 1.02890739833415e-06, 'episode': 8000, 'epoch': 0.98}
 98%|█████████▊| 2001/2041 [2:53:52<07:00, 10.51s/it]

{'eps': 0, 'objective/kl': 83.8127670288086, 'objective/entropy': 41.718605041503906, 'objective/non_score_reward': -4.190638542175293, 'objective/rlhf_reward': -6.250845909118652, 'objective/scores': -2.0602073669433594, 'policy/approxkl_avg': 0.00016515892639290541, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.00642442749813199, 'loss/value_avg': 0.9642190933227539, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.909037172794342, 'val/ratio': 0.9980159997940063, 'val/ratio_var': 4.200488092465093e-06, 'val/num_eos_tokens': 0, 'lr': 1.0044096031357178e-06, 'episode': 8004, 'epoch': 0.98}
 98%|█████████▊| 2002/2041 [2:53:57<05:46,  8.88s/it]

{'eps': 0, 'objective/kl': 64.92723083496094, 'objective/entropy': 49.01142883300781, 'objective/non_score_reward': -3.246361255645752, 'objective/rlhf_reward': -5.595562934875488, 'objective/scores': -2.3492014408111572, 'policy/approxkl_avg': 4.8664729547454044e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.002967663574963808, 'loss/value_avg': 0.41702258586883545, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7688495516777039, 'val/ratio': 0.9984942674636841, 'val/ratio_var': 2.3146365037973737e-06, 'val/num_eos_tokens': 0, 'lr': 9.799118079372857e-07, 'episode': 8008, 'epoch': 0.98}
 98%|█████████▊| 2003/2041 [2:54:02<04:53,  7.72s/it]

{'eps': 0, 'objective/kl': 70.32791137695312, 'objective/entropy': 38.24301528930664, 'objective/non_score_reward': -3.5163958072662354, 'objective/rlhf_reward': -5.99047327041626, 'objective/scores': -2.4740774631500244, 'policy/approxkl_avg': 6.164890510262921e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.00357325398363173, 'loss/value_avg': 0.6311167478561401, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9047173261642456, 'val/ratio': 0.9997984170913696, 'val/ratio_var': 4.265585218377055e-08, 'val/num_eos_tokens': 0, 'lr': 9.554140127388535e-07, 'episode': 8012, 'epoch': 0.98}
 98%|█████████▊| 2004/2041 [2:54:07<04:15,  6.92s/it]

{'eps': 0, 'objective/kl': 73.50724792480469, 'objective/entropy': 39.556785583496094, 'objective/non_score_reward': -3.6753625869750977, 'objective/rlhf_reward': -5.493093013763428, 'objective/scores': -1.81773042678833, 'policy/approxkl_avg': 9.853790834313259e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.004380056168884039, 'loss/value_avg': 0.40174439549446106, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.818933367729187, 'val/ratio': 0.9988755583763123, 'val/ratio_var': 1.2883941735708504e-06, 'val/num_eos_tokens': 0, 'lr': 9.309162175404213e-07, 'episode': 8016, 'epoch': 0.98}
 98%|█████████▊| 2005/2041 [2:54:12<03:49,  6.38s/it]

{'eps': 0, 'objective/kl': 95.94831848144531, 'objective/entropy': 53.63697814941406, 'objective/non_score_reward': -4.797415733337402, 'objective/rlhf_reward': -6.435074806213379, 'objective/scores': -1.6376591920852661, 'policy/approxkl_avg': 0.00016429404786322266, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.00492909736931324, 'loss/value_avg': 0.5599507093429565, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9928528070449829, 'val/ratio': 0.9982805252075195, 'val/ratio_var': 2.7687256078934297e-06, 'val/num_eos_tokens': 0, 'lr': 9.064184223419892e-07, 'episode': 8020, 'epoch': 0.98}
 98%|█████████▊| 2006/2041 [2:54:18<03:29,  6.00s/it]

{'eps': 0, 'objective/kl': 78.13520812988281, 'objective/entropy': 35.20056915283203, 'objective/non_score_reward': -3.9067604541778564, 'objective/rlhf_reward': -6.001277923583984, 'objective/scores': -2.094517707824707, 'policy/approxkl_avg': 3.804029984166846e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0024944795295596123, 'loss/value_avg': 0.4099545478820801, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7380735874176025, 'val/ratio': 0.9987477660179138, 'val/ratio_var': 1.501109636592446e-06, 'val/num_eos_tokens': 0, 'lr': 8.819206271435572e-07, 'episode': 8024, 'epoch': 0.98}
 98%|█████████▊| 2007/2041 [2:54:23<03:13,  5.70s/it]

{'eps': 0, 'objective/kl': 73.73357391357422, 'objective/entropy': 33.872684478759766, 'objective/non_score_reward': -3.686678409576416, 'objective/rlhf_reward': -5.2649664878845215, 'objective/scores': -1.578287959098816, 'policy/approxkl_avg': 0.00016815272101666778, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.004212504252791405, 'loss/value_avg': 0.6036030054092407, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7262141108512878, 'val/ratio': 0.9996941089630127, 'val/ratio_var': 5.884628251351387e-08, 'val/num_eos_tokens': 0, 'lr': 8.57422831945125e-07, 'episode': 8028, 'epoch': 0.98}
 98%|█████████▊| 2008/2041 [2:54:28<03:01,  5.51s/it]

{'eps': 0, 'objective/kl': 80.586181640625, 'objective/entropy': 42.9520149230957, 'objective/non_score_reward': -4.029308795928955, 'objective/rlhf_reward': -5.254807472229004, 'objective/scores': -1.2254985570907593, 'policy/approxkl_avg': 0.0002893826167564839, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.005649810191243887, 'loss/value_avg': 0.5794149041175842, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.756348729133606, 'val/ratio': 0.9992795586585999, 'val/ratio_var': 4.658621435282839e-07, 'val/num_eos_tokens': 0, 'lr': 8.329250367466929e-07, 'episode': 8032, 'epoch': 0.98}
 98%|█████████▊| 2009/2041 [2:54:33<02:51,  5.35s/it]

{'eps': 0, 'objective/kl': 73.32323455810547, 'objective/entropy': 30.75933265686035, 'objective/non_score_reward': -3.6661620140075684, 'objective/rlhf_reward': -5.768486976623535, 'objective/scores': -2.102324962615967, 'policy/approxkl_avg': 7.283619197551161e-05, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.0026892106980085373, 'loss/value_avg': 0.49511975049972534, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6143912076950073, 'val/ratio': 0.9994540214538574, 'val/ratio_var': 3.5164214295946294e-07, 'val/num_eos_tokens': 0, 'lr': 8.084272415482607e-07, 'episode': 8036, 'epoch': 0.98}
 98%|█████████▊| 2010/2041 [2:54:38<02:44,  5.30s/it]

{'eps': 0, 'objective/kl': 70.7081298828125, 'objective/entropy': 44.83247375488281, 'objective/non_score_reward': -3.5354061126708984, 'objective/rlhf_reward': -4.8685832023620605, 'objective/scores': -1.333177089691162, 'policy/approxkl_avg': 6.10944043728523e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.003464464098215103, 'loss/value_avg': 0.3097565174102783, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8303588032722473, 'val/ratio': 0.998598575592041, 'val/ratio_var': 1.7668203327048104e-06, 'val/num_eos_tokens': 0, 'lr': 7.839294463498285e-07, 'episode': 8040, 'epoch': 0.98}
 99%|█████████▊| 2011/2041 [2:54:43<02:36,  5.22s/it]

{'eps': 0, 'objective/kl': 74.97991943359375, 'objective/entropy': 68.07878875732422, 'objective/non_score_reward': -3.748995780944824, 'objective/rlhf_reward': -5.431756973266602, 'objective/scores': -1.6827609539031982, 'policy/approxkl_avg': 0.0001921166985994205, 'policy/clipfrac_avg': 0.002358490601181984, 'loss/policy_avg': -0.004879315849393606, 'loss/value_avg': 0.5204850435256958, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1337318420410156, 'val/ratio': 0.9987343549728394, 'val/ratio_var': 1.4277796935857623e-06, 'val/num_eos_tokens': 0, 'lr': 7.594316511513964e-07, 'episode': 8044, 'epoch': 0.99}
 99%|█████████▊| 2012/2041 [2:54:48<02:29,  5.16s/it]

{'eps': 0, 'objective/kl': 58.98881530761719, 'objective/entropy': 44.70813751220703, 'objective/non_score_reward': -2.9494409561157227, 'objective/rlhf_reward': -4.517769813537598, 'objective/scores': -1.568329095840454, 'policy/approxkl_avg': 3.947194272768684e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0031017367728054523, 'loss/value_avg': 0.5387308597564697, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8243812322616577, 'val/ratio': 0.9991626143455505, 'val/ratio_var': 6.895632509440475e-07, 'val/num_eos_tokens': 0, 'lr': 7.349338559529643e-07, 'episode': 8048, 'epoch': 0.99}
 99%|█████████▊| 2013/2041 [2:54:53<02:23,  5.13s/it]

{'eps': 0, 'objective/kl': 73.18675231933594, 'objective/entropy': 32.46815490722656, 'objective/non_score_reward': -3.6593375205993652, 'objective/rlhf_reward': -5.470082759857178, 'objective/scores': -1.8107452392578125, 'policy/approxkl_avg': 0.00010679903789423406, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.003044989425688982, 'loss/value_avg': 0.4813574552536011, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7094371914863586, 'val/ratio': 0.9988017082214355, 'val/ratio_var': 1.535643150418764e-06, 'val/num_eos_tokens': 0, 'lr': 7.104360607545321e-07, 'episode': 8052, 'epoch': 0.99}
 99%|█████████▊| 2014/2041 [2:54:58<02:17,  5.11s/it]

{'eps': 0, 'objective/kl': 81.44820404052734, 'objective/entropy': 51.44951248168945, 'objective/non_score_reward': -4.0724101066589355, 'objective/rlhf_reward': -5.3060302734375, 'objective/scores': -1.2336199283599854, 'policy/approxkl_avg': 7.446697418345138e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0033111392986029387, 'loss/value_avg': 0.6034978032112122, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9304559826850891, 'val/ratio': 0.9991277456283569, 'val/ratio_var': 7.984498893165437e-07, 'val/num_eos_tokens': 0, 'lr': 6.859382655561e-07, 'episode': 8056, 'epoch': 0.99}
 99%|█████████▊| 2015/2041 [2:55:03<02:12,  5.09s/it]

{'eps': 0, 'objective/kl': 85.9107666015625, 'objective/entropy': 37.82524108886719, 'objective/non_score_reward': -4.295538425445557, 'objective/rlhf_reward': -6.020638942718506, 'objective/scores': -1.7251003980636597, 'policy/approxkl_avg': 0.001472181873396039, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.0064523955807089806, 'loss/value_avg': 0.4436637759208679, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7925921678543091, 'val/ratio': 0.9987662434577942, 'val/ratio_var': 7.672497304156423e-07, 'val/num_eos_tokens': 0, 'lr': 6.614404703576678e-07, 'episode': 8060, 'epoch': 0.99}
 99%|█████████▉| 2016/2041 [2:55:08<02:07,  5.09s/it]

{'eps': 0, 'objective/kl': 77.97235107421875, 'objective/entropy': 41.28663635253906, 'objective/non_score_reward': -3.898617744445801, 'objective/rlhf_reward': -6.230435371398926, 'objective/scores': -2.331817626953125, 'policy/approxkl_avg': 0.000420893426053226, 'policy/clipfrac_avg': 0.00589622650295496, 'loss/policy_avg': -0.006400011479854584, 'loss/value_avg': 0.598868191242218, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7960430383682251, 'val/ratio': 0.997618556022644, 'val/ratio_var': 5.098858764540637e-06, 'val/num_eos_tokens': 0, 'lr': 6.369426751592357e-07, 'episode': 8064, 'epoch': 0.99}
 99%|█████████▉| 2017/2041 [2:55:13<02:01,  5.08s/it]

{'eps': 0, 'objective/kl': 70.48750305175781, 'objective/entropy': 41.64976119995117, 'objective/non_score_reward': -3.5243756771087646, 'objective/rlhf_reward': -5.659143447875977, 'objective/scores': -2.134768009185791, 'policy/approxkl_avg': 0.0002740073250606656, 'policy/clipfrac_avg': 0.003537735901772976, 'loss/policy_avg': -0.004799139220267534, 'loss/value_avg': 0.5475343465805054, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8142159581184387, 'val/ratio': 0.9980433583259583, 'val/ratio_var': 3.6353987979964586e-06, 'val/num_eos_tokens': 0, 'lr': 6.124448799608036e-07, 'episode': 8068, 'epoch': 0.99}
 99%|█████████▉| 2018/2041 [2:55:18<01:57,  5.09s/it]

{'eps': 0, 'objective/kl': 70.11293029785156, 'objective/entropy': 53.99004364013672, 'objective/non_score_reward': -3.5056467056274414, 'objective/rlhf_reward': -5.9852142333984375, 'objective/scores': -2.479567527770996, 'policy/approxkl_avg': 3.425556133151986e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0022215601056814194, 'loss/value_avg': 0.462357759475708, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9425147771835327, 'val/ratio': 0.9995597004890442, 'val/ratio_var': 2.512230992124387e-07, 'val/num_eos_tokens': 0, 'lr': 5.879470847623714e-07, 'episode': 8072, 'epoch': 0.99}
 99%|█████████▉| 2019/2041 [2:55:23<01:51,  5.08s/it]

{'eps': 0, 'objective/kl': 69.03730773925781, 'objective/entropy': 40.62755584716797, 'objective/non_score_reward': -3.4518656730651855, 'objective/rlhf_reward': -5.501435279846191, 'objective/scores': -2.049569606781006, 'policy/approxkl_avg': 3.1085131922736764e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0017534910002723336, 'loss/value_avg': 0.6716415286064148, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8104658126831055, 'val/ratio': 0.9993164539337158, 'val/ratio_var': 4.887201043857203e-07, 'val/num_eos_tokens': 0, 'lr': 5.634492895639392e-07, 'episode': 8076, 'epoch': 0.99}
 99%|█████████▉| 2020/2041 [2:55:28<01:46,  5.07s/it]

{'eps': 0, 'objective/kl': 86.29629516601562, 'objective/entropy': 60.958656311035156, 'objective/non_score_reward': -4.314814567565918, 'objective/rlhf_reward': -6.571977615356445, 'objective/scores': -2.2571630477905273, 'policy/approxkl_avg': 1.639982838241849e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0016851408872753382, 'loss/value_avg': 0.5757343769073486, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.030452847480774, 'val/ratio': 0.9989868402481079, 'val/ratio_var': 1.1142955145260203e-06, 'val/num_eos_tokens': 0, 'lr': 5.389514943655072e-07, 'episode': 8080, 'epoch': 0.99}
 99%|█████████▉| 2021/2041 [2:55:33<01:41,  5.05s/it]

{'eps': 0, 'objective/kl': 80.23001098632812, 'objective/entropy': 45.73780059814453, 'objective/non_score_reward': -4.011500358581543, 'objective/rlhf_reward': -5.5562896728515625, 'objective/scores': -1.5447895526885986, 'policy/approxkl_avg': 3.351437771925703e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0018452703952789307, 'loss/value_avg': 0.43650728464126587, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.778225839138031, 'val/ratio': 0.9990633726119995, 'val/ratio_var': 9.398181646247394e-07, 'val/num_eos_tokens': 0, 'lr': 5.14453699167075e-07, 'episode': 8084, 'epoch': 0.99}
 99%|█████████▉| 2022/2041 [2:55:39<01:36,  5.07s/it]

{'eps': 0, 'objective/kl': 78.8607177734375, 'objective/entropy': 45.756195068359375, 'objective/non_score_reward': -3.943035840988159, 'objective/rlhf_reward': -5.477147102355957, 'objective/scores': -1.5341112613677979, 'policy/approxkl_avg': 7.421626651193947e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0037185607943683863, 'loss/value_avg': 0.4477267265319824, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7677114009857178, 'val/ratio': 0.9993763566017151, 'val/ratio_var': 4.072205399552331e-07, 'val/num_eos_tokens': 0, 'lr': 4.899559039686428e-07, 'episode': 8088, 'epoch': 0.99}
 99%|█████████▉| 2023/2041 [2:55:44<01:30,  5.04s/it]

{'eps': 0, 'objective/kl': 70.73947143554688, 'objective/entropy': 69.16984558105469, 'objective/non_score_reward': -3.536973476409912, 'objective/rlhf_reward': -6.423503875732422, 'objective/scores': -2.886530637741089, 'policy/approxkl_avg': 1.895625246106647e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0021153627894818783, 'loss/value_avg': 0.6013898849487305, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.2668030261993408, 'val/ratio': 0.999127984046936, 'val/ratio_var': 8.374229878427286e-07, 'val/num_eos_tokens': 0, 'lr': 4.6545810877021066e-07, 'episode': 8092, 'epoch': 0.99}
 99%|█████████▉| 2024/2041 [2:55:49<01:25,  5.05s/it]

{'eps': 0, 'objective/kl': 75.22254943847656, 'objective/entropy': 64.11949920654297, 'objective/non_score_reward': -3.761127471923828, 'objective/rlhf_reward': -5.852355003356934, 'objective/scores': -2.0912275314331055, 'policy/approxkl_avg': 1.3096746442897711e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0016596709610894322, 'loss/value_avg': 0.3495543599128723, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1213030815124512, 'val/ratio': 0.9996469020843506, 'val/ratio_var': 1.375147036242197e-07, 'val/num_eos_tokens': 0, 'lr': 4.409603135717786e-07, 'episode': 8096, 'epoch': 0.99}
 99%|█████████▉| 2025/2041 [2:55:54<01:20,  5.06s/it]

{'eps': 0, 'objective/kl': 77.7135238647461, 'objective/entropy': 40.43712615966797, 'objective/non_score_reward': -3.885676383972168, 'objective/rlhf_reward': -5.82901668548584, 'objective/scores': -1.9433403015136719, 'policy/approxkl_avg': 4.5009084715275094e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0030032508075237274, 'loss/value_avg': 0.6748709678649902, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7985562086105347, 'val/ratio': 0.9989005327224731, 'val/ratio_var': 1.3434537322609685e-06, 'val/num_eos_tokens': 0, 'lr': 4.1646251837334643e-07, 'episode': 8100, 'epoch': 0.99}
 99%|█████████▉| 2026/2041 [2:55:59<01:16,  5.08s/it]

{'eps': 0, 'objective/kl': 62.62528610229492, 'objective/entropy': 37.83195877075195, 'objective/non_score_reward': -3.1312644481658936, 'objective/rlhf_reward': -5.150638580322266, 'objective/scores': -2.019374132156372, 'policy/approxkl_avg': 3.259625373175368e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.002793386345729232, 'loss/value_avg': 0.34301134943962097, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7064077854156494, 'val/ratio': 0.9997491836547852, 'val/ratio_var': 7.115669831136984e-08, 'val/num_eos_tokens': 0, 'lr': 3.9196472317491427e-07, 'episode': 8104, 'epoch': 0.99}
 99%|█████████▉| 2027/2041 [2:56:04<01:10,  5.04s/it]

{'eps': 0, 'objective/kl': 70.22508239746094, 'objective/entropy': 33.53533172607422, 'objective/non_score_reward': -3.511254072189331, 'objective/rlhf_reward': -4.738747596740723, 'objective/scores': -1.2274932861328125, 'policy/approxkl_avg': 5.145538307260722e-05, 'policy/clipfrac_avg': 0.001179245300590992, 'loss/policy_avg': -0.0026192597579210997, 'loss/value_avg': 0.3867725729942322, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7263220548629761, 'val/ratio': 0.9998767375946045, 'val/ratio_var': 1.1361134255594152e-08, 'val/num_eos_tokens': 0, 'lr': 3.6746692797648215e-07, 'episode': 8108, 'epoch': 0.99}
 99%|█████████▉| 2028/2041 [2:56:09<01:05,  5.02s/it]

{'eps': 0, 'objective/kl': 72.47077178955078, 'objective/entropy': 33.30823516845703, 'objective/non_score_reward': -3.6235387325286865, 'objective/rlhf_reward': -5.669700622558594, 'objective/scores': -2.0461618900299072, 'policy/approxkl_avg': 2.4404986106674187e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0018626298988237977, 'loss/value_avg': 0.5526549816131592, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6630791425704956, 'val/ratio': 0.9990702867507935, 'val/ratio_var': 8.604756089880539e-07, 'val/num_eos_tokens': 0, 'lr': 3.4296913277805e-07, 'episode': 8112, 'epoch': 0.99}
 99%|█████████▉| 2029/2041 [2:56:14<01:00,  5.01s/it]

{'eps': 0, 'objective/kl': 72.96831512451172, 'objective/entropy': 42.059383392333984, 'objective/non_score_reward': -3.6484155654907227, 'objective/rlhf_reward': -5.533504962921143, 'objective/scores': -1.88508939743042, 'policy/approxkl_avg': 2.641152241267264e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0018892495427280664, 'loss/value_avg': 0.3717541992664337, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8767770528793335, 'val/ratio': 0.9998252391815186, 'val/ratio_var': 3.0640155301853156e-08, 'val/num_eos_tokens': 0, 'lr': 3.1847133757961787e-07, 'episode': 8116, 'epoch': 0.99}
 99%|█████████▉| 2030/2041 [2:56:19<00:55,  5.03s/it]

{'eps': 0, 'objective/kl': 77.05134582519531, 'objective/entropy': 54.80491638183594, 'objective/non_score_reward': -3.852566719055176, 'objective/rlhf_reward': -5.171484470367432, 'objective/scores': -1.3189177513122559, 'policy/approxkl_avg': 7.83031737228157e-06, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0012863429728895426, 'loss/value_avg': 0.5098556280136108, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.0817914009094238, 'val/ratio': 0.9996342658996582, 'val/ratio_var': 1.4080374910463433e-07, 'val/num_eos_tokens': 0, 'lr': 2.939735423811857e-07, 'episode': 8120, 'epoch': 0.99}
100%|█████████▉| 2031/2041 [2:56:24<00:50,  5.04s/it]

{'eps': 0, 'objective/kl': 68.50839233398438, 'objective/entropy': 56.215850830078125, 'objective/non_score_reward': -3.425419807434082, 'objective/rlhf_reward': -5.143769264221191, 'objective/scores': -1.7183496952056885, 'policy/approxkl_avg': 2.6448973585502245e-05, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0022787144407629967, 'loss/value_avg': 0.45287877321243286, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.02052640914917, 'val/ratio': 0.9992791414260864, 'val/ratio_var': 5.364930188989092e-07, 'val/num_eos_tokens': 0, 'lr': 2.694757471827536e-07, 'episode': 8124, 'epoch': 1.0}
100%|█████████▉| 2032/2041 [2:56:29<00:45,  5.06s/it]

{'eps': 0, 'objective/kl': 74.81159973144531, 'objective/entropy': 51.66099548339844, 'objective/non_score_reward': -3.7405803203582764, 'objective/rlhf_reward': -5.428926467895508, 'objective/scores': -1.6883459091186523, 'policy/approxkl_avg': 4.088744390173815e-06, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0007731256191618741, 'loss/value_avg': 0.3953794240951538, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9983621835708618, 'val/ratio': 0.9996333122253418, 'val/ratio_var': 1.4478492005309818e-07, 'val/num_eos_tokens': 0, 'lr': 2.449779519843214e-07, 'episode': 8128, 'epoch': 1.0}



                                                                         
100%|█████████▉| 2033/2041 [2:56:34<00:40,  5.04s/it][A

{'eps': 0, 'objective/kl': 65.33705139160156, 'objective/entropy': 36.61632537841797, 'objective/non_score_reward': -3.266852855682373, 'objective/rlhf_reward': -5.832955360412598, 'objective/scores': -2.5661025047302246, 'policy/approxkl_avg': 5.873169811820844e-06, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0007927411934360862, 'loss/value_avg': 0.33122485876083374, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7275556325912476, 'val/ratio': 0.9996542930603027, 'val/ratio_var': 1.1993348891792266e-07, 'val/num_eos_tokens': 0, 'lr': 2.204801567858893e-07, 'episode': 8132, 'epoch': 1.0}



                                                                         
100%|█████████▉| 2034/2041 [2:56:39<00:35,  5.05s/it][A

{'eps': 0, 'objective/kl': 77.41938781738281, 'objective/entropy': 27.52259063720703, 'objective/non_score_reward': -3.870969533920288, 'objective/rlhf_reward': -5.98647928237915, 'objective/scores': -2.1155097484588623, 'policy/approxkl_avg': 9.11321080820926e-07, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0002525746822357178, 'loss/value_avg': 0.5389542579650879, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.649689793586731, 'val/ratio': 0.9998997449874878, 'val/ratio_var': 1.1213913353458338e-08, 'val/num_eos_tokens': 0, 'lr': 1.9598236158745713e-07, 'episode': 8136, 'epoch': 1.0}



                                                                         
100%|█████████▉| 2035/2041 [2:56:44<00:30,  5.02s/it][A

{'eps': 0, 'objective/kl': 75.1282730102539, 'objective/entropy': 34.492706298828125, 'objective/non_score_reward': -3.756413459777832, 'objective/rlhf_reward': -5.076320648193359, 'objective/scores': -1.3199071884155273, 'policy/approxkl_avg': 3.559165634214878e-06, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0007391436956822872, 'loss/value_avg': 0.4464362561702728, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.7087276577949524, 'val/ratio': 1.000015139579773, 'val/ratio_var': 1.3284544297942347e-10, 'val/num_eos_tokens': 0, 'lr': 1.71484566389025e-07, 'episode': 8140, 'epoch': 1.0}



                                                                         
100%|█████████▉| 2036/2041 [2:56:49<00:25,  5.02s/it][A

{'eps': 0, 'objective/kl': 82.67264556884766, 'objective/entropy': 59.795013427734375, 'objective/non_score_reward': -4.133632659912109, 'objective/rlhf_reward': -6.0726237297058105, 'objective/scores': -1.9389910697937012, 'policy/approxkl_avg': 3.0550941119145136e-06, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0005738307954743505, 'loss/value_avg': 0.5194435119628906, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.098046064376831, 'val/ratio': 0.9996829032897949, 'val/ratio_var': 9.117079713405474e-08, 'val/num_eos_tokens': 0, 'lr': 1.4698677119059285e-07, 'episode': 8144, 'epoch': 1.0}



                                                                         
100%|█████████▉| 2037/2041 [2:56:54<00:20,  5.07s/it][A

{'eps': 0, 'objective/kl': 80.68869018554688, 'objective/entropy': 46.25664520263672, 'objective/non_score_reward': -4.0344343185424805, 'objective/rlhf_reward': -5.910708904266357, 'objective/scores': -1.876274585723877, 'policy/approxkl_avg': 6.896489594510058e-06, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0008943867869675159, 'loss/value_avg': 0.5148804187774658, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.8007837533950806, 'val/ratio': 0.9995821714401245, 'val/ratio_var': 1.8686534986045444e-07, 'val/num_eos_tokens': 0, 'lr': 1.224889759921607e-07, 'episode': 8148, 'epoch': 1.0}



                                                                         
100%|█████████▉| 2038/2041 [2:56:59<00:15,  5.06s/it][A

{'eps': 0, 'objective/kl': 74.70988464355469, 'objective/entropy': 35.3535270690918, 'objective/non_score_reward': -3.7354941368103027, 'objective/rlhf_reward': -5.225647926330566, 'objective/scores': -1.4901535511016846, 'policy/approxkl_avg': 7.590336394969199e-07, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.00027915299870073795, 'loss/value_avg': 0.4568171799182892, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.698249101638794, 'val/ratio': 0.9999566078186035, 'val/ratio_var': 2.4721849012365738e-09, 'val/num_eos_tokens': 0, 'lr': 9.799118079372857e-08, 'episode': 8152, 'epoch': 1.0}



                                                                         
100%|█████████▉| 2039/2041 [2:57:04<00:10,  5.03s/it][A

{'eps': 0, 'objective/kl': 72.410888671875, 'objective/entropy': 30.513431549072266, 'objective/non_score_reward': -3.62054443359375, 'objective/rlhf_reward': -5.3525824546813965, 'objective/scores': -1.732037901878357, 'policy/approxkl_avg': 2.156647553874791e-07, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.0001469243288738653, 'loss/value_avg': 0.4162747263908386, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.6549552083015442, 'val/ratio': 0.99993497133255, 'val/ratio_var': 4.127179042967555e-09, 'val/num_eos_tokens': 0, 'lr': 7.349338559529642e-08, 'episode': 8156, 'epoch': 1.0}



                                                                         
100%|█████████▉| 2040/2041 [2:57:09<00:05,  5.04s/it][A

{'eps': 0, 'objective/kl': 68.78968811035156, 'objective/entropy': 74.5024185180664, 'objective/non_score_reward': -3.439484119415283, 'objective/rlhf_reward': -5.880460739135742, 'objective/scores': -2.440976619720459, 'policy/approxkl_avg': 3.3185577308358916e-07, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.00030352710746228695, 'loss/value_avg': 0.3537227213382721, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.1898162364959717, 'val/ratio': 0.9999421834945679, 'val/ratio_var': 3.45312511917939e-09, 'val/num_eos_tokens': 0, 'lr': 4.899559039686428e-08, 'episode': 8160, 'epoch': 1.0}



                                                                         
100%|██████████| 2041/2041 [2:57:14<00:00,  5.02s/it][A

{'eps': 0, 'objective/kl': 75.11905670166016, 'objective/entropy': 55.39173889160156, 'objective/non_score_reward': -3.755953073501587, 'objective/rlhf_reward': -5.915444374084473, 'objective/scores': -2.159491539001465, 'policy/approxkl_avg': 1.1910705666196009e-07, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': -0.00016523082740604877, 'loss/value_avg': 0.4490823745727539, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 0.9708170890808105, 'val/ratio': 0.9999620914459229, 'val/ratio_var': 1.4465791897677605e-09, 'val/num_eos_tokens': 0, 'lr': 2.449779519843214e-08, 'episode': 8164, 'epoch': 1.0}


100%|██████████| 2041/2041 [2:57:37<00:00,  5.22s/it]


Let's look at how the answers changed. You may well not see as strong a change as after DPO: PPO needs far more compute and carefully tuned hyperparameters, and it is generally less stable.
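Before comparing generations, it can help to sanity-check the logged metrics. The sketch below assumes the trainer uses the common KL-penalized reward, where the non-score reward is the negative KL term scaled by a coefficient and the RLHF reward is the score plus that penalty. The numbers are copied from the log line with `'episode': 8120`; the variable names are illustrative only.

```python
# Values copied from the training log line with 'episode': 8120.
log = {
    "objective/kl": 77.05134582519531,
    "objective/non_score_reward": -3.852566719055176,
    "objective/scores": -1.3189177513122559,
    "objective/rlhf_reward": -5.171484470367432,
}

# Assumption: non_score_reward = -kl_coef * kl, so the coefficient can be recovered.
implied_kl_coef = -log["objective/non_score_reward"] / log["objective/kl"]

# Assumption: rlhf_reward = scores + non_score_reward.
recomposed_reward = log["objective/scores"] + log["objective/non_score_reward"]

# Share of the total (negative) reward that comes from the KL penalty.
kl_share = log["objective/non_score_reward"] / log["objective/rlhf_reward"]

print(f"implied KL coefficient: {implied_kl_coef:.4f}")
print(f"recomposed RLHF reward: {recomposed_reward:.4f}")
print(f"logged RLHF reward:     {log['objective/rlhf_reward']:.4f}")
print(f"KL penalty share:       {kl_share:.1%}")
```

If the decomposition holds, most of the negative RLHF reward in these steps comes from the KL penalty rather than from the reward model score, which is one reason the visible change in the answers can be small.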

In [86]:
messages = [{"role": "user", "content": "What's your morning routine like?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(DEVICE)

# Pass input_ids together with attention_mask: the pad and EOS tokens coincide,
# so generate() cannot infer the mask on its own.
generated_ids = policy.generate(**model_inputs, max_new_tokens=256, do_sample=False)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

init_generated_ids = ref_policy.generate(**model_inputs, max_new_tokens=256, do_sample=False)
init_response = tokenizer.batch_decode(init_generated_ids, skip_special_tokens=True)[0]



In [87]:
print("======== BEFORE TUNING ========")
print(init_response)
print()

print("======== AFTER TUNING ========")
print(response)


user
What's your morning routine like?
assistant
I'm excited to start my morning routine! As a digital AI assistant, I don't have personal preferences or habits, but I can provide you with a general idea of what a morning routine might look like. Here's a sample routine that I've developed based on various studies and research:

**Morning Routine (10-15 minutes)**

1. **Hydrate**: Drink a full glass of water or a herbal tea (e.g., chamomile, peppermint) to start the day.
2. **Eat a nutritious breakfast**: Prepare a healthy breakfast, such as oatmeal with fruit, scrambled eggs with spinach, or Greek yogurt with berries.
3. **Get some morning sunlight**: Spend 10-15 minutes outside in natural light to help regulate your circadian rhythms.
4. **Take a few deep breaths**: Inhale for 1-2 minutes, hold for 2-3 minutes, and exhale for 2-3 minutes.
5. **Stretch or move**: Engage in some light stretching or movement to get your blood flowing and your muscles moving.

**Morning Routine (15-30 mi

In [92]:
# Upload everything to the Hub

policy.push_to_hub(f"{REPO_NAME}-ppo")
tokenizer.push_to_hub(f"{REPO_NAME}-ppo")


model.safetensors:   0%|          | 0.00/538M [00:00<?, ?B/s][A'(MaxRetryError("HTTPSConnectionPool(host='hf-hub-lfs-us-east-1.s3-accelerate.amazonaws.com', port=443): Max retries exceeded with url: /repos/70/bd/70bd266dc809f8a11fb553297ae26d8a82379282d923f719ab06e43abbc6deff/6073b1629cabecc191fa34d0546eca59182ad1e36154cee0622da1e34e2d11d5?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA2JU7TKAQLC2QXPN7%2F20250328%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250328T213618Z&X-Amz-Expires=86400&X-Amz-Signature=7debd6bb29b64fcbffd3658ce73f9a2c0d80bc8ad70ab59dd27c759d28f6f403&X-Amz-SignedHeaders=host&partNumber=1&uploadId=XHMPmJIPxxchUWj5quXFCqc_dRueSQ2J7waOxxm82Ts0fcrOtlhv_fWenVZdQpMc5PSWFHpGX7r3Xr.8joK9Q1_OqlKK1RTy6Jt96AXR4QPxp5HUZbgmnPWEBHP1oXjd&x-id=UploadPart (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2426)')))"), '(Request ID: 2223f3d6-02ad-4194-905c-d29e7978810d)')' thrown while requesting PUT https

CommitInfo(commit_url='https://huggingface.co/fridalex/llm-course-hw2-ppo/commit/9cfb9bbcc02af8c00e3ad6da120d64cfdab0d755', commit_message='Upload tokenizer', commit_description='', oid='9cfb9bbcc02af8c00e3ad6da120d64cfdab0d755', pr_url=None, repo_url=RepoUrl('https://huggingface.co/fridalex/llm-course-hw2-ppo', endpoint='https://huggingface.co', repo_type='model', repo_id='fridalex/llm-course-hw2-ppo'), pr_revision=None, pr_num=None)

## Model analysis [2 points]

Analyze the final models (from DPO and PPO).
Plot the log-probabilities for data from the training set and for outside data that the model has not seen.
Any reasonably small dataset from Hugging Face will do.

Does the final model consider data from the training set more probable?
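One way to make such a comparison fair is to score only real response tokens and to normalize by length. The sketch below is a minimal illustration on random tensors. `sequence_log_probs` is a hypothetical helper, not the notebook's `get_log_prob`, and it assumes raw logits as input rather than probabilities.

```python
import math

import torch
import torch.nn.functional as F


def sequence_log_probs(logits, labels, mask):
    """Sum log p(label_t) over positions where mask == 1.

    logits: (batch, seq, vocab) raw scores; labels and mask: (batch, seq).
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Pick the log-probability of the observed token at every position.
    token_log_probs = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    # Zero out padding (and, if desired, prompt) positions before summing.
    return (token_log_probs * mask).sum(dim=-1)


torch.manual_seed(0)
vocab_size = 11
logits = torch.randn(2, 5, vocab_size)
labels = torch.randint(0, vocab_size, (2, 5))
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]], dtype=torch.float)

summed = sequence_log_probs(logits, labels, mask)
# Per-token average makes sequences of different lengths comparable.
per_token = summed / mask.sum(dim=-1)

# Sanity check: uniform logits give exactly -log(vocab_size) per scored token.
uniform = sequence_log_probs(
    torch.zeros(1, 4, vocab_size),
    torch.zeros(1, 4, dtype=torch.long),
    torch.tensor([[1.0, 1.0, 0.0, 0.0]]),
)
print(summed, per_token, uniform.item(), -2 * math.log(vocab_size))
```

Running the same function with the tuned and the reference model on both a training batch and a held-out batch gives four sets of per-token values that can be plotted side by side.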

In [None]:
ref_model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(f"{REPO_NAME}-dpo")
model = AutoModelForCausalLM.from_pretrained(f"{REPO_NAME}-dpo")

In [25]:
BATCH_SIZE = 8  # in colab make it smaller, or implement grad accumulation
NUM_EPOCHS = 1
LR = 5e-5
MAX_SEQ_LEN = 1024  # this also can be adjusted
MAX_PROMPT_LEN = 256 # this also can be adjusted
MAX_COMPLETION_LEN = None
BETA = 0.1

dataset = dataset.map(apply_chat_template, fn_kwargs={"tokenizer": tokenizer})
dataset = dataset.map(
    tokenize_row,
    fn_kwargs={
        "tokenizer": tokenizer,
        "max_prompt_length": MAX_PROMPT_LEN,
        "max_completion_length": MAX_COMPLETION_LEN,
    },
    remove_columns=["prompt", "chosen", "rejected"],
)

Map: 100%|██████████| 10884/10884 [00:02<00:00, 4741.40 examples/s]
Map: 100%|██████████| 10884/10884 [00:23<00:00, 466.14 examples/s]


In [43]:
dataloader = DataLoader(
    dataset.with_format("torch"),
    batch_size=BATCH_SIZE,
    shuffle=True,
    pin_memory=False,
    collate_fn=partial(pad_collate_fn, pad_token_id=tokenizer.pad_token_id),
)

In [26]:
model.eval()
ref_model.eval()

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 576, padding_idx=2)
    (layers): ModuleList(
      (0-29): 30 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=576, out_features=576, bias=False)
          (k_proj): Linear(in_features=576, out_features=192, bias=False)
          (v_proj): Linear(in_features=576, out_features=192, bias=False)
          (o_proj): Linear(in_features=576, out_features=576, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
          (up_proj): Linear(in_features=576, out_features=1536, bias=False)
          (down_proj): Linear(in_features=1536, out_features=576, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((576,), eps=1e-05)
    (rotary_emb)

In [44]:
batch = next(iter(dataloader))

In [45]:
chosen_input = torch.cat([batch['prompt_input_ids'], batch['chosen_input_ids']], dim=-1)
chosen_mask = torch.cat([batch['prompt_attn_mask'], batch['chosen_attn_mask']], dim=-1)

rejected_input = torch.cat([batch['prompt_input_ids'], batch['rejected_input_ids']], dim=-1)
rejected_mask = torch.cat([batch['prompt_attn_mask'], batch['rejected_attn_mask']], dim=-1)

In [None]:
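with torch.no_grad():
    # Call the modules directly instead of .forward, and pass the attention mask
    # so that padded positions do not influence the outputs.
    logits = model(chosen_input[:, :-1], attention_mask=chosen_mask[:, :-1])
    logits_ref = ref_model(chosen_input[:, :-1], attention_mask=chosen_mask[:, :-1])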

In [56]:
get_log_prob(torch.softmax(logits['logits'], dim=-1), chosen_input[:, 1:], torch.ones(chosen_input[:, 1:].shape))

tensor([-8094.2749, -7856.6221, -7447.6753, -9043.6533, -7961.4731, -9195.3730,
        -9186.6328, -8977.4512])

In [57]:
get_log_prob(torch.softmax(logits_ref['logits'], dim=-1), chosen_input[:, 1:], torch.ones(chosen_input[:, 1:].shape))

tensor([ -759.9898,  -857.4566,  -877.3159,  -983.7149,  -633.6689, -1024.4534,
         -933.6204,  -628.6726])

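On this batch the tuned model's summed log-probabilities (about -7,400 to -9,200) are far lower than the reference model's (about -630 to -1,020), so these numbers alone do not show that the tuned model finds the training data more probable. Note that the mask here is all ones, so prompt and padding positions are included in the sums, and only a single batch of chosen responses is compared. Answering the question requires a masked, per-token comparison on both training data and held-out data.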
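The two printed tensors can also be compared element by element. The short sketch below copies the values from the outputs above and counts how often the tuned model assigns a higher (less negative) log-probability than the reference model. The variable names are illustrative.

```python
from statistics import mean

# Values copied from the two cell outputs above (same batch, same order).
tuned = [-8094.2749, -7856.6221, -7447.6753, -9043.6533,
         -7961.4731, -9195.3730, -9186.6328, -8977.4512]
reference = [-759.9898, -857.4566, -877.3159, -983.7149,
             -633.6689, -1024.4534, -933.6204, -628.6726]

# Positive gap means the tuned model finds the sequence more probable.
gaps = [t - r for t, r in zip(tuned, reference)]
tuned_higher = sum(g > 0 for g in gaps)

print(f"mean tuned log-prob:     {mean(tuned):.1f}")
print(f"mean reference log-prob: {mean(reference):.1f}")
print(f"sequences where tuned > reference: {tuned_higher} of {len(gaps)}")
```

The same comparison on per-token averages, with padding masked out, would be a more reliable basis for a conclusion than raw sums.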

# Extra points

You can also earn extra points:
- Polish your repositories on 🤗 (you can make a collection, since there are 3 repositories): a model card with a description of the assignment, a quality report, and generation examples **[2 points]**

# Special section for the grader

In [14]:
device = torch.device("cuda")

DPO_REPO_NAME = f"{REPO_NAME}-dpo"
PPO_REPO_NAME = f"{REPO_NAME}-ppo"
REWARD_MODEL_REPO_NAME = f"{REPO_NAME}-reward-model"

tokenizer = AutoTokenizer.from_pretrained(DPO_REPO_NAME)
check_model = AutoModelForCausalLM.from_pretrained(DPO_REPO_NAME)
check_model = check_model.to(device)
check_model = check_model.eval()

In [20]:
check_model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 576, padding_idx=2)
    (layers): ModuleList(
      (0-29): 30 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=576, out_features=576, bias=False)
          (k_proj): Linear(in_features=576, out_features=192, bias=False)
          (v_proj): Linear(in_features=576, out_features=192, bias=False)
          (o_proj): Linear(in_features=576, out_features=576, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
          (up_proj): Linear(in_features=576, out_features=1536, bias=False)
          (down_proj): Linear(in_features=1536, out_features=576, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((576,), eps=1e-05)
    (rotary_emb)

In [29]:
messages = [{"role": "user", "content": "What's your morning routine like?"}]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = check_model.generate(**model_inputs, max_new_tokens=256, do_sample=False)
response = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

In [31]:
print(response)

user
What's your morning routine like?
assistant
I'm an artificial intelligence language model, and as such, I don't have personal preferences or emotions. I am designed to provide information and assist with inquiries to the best of my abilities, without personal biases or opinions. I am here to provide accurate and helpful responses to your questions, and I do not have the capacity to engage in leisure activities or personal pursuits. My purpose is to provide helpful and informative responses to your inquiries, and I do not have the capacity to relax or engage in leisure activities. If you have any specific questions or topics you would like to discuss, I am here to assist you. Please let me know if there is a particular subject or topic you would like to explore, and I will do my best to provide a helpful and informative response. Thank you for your inquiry, and I look forward to hearing from you. If you have any further questions or topics you would like to discuss, I am here to as