## Open notebook in:
| Colab |
|:------|
| [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nicolepcx/Transformers-in-Action/blob/main/CH08/ch08_ray_tune.ipynb) |


# About this notebook

This notebook is inspired by two examples from Ray:

* [Example one](https://docs.ray.io/en/latest/train/getting-started-transformers.html)
* [Example two](https://docs.ray.io/en/latest/train/examples/transformers/huggingface_text_classification.html)

You will use one task (`cb`) from the [SuperGLUE dataset](https://huggingface.co/datasets/super_glue).

This notebook shows the following steps:
1. Set up Ray
2. Load the dataset and process it with Ray Data
3. Run the training with Ray Train
4. Tune the hyperparameters with Ray Tune



In [None]:
!pip install evaluate -qqq


In [None]:
!pip install ray -qqq


In [None]:
!pip install datasets -U



## Set up Ray

Use `ray.init()` to initialize a local cluster. By default, this cluster contains only the machine you are running this notebook on.

In [None]:
from pprint import pprint
import ray

ray.init(num_cpus=1, num_gpus=1, include_dashboard=False)

2025-07-14 05:29:03,365	INFO worker.py:1917 -- Started a local Ray instance.


Python version: 3.11.13
Ray version: 2.47.1


Check which resources the cluster has. If you are running this notebook on your local machine or on Google Colab, you should see the number of CPU cores and GPUs available on your machine.

In [None]:
pprint(ray.cluster_resources())

{'CPU': 1.0,
 'GPU': 1.0,
 'accelerator_type:A100': 1.0,
 'memory': 62497129677.0,
 'node:172.28.0.12': 1.0,
 'node:__internal_head__': 1.0,
 'object_store_memory': 26784484147.0}
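
The `memory` and `object_store_memory` entries are reported in bytes. As a small convenience sketch, the values printed above can be converted to GiB. The helper name `bytes_to_gib` is hypothetical and not part of Ray.

```python
def bytes_to_gib(num_bytes: float) -> float:
    """Convert a byte count to gibibytes (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


# Values copied from the cluster_resources() output above.
resources = {
    "memory": 62497129677.0,
    "object_store_memory": 26784484147.0,
}

for key, value in resources.items():
    print(f"{key}: {bytes_to_gib(value):.2f} GiB")
```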


# Imports

In [None]:
import os
import torch
import numpy as np

import ray.data
from datasets import load_dataset
from transformers import AutoTokenizer
import wandb

import evaluate
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from huggingface_hub import notebook_login

import ray.train
from ray.train import RunConfig, ScalingConfig, CheckpointConfig, Checkpoint
from ray.train.huggingface.transformers import prepare_trainer, RayTrainReportCallback
from ray.train.torch import TorchTrainer

from ray import tune
from ray.tune import Tuner
from ray.tune.schedulers.async_hyperband import ASHAScheduler


In [None]:
use_gpu = True  # set this to False to run on CPUs
num_workers = 1  # set this to number of GPUs or CPUs you want to use

## Log in to Weights & Biases

In [None]:
wandb.login(key='your_key')  # replace 'your_key' with your own W&B API key

[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mnicolepcx[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

## Set the task for SuperGLUE

In [None]:
task = "cb"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16


## Load the dataset

In [None]:
hf_dataset = load_dataset("super_glue", task)


In [None]:
def is_valid(example):
    return isinstance(example.get("premise"), str) and isinstance(example.get("hypothesis"), str)

hf_dataset["train"] = hf_dataset["train"].filter(is_valid)
hf_dataset["validation"] = hf_dataset["validation"].filter(is_valid)
hf_dataset["test"] = hf_dataset["test"].filter(is_valid)
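
To see what `is_valid` keeps and drops, here is a small sketch on plain Python dictionaries. The three example rows are hypothetical and only illustrate the check; they are not taken from SuperGLUE.

```python
def is_valid(example):
    # Same check as in the cell above: both text fields must be strings.
    return isinstance(example.get("premise"), str) and isinstance(example.get("hypothesis"), str)


# Hypothetical rows: one complete, one with a None premise, one missing the premise key.
rows = [
    {"premise": "It rained all night.", "hypothesis": "The ground is wet.", "label": 0},
    {"premise": None, "hypothesis": "The ground is wet.", "label": 1},
    {"hypothesis": "The ground is wet.", "label": 2},
]

kept = [row for row in rows if is_valid(row)]
print(len(kept), "of", len(rows), "rows kept")
```

In this run the filter leaves the split sizes unchanged (250, 56, and 250 rows), as the Ray dataset summary later in the notebook shows.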



## Preprocessing the data with Ray Data

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)


In [None]:
ray_datasets = {
    "train": ray.data.from_items(hf_dataset["train"].to_list()),
    "validation": ray.data.from_items(hf_dataset["validation"].to_list()),
    "test": ray.data.from_items(hf_dataset["test"].to_list()),
}
ray_datasets

{'train': MaterializedDataset(
    num_blocks=200,
    num_rows=250,
    schema={premise: string, hypothesis: string, idx: int64, label: int64}
 ),
 'validation': MaterializedDataset(
    num_blocks=56,
    num_rows=56,
    schema={premise: string, hypothesis: string, idx: int64, label: int64}
 ),
 'test': MaterializedDataset(
    num_blocks=200,
    num_rows=250,
    schema={premise: string, hypothesis: string, idx: int64, label: int64}
 )}

## Preprocess the samples

In [None]:
def tokenize_fn(batch):
    premises = batch["premise"]
    hypotheses = batch["hypothesis"]
    labels = batch["label"]

    # Batches may arrive as NumPy arrays; convert them to Python lists, which both the tokenizer and torch.tensor accept
    premises = premises.tolist() if isinstance(premises, np.ndarray) else premises
    hypotheses = hypotheses.tolist() if isinstance(hypotheses, np.ndarray) else hypotheses
    labels = labels.tolist() if isinstance(labels, np.ndarray) else labels

    # Tokenize
    tokenized = tokenizer(
        premises,
        hypotheses,
        truncation=True,
        padding="longest",
        return_tensors="pt"
    )

    # Add labels
    tokenized["labels"] = torch.tensor(labels, dtype=torch.long)
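    # Move tensors to the GPU when one is available; otherwise keep them on the CPU
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenized = {k: v.to(device) for k, v in tokenized.items()}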
    return tokenized
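
The `padding="longest"` argument pads every sequence in a batch to the length of the longest one. The following is a minimal sketch of that idea in plain Python, not the tokenizer's actual implementation. The function name `pad_to_longest`, the token IDs, and the pad ID of 0 are assumptions made for illustration.

```python
from typing import List, Tuple


def pad_to_longest(sequences: List[List[int]], pad_id: int = 0) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad each sequence to the longest length and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# Hypothetical token IDs standing in for two tokenized premise/hypothesis pairs.
batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
input_ids, attention_mask = pad_to_longest(batch)
print(input_ids)
print(attention_mask)
```

Padding per batch instead of to a fixed maximum length keeps short batches small, which is one reason to tokenize inside the collate function.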


## Fine-tuning the model with Ray Train

Now that the data is ready, download the pretrained model and fine-tune it.

You need to define the training logic as a function (`train_func`). You pass this training function to the `ray.train.torch.TorchTrainer`, which runs it on every Ray worker.


In [None]:
num_labels = 3
metric_name = "accuracy"

model_name = model_checkpoint.split("/")[-1]
validation_key = "validation"

name = f"{model_name}-finetuned-{task}"

# Calculate the maximum steps per epoch based on the number of rows in the training dataset.
max_steps_per_epoch = ray_datasets["train"].count() // (batch_size * num_workers)


def train_func(config):
    print(f"CUDA available: {torch.cuda.is_available()}")

    # Load metric
    metric = evaluate.load("super_glue", config_name=task)

    # Tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_checkpoint, num_labels=num_labels
    )

    # Load this worker's shards of the Ray datasets
    train_ds = ray.train.get_dataset_shard("train")
    eval_ds = ray.train.get_dataset_shard("eval")  # named eval_ds to avoid shadowing the built-in eval()

    train_iterable = train_ds.iter_torch_batches(
        batch_size=batch_size, collate_fn=tokenize_fn
    )
    eval_iterable = eval_ds.iter_torch_batches(
        batch_size=batch_size, collate_fn=tokenize_fn
    )

    # Training arguments
    args = TrainingArguments(
        name,
        eval_strategy="epoch",
        save_strategy="epoch",
        logging_strategy="epoch",
        per_device_train_batch_size=config.get("batch_size", 64),
        per_device_eval_batch_size=config.get("batch_size", 64),
        learning_rate=config.get("learning_rate", 2e-5),
        num_train_epochs=config.get("epochs", 6),
        weight_decay=config.get("weight_decay", 0.001),
        max_steps=max_steps_per_epoch * config.get("epochs", 6),
        disable_tqdm=False,
        use_cpu=not torch.cuda.is_available(),  # replaces the deprecated no_cuda argument
        report_to="wandb",
        run_name="superglue_cb"
    )

    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)
        return metric.compute(predictions=predictions, references=labels)

    # Trainer setup
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_iterable,
        eval_dataset=eval_iterable,
        compute_metrics=compute_metrics,
    )
    trainer.add_callback(RayTrainReportCallback())
    trainer = prepare_trainer(trainer)

    print("Beginning training...")
    trainer.train()

    # Final evaluation
    eval_metrics = trainer.evaluate()
    print("Final evaluation:", eval_metrics)

    # Save HF-compatible model directory
    output_dir = os.path.join(ray.train.get_context().get_trial_dir(), "hf_model")
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

    # Report metrics + HF-style checkpoint to Ray Tune
    ray.train.report(
        {
            "eval_loss": eval_metrics.get("eval_loss"),
            "eval_accuracy": eval_metrics.get("eval_accuracy"),
        },
        checkpoint=Checkpoint.from_directory(output_dir)
    )
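
The training arguments set `max_steps` explicitly, which is the usual approach when the training data is an iterator without a known length. The arithmetic behind `max_steps_per_epoch` can be sketched with this notebook's values (250 training rows, `batch_size = 16`, `num_workers = 1`). The helper name `steps_per_epoch` is hypothetical, and the 6 epochs are the default used by `config.get("epochs", 6)`.

```python
def steps_per_epoch(num_rows: int, batch_size: int, num_workers: int) -> int:
    """Floor division: a trailing partial batch is not counted as a step."""
    return num_rows // (batch_size * num_workers)


num_rows = 250      # size of the training split in this notebook
batch_size = 16
num_workers = 1
epochs = 6          # default from config.get("epochs", 6)

per_epoch = steps_per_epoch(num_rows, batch_size, num_workers)
max_steps = per_epoch * epochs
leftover_rows = num_rows - per_epoch * batch_size * num_workers

print(f"steps per epoch: {per_epoch}")
print(f"max_steps: {max_steps}")
print(f"rows not covered by full batches: {leftover_rows}")
```

With more workers, each worker processes its own shard, so the number of steps per epoch shrinks accordingly.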


## Setup TorchTrainer


With `train_func` defined, create an instance of `ray.train.torch.TorchTrainer`. In addition to the training function, pass a `scaling_config` that controls the number of workers and the resources they use, and the `datasets` to use for training and evaluation.

In [None]:
trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    datasets={
        "train": ray_datasets["train"],
        "eval": ray_datasets["validation"],
    },
    run_config=RunConfig(
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
            checkpoint_score_attribute="eval_loss",
            checkpoint_score_order="min",
        ),
    ),
)
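
The `CheckpointConfig` above asks Ray to keep only one checkpoint, ranked by `eval_loss` with lower values being better. The retention rule can be sketched in plain Python as follows. This is an illustration of the idea, not Ray's implementation; the function name `keep_best` and the loss values are hypothetical.

```python
from typing import Dict, List


def keep_best(reports: List[Dict], num_to_keep: int, score_attribute: str, score_order: str = "min") -> List[Dict]:
    """Rank reports by score_attribute and keep the best num_to_keep of them."""
    ranked = sorted(reports, key=lambda r: r[score_attribute], reverse=(score_order == "max"))
    return ranked[:num_to_keep]


# Hypothetical eval_loss values, one per saved checkpoint.
reports = [
    {"checkpoint": "checkpoint_000000", "eval_loss": 0.91},
    {"checkpoint": "checkpoint_000001", "eval_loss": 0.83},
    {"checkpoint": "checkpoint_000002", "eval_loss": 0.85},
]

kept = keep_best(reports, num_to_keep=1, score_attribute="eval_loss", score_order="min")
print(kept)
```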

## Tune hyperparameters with Ray Tune

To tune the model's hyperparameters, pass your `TorchTrainer` to a `Tuner` and define the search space. The search space below contains a few example hyperparameters.

Ray Tune also provides search algorithms and schedulers. Here, an `ASHAScheduler` stops poorly performing trials early so that resources go to the more promising ones.
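
The idea behind early stopping with successive halving can be sketched in a few lines. This is a simplified, synchronous version for illustration only: Ray's `ASHAScheduler` works asynchronously and differs in its details. The function names and the loss curves are hypothetical; the values `grace_period=1`, `reduction_factor=2`, and `max_t=8` match the scheduler settings used in this notebook.

```python
from typing import Dict, List


def rung_milestones(grace_period: int, reduction_factor: int, max_t: int) -> List[int]:
    """Iterations at which trials are compared: grace_period * reduction_factor**k, up to max_t."""
    milestones = []
    t = grace_period
    while t <= max_t:
        milestones.append(t)
        t *= reduction_factor
    return milestones


def successive_halving(losses: Dict[str, List[float]], milestones: List[int], reduction_factor: int) -> List[str]:
    """At each milestone, keep the best 1/reduction_factor of the surviving trials (lower loss is better)."""
    survivors = list(losses)
    for t in milestones:
        ranked = sorted(survivors, key=lambda name: losses[name][t - 1])
        keep = max(1, len(ranked) // reduction_factor)
        survivors = ranked[:keep]
    return survivors


# Hypothetical per-iteration eval losses for four trials.
losses = {
    "trial_a": [0.90, 0.80, 0.70, 0.60, 0.55, 0.50, 0.45, 0.40],
    "trial_b": [1.00, 0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65],
    "trial_c": [0.85, 0.84, 0.83, 0.82, 0.81, 0.80, 0.79, 0.78],
    "trial_d": [1.20, 1.10, 1.00, 0.90, 0.80, 0.70, 0.60, 0.50],
}

milestones = rung_milestones(grace_period=1, reduction_factor=2, max_t=8)
survivors = successive_halving(losses, milestones, reduction_factor=2)
print(milestones, survivors)
```

Note that `trial_d` is dropped at the first milestone even though it ends with a low loss. An aggressive `grace_period` trades some risk of discarding slow starters for a shorter total tuning time.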

In [None]:
tuner = Tuner(
    trainer,
    param_space={
        "train_loop_config": {
            "learning_rate": tune.grid_search([2e-5, 2e-4, 2e-3, 2e-2]),
            "epochs": tune.choice([2, 4, 6, 8]),  # Experiment with these epoch values
            "batch_size": tune.choice([16, 32, 64, 128]),  # Experiment with these batch sizes
            "weight_decay": tune.grid_search([0.0, 0.01, 0.1, 0.001])  # Experiment with these weight decay values
        }
    },
    tune_config=tune.TuneConfig(
        metric="eval_loss",
        mode="min",
        num_samples=1,
        scheduler=ASHAScheduler(
            max_t=max([2, 4, 6, 8]),  # Set to the maximum value of epochs being considered
            grace_period=1,  # Minimum number of epochs to evaluate before considering stopping
            reduction_factor=2,  # Reduction factor for pruning
        ),
    ),
    run_config=RunConfig(
        name="tune_transformers",
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
            checkpoint_score_attribute="eval_loss",
            checkpoint_score_order="min",
        ),
    ),
)


2025-07-14 05:29:32,081	INFO tuner_internal.py:427 -- A `RunConfig` was passed to both the `Tuner` and the `TorchTrainer`. The run config passed to the `Tuner` is the one that will be used.
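
The number of trials follows from the search space: every `tune.grid_search` parameter is expanded exhaustively, while every `tune.choice` parameter is sampled once per trial. With two grids of four values each and `num_samples=1`, this gives 16 trials, which matches the "Number of trials" reported by the tuner. The sketch below reproduces that expansion with the standard library only; it is not how Ray Tune generates variants internally, and the fixed random seed is an assumption made for reproducibility.

```python
import itertools
import random

learning_rates = [2e-5, 2e-4, 2e-3, 2e-2]   # grid_search
weight_decays = [0.0, 0.01, 0.1, 0.001]     # grid_search
epoch_choices = [2, 4, 6, 8]                # choice
batch_choices = [16, 32, 64, 128]           # choice
num_samples = 1

rng = random.Random(0)  # fixed seed, only so the sketch is repeatable
trials = []
for _ in range(num_samples):
    # Every combination of the grid parameters becomes its own trial.
    for lr, wd in itertools.product(learning_rates, weight_decays):
        trials.append(
            {
                "learning_rate": lr,
                "weight_decay": wd,
                # Sampled parameters are drawn independently for each trial.
                "epochs": rng.choice(epoch_choices),
                "batch_size": rng.choice(batch_choices),
            }
        )

print(len(trials))
```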


## Tune the model

In [None]:
tune_results = tuner.fit()

2025-07-14 05:29:32,126	INFO tensorboardx.py:193 -- pip install "ray[tune]" to see TensorBoard files.


+----------------------------------------------------------+
| Configuration for experiment     tune_transformers       |
+----------------------------------------------------------+
| Search algorithm                 BasicVariantGenerator   |
| Scheduler                        AsyncHyperBandScheduler |
| Number of trials                 16                      |
+----------------------------------------------------------+

View detailed results here: /root/ray_results/tune_transformers

Trial status: 16 PENDING
Current time: 2025-07-14 05:29:32. Total running time: 0s
Logical resource usage: 0/1 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A100)
+-----------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                 status       ...oop_config/epochs     ...config/batch_size     ...fig/learning_rate     ...nfig/weight_decay |
+--------------------------------------------------------------------



Trial TorchTrainer_853b7_00000 started with configuration:
+-------------------------------------------------+
| Trial TorchTrainer_853b7_00000 config           |
+-------------------------------------------------+
| train_loop_config/batch_size                128 |
| train_loop_config/epochs                      6 |
| train_loop_config/learning_rate           2e-05 |
| train_loop_config/weight_decay                0 |
+-------------------------------------------------+


[36m(RayTrainWorker pid=3139)[0m Setting up process group for: env:// [rank=0, world_size=1]
[36m(TorchTrainer pid=3042)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=3042)[0m - (node_id=dfac881cdf0b95ec3c65ef7c281e0d5ea5b341dd9a6d97ecd807b1a3, ip=172.28.0.12, pid=3139) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=3139)[0m 2025-07-14 05:29:47.943008: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(RayTrainWorker pid=3139)[0m 2025-07-14 05:29:47.959585: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[36m(RayTrainWorker pid=3139)[0m E0000 00:00:1752470987.980667    3139 cuda_dnn.cc:8310] 

[36m(RayTrainWorker pid=3139)[0m CUDA available: True


Downloading builder script: 9.64kB [00:00, 27.5MB/s]
Downloading extra modules: 3.72kB [00:00, 17.5MB/s]
[36m(RayTrainWorker pid=3139)[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
[36m(RayTrainWorker pid=3139)[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[36m(RayTrainWorker pid=3139)[0m Beginning training...


[36m(RayTrainWorker pid=3139)[0m wandb: Currently logged in as: nicolepcx to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
[36m(RayTrainWorker pid=3139)[0m wandb: Tracking run with wandb version 0.21.0
[36m(RayTrainWorker pid=3139)[0m wandb: Run data is saved locally in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/artifacts/2025-07-14_05-29-32/tune_transformers/working_dirs/TorchTrainer_853b7_00000_0_batch_size=128,epochs=6,learning_rate=0.0000,weight_decay=0.0000_2025-07-14_05-29-32/wandb/run-20250714_052957-bxurb490
[36m(RayTrainWorker pid=3139)[0m wandb: Run `wandb offline` to turn off syncing.
[36m(RayTrainWorker pid=3139)[0m wandb: Syncing run superglue_cb
[36m(RayTrainWorker pid=3139)[0m wandb: ⭐️ View project at https://wandb.ai/nicolepcx/huggingface
[36m(RayTrainWorker pid=3139)[0m wandb: 🚀 View run at https://wandb.ai/nicolepcx/huggingface/runs/bxurb490
(SplitCoordinator pid=3205) Registered dataset logger for dataset train_3_0
(SplitCoordinator pid=3205) Starting execution of Dataset train_3_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
(SplitCoordinator pid=3205) Execution plan of Dataset train_3_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=3205) ✔️  Dataset train_3_0 execution finished in 1.81 seconds

(RayTrainWorker pid=3139) {'loss': 1.083, 'grad_norm': 2.8524980545043945, 'learning_rate': 1.6666666666666667e-05, 'epoch': 0.18}
(RayTrainWorker pid=3139)  18%|█▊        | 16/90 [00:02<00:06, 10.75it/s]
(RayTrainWorker pid=3139) {'eval_loss': 0.9082885384559631, 'eval_accuracy': 0.6964285714285714, 'eval_f1': 0.48646125116713357, 'eval_runtime': 0.2217, 'eval_samples_per_second': 252.58, 'eval_steps_per_second': 4.51, 'epoch': 0.18}

Trial status: 1 RUNNING | 15 PENDING
Current time: 2025-07-14 05:30:02. Total running time: 30s
Logical resource usage: 0/1 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:A100)
+-----------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                 status       ...oop_config/epochs     ...config/batch_size     ...fig/learning_rate     ...nfig/weight_decay |
+-----------------------------------------------------------------------------------------------------------------------------------------+
| TorchTrainer_853b7_00000   RUNNING                         6                      128                   2e-05                     0     |
| TorchTrainer_853b7_00

[36m(RayTrainWorker pid=3139)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00000_0_batch_size=128,epochs=6,learning_rate=0.0000,weight_decay=0.0000_2025-07-14_05-29-32/checkpoint_000000)
[36m(SplitCoordinator pid=3255)[0m Registered dataset logger for dataset eval_4_0
[36m(SplitCoordinator pid=3255)[0m Starting execution of Dataset eval_4_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=3255)[0m Execution plan of Dataset eval_4_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=3255)[0m ✔️  Dataset eval_4_0 execution finished in 0.12 seconds


[2m[36m(pid=3205) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=3205) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=3205)[0m Registered dataset logger for dataset train_3_1
[36m(SplitCoordinator pid=3205)[0m Starting execution of Dataset train_3_1. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=3205)[0m Execution plan of Dataset train_3_1: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
 19%|█▉        | 17/90 [00:08<01:13,  1.01s/it]
 21%|██        | 19/90 [00:08<00:50,  1.40it/s]
 23%|██▎       | 21/90 [00:08<00:35,  1.95it/s]
 26%|██▌       | 23/90 [00:08<00:24,  2.68it/s]
[36m(SplitCoordinator pid=3205)[0m ✔️  Dataset train_3_1 execution finished in 0.53 seconds
 28%|██▊       | 25/90 [00:08<00:18,  3.57it/s]
 30%|███       | 27/90 [00:09<00:13,  4.57it/s]


[36m(RayTrainWorker pid=3139)[0m {'loss': 0.7965, 'grad_norm': 2.860023260116577, 'learning_rate': 1.3111111111111113e-05, 'epoch': 1.18}


[36m(RayTrainWorker pid=3139)[0m  32%|███▏      | 29/90 [00:09<00:10,  5.76it/s]
[36m(RayTrainWorker pid=3139)[0m  34%|███▍      | 31/90 [00:09<00:08,  7.17it/s]                                                36%|███▌      | 32/90 [00:09<00:08,  7.17it/s]


[2m[36m(pid=3255) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=3255) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=3139)[0m                                                 36%|███▌      | 32/90 [00:09<00:08,  7.17it/s]


[36m(RayTrainWorker pid=3139)[0m {'eval_loss': 0.8321595191955566, 'eval_accuracy': 0.6785714285714286, 'eval_f1': 0.47391812865497074, 'eval_runtime': 0.1922, 'eval_samples_per_second': 291.292, 'eval_steps_per_second': 5.202, 'epoch': 1.18}


[36m(RayTrainWorker pid=3139)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00000_0_batch_size=128,epochs=6,learning_rate=0.0000,weight_decay=0.0000_2025-07-14_05-29-32/checkpoint_000001)
[36m(SplitCoordinator pid=3255)[0m Registered dataset logger for dataset eval_4_1
[36m(SplitCoordinator pid=3255)[0m Starting execution of Dataset eval_4_1. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=3255)[0m Execution plan of Dataset eval_4_1: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=3255)[0m ✔️  Dataset eval_4_1 execution finished in 0.12 seconds


[2m[36m(pid=3205) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=3205) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=3205)[0m Registered dataset logger for dataset train_3_2
[36m(SplitCoordinator pid=3205)[0m Starting execution of Dataset train_3_2. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=3205)[0m Execution plan of Dataset train_3_2: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
 37%|███▋      | 33/90 [00:14<00:52,  1.08it/s]
 39%|███▉      | 35/90 [00:14<00:36,  1.50it/s]
 41%|████      | 37/90 [00:15<00:25,  2.07it/s]
 43%|████▎     | 39/90 [00:15<00:18,  2.81it/s]
[36m(SplitCoordinator pid=3205)[0m ✔️  Dataset train_3_2 execution finished in 0.54 seconds
 46%|████▌     | 41/90 [00:15<00:13,  3.72it/s]
 48%|████▊     | 43/90 [00:15<00:09,  4.73it/s]
 50%|█████     | 45/90 [00:15<00:07,  5.93it/s]


[36m(RayTrainWorker pid=3139)[0m {'loss': 0.6909, 'grad_norm': 2.717909812927246, 'learning_rate': 9.555555555555556e-06, 'epoch': 2.18}


[36m(RayTrainWorker pid=3139)[0m  52%|█████▏    | 47/90 [00:15<00:05,  7.39it/s]
[36m(RayTrainWorker pid=3139)[0m                                                 53%|█████▎    | 48/90 [00:15<00:05,  7.39it/s]


[2m[36m(pid=3255) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=3255) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=3139)[0m {'eval_loss': 0.831642746925354, 'eval_accuracy': 0.6785714285714286, 'eval_f1': 0.47391812865497074, 'eval_runtime': 0.1819, 'eval_samples_per_second': 307.855, 'eval_steps_per_second': 5.497, 'epoch': 2.18}


[36m(RayTrainWorker pid=3139)[0m                                                 53%|█████▎    | 48/90 [00:15<00:05,  7.39it/s]
[36m(RayTrainWorker pid=3139)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00000_0_batch_size=128,epochs=6,learning_rate=0.0000,weight_decay=0.0000_2025-07-14_05-29-32/checkpoint_000002)
[36m(SplitCoordinator pid=3255)[0m Registered dataset logger for dataset eval_4_2
[36m(SplitCoordinator pid=3255)[0m Starting execution of Dataset eval_4_2. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=3255)[0m Execution plan of Dataset eval_4_2: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=3255)[0m ✔️  Dataset eval_4_2 execution finished in 0.11 seconds


[2m[36m(pid=3205) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=3205) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=3205)[0m Registered dataset logger for dataset train_3_3
[36m(SplitCoordinator pid=3205)[0m Starting execution of Dataset train_3_3. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=3205)[0m Execution plan of Dataset train_3_3: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
 54%|█████▍    | 49/90 [00:21<00:37,  1.11it/s]
 57%|█████▋    | 51/90 [00:21<00:25,  1.53it/s]
 59%|█████▉    | 53/90 [00:21<00:17,  2.11it/s]
 61%|██████    | 55/90 [00:21<00:12,  2.87it/s]
[36m(SplitCoordinator pid=3205)[0m ✔️  Dataset train_3_3 execution finished in 0.52 seconds
 63%|██████▎   | 57/90 [00:21<00:08,  3.78it/s]
 66%|██████▌   | 59/90 [00:21<00:06,  4.80it/s]
 68%|██████▊   | 61/90 [00:21<00:04,  6.01it/s]


[36m(RayTrainWorker pid=3139)[0m {'loss': 0.6308, 'grad_norm': 3.127286672592163, 'learning_rate': 6e-06, 'epoch': 3.18}


[36m(RayTrainWorker pid=3139)[0m  70%|███████   | 63/90 [00:22<00:03,  7.47it/s]                                                71%|███████   | 64/90 [00:22<00:03,  7.47it/s]


[2m[36m(pid=3255) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=3255) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=3139)[0m                                                 71%|███████   | 64/90 [00:22<00:03,  7.47it/s]


[36m(RayTrainWorker pid=3139)[0m {'eval_loss': 0.8442627787590027, 'eval_accuracy': 0.6785714285714286, 'eval_f1': 0.47391812865497074, 'eval_runtime': 0.1928, 'eval_samples_per_second': 290.487, 'eval_steps_per_second': 5.187, 'epoch': 3.18}


[36m(RayTrainWorker pid=3139)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00000_0_batch_size=128,epochs=6,learning_rate=0.0000,weight_decay=0.0000_2025-07-14_05-29-32/checkpoint_000003)
[36m(SplitCoordinator pid=3255)[0m Registered dataset logger for dataset eval_4_3
[36m(SplitCoordinator pid=3255)[0m Starting execution of Dataset eval_4_3. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=3255)[0m Execution plan of Dataset eval_4_3: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=3255)[0m ✔️  Dataset eval_4_3 execution finished in 0.11 seconds


[2m[36m(pid=3205) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=3205) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=3205)[0m Registered dataset logger for dataset train_3_4
[36m(SplitCoordinator pid=3205)[0m Starting execution of Dataset train_3_4. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=3205)[0m Execution plan of Dataset train_3_4: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
 72%|███████▏  | 65/90 [00:27<00:22,  1.10it/s]
 74%|███████▍  | 67/90 [00:27<00:15,  1.53it/s]
 77%|███████▋  | 69/90 [00:27<00:09,  2.10it/s]
 79%|███████▉  | 71/90 [00:27<00:06,  2.86it/s]
[36m(SplitCoordinator pid=3205)[0m ✔️  Dataset train_3_4 execution finished in 0.53 seconds
 81%|████████  | 73/90 [00:27<00:04,  3.77it/s]
 83%|████████▎ | 75/90 [00:28<00:03,  4.79it/s]


[36m(RayTrainWorker pid=3139)[0m {'loss': 0.5967, 'grad_norm': 3.1859359741210938, 'learning_rate': 2.4444444444444447e-06, 'epoch': 4.18}


[36m(RayTrainWorker pid=3139)[0m  86%|████████▌ | 77/90 [00:28<00:02,  6.00it/s]
[36m(RayTrainWorker pid=3139)[0m  88%|████████▊ | 79/90 [00:28<00:01,  7.47it/s]                                                89%|████████▉ | 80/90 [00:28<00:01,  7.47it/s]


[2m[36m(pid=3255) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=3255) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=3139)[0m                                                 89%|████████▉ | 80/90 [00:28<00:01,  7.47it/s]


[36m(RayTrainWorker pid=3139)[0m {'eval_loss': 0.8476214408874512, 'eval_accuracy': 0.6785714285714286, 'eval_f1': 0.47391812865497074, 'eval_runtime': 0.1841, 'eval_samples_per_second': 304.253, 'eval_steps_per_second': 5.433, 'epoch': 4.18}


[36m(RayTrainWorker pid=3139)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00000_0_batch_size=128,epochs=6,learning_rate=0.0000,weight_decay=0.0000_2025-07-14_05-29-32/checkpoint_000004)
[36m(SplitCoordinator pid=3255)[0m Registered dataset logger for dataset eval_4_4
[36m(SplitCoordinator pid=3255)[0m Starting execution of Dataset eval_4_4. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=3255)[0m Execution plan of Dataset eval_4_4: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=3255)[0m ✔️  Dataset eval_4_4 execution finished in 0.11 seconds
[36m(SplitCoordinator pid=3205)[0m Registered dataset logger for dataset train_3_5


Trial status: 1 RUNNING | 15 PENDING
Current time: 2025-07-14 05:30:34. Total running time: 1min 2s
Logical resource usage: 0/1 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:A100)
Current best trial: 853b7_00000 with eval_loss=0.8476214408874512 and params={'train_loop_config': {'learning_rate': 2e-05, 'epochs': 6, 'batch_size': 128, 'weight_decay': 0.0}}
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                 status       ...oop_config/epochs     ...config/batch_size     ...fig/learning_rate     ...nfig/weight_decay     iter     total time (s)     loss     grad_norm     learning_rate     epoch |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

[2m[36m(pid=3205) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=3205) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=3205)[0m Starting execution of Dataset train_3_5. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=3205)[0m Execution plan of Dataset train_3_5: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=3205)[0m ✔️  Dataset train_3_5 execution finished in 0.57 seconds


[36m(RayTrainWorker pid=3139)[0m {'loss': 0.5069, 'grad_norm': 2.5853540897369385, 'learning_rate': 2.2222222222222224e-07, 'epoch': 5.11}



[36m(RayTrainWorker pid=3139)[0m {'eval_loss': 0.8454922437667847, 'eval_accuracy': 0.6785714285714286, 'eval_f1': 0.47391812865497074, 'eval_runtime': 0.1898, 'eval_samples_per_second': 295.018, 'eval_steps_per_second': 5.268, 'epoch': 5.11}
[36m(RayTrainWorker pid=3139)[0m {'train_runtime': 43.1238, 'train_samples_per_second': 267.138, 'train_steps_per_second': 2.087, 'train_loss': 0.7315052032470704, 'epoch': 5.11}


[36m(RayTrainWorker pid=3139)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00000_0_batch_size=128,epochs=6,learning_rate=0.0000,weight_decay=0.0000_2025-07-14_05-29-32/checkpoint_000005)
[36m(SplitCoordinator pid=3255)[0m Registered dataset logger for dataset eval_4_5
[36m(SplitCoordinator pid=3255)[0m Starting execution of Dataset eval_4_5. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=3255)[0m Execution plan of Dataset eval_4_5: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=3255)[0m ✔️  Dataset eval_4_5 execution finished in 0.11 seconds
[36m(SplitCoordinator pid=3255)[0m Registered dataset logger for dataset eval_4_6
[36m(SplitCoordinator pid=3255)[0m Starting execution of Dataset eval_4_6. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=3255)[0m Execution plan of Dataset eval_4_6: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=3255)[0m ✔️  Dataset eval_4_6 execution finished in 0.11 seconds


[36m(RayTrainWorker pid=3139)[0m Final evaluation: {'eval_loss': 0.8454922437667847, 'eval_accuracy': 0.6785714285714286, 'eval_f1': 0.47391812865497074, 'eval_runtime': 0.2054, 'eval_samples_per_second': 272.62, 'eval_steps_per_second': 4.868, 'epoch': 5.111111111111111}


[36m(RayTrainWorker pid=3139)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00000_0_batch_size=128,epochs=6,learning_rate=0.0000,weight_decay=0.0000_2025-07-14_05-29-32/checkpoint_000006)
You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.



Trial TorchTrainer_853b7_00000 completed after 7 iterations at 2025-07-14 05:30:43. Total running time: 1min 11s
+-------------------------------------------------------------+
| Trial TorchTrainer_853b7_00000 result                       |
+-------------------------------------------------------------+
| checkpoint_dir_name                       checkpoint_000006 |
| time_this_iter_s                                    1.56567 |
| time_total_s                                       57.55527 |
| training_iteration                                        7 |
| eval_accuracy                                       0.67857 |
| eval_loss                                           0.84549 |
+-------------------------------------------------------------+


[36m(RayTrainWorker pid=3139)[0m Exception ignored in atexit callback: <function _start_and_connect_service.<locals>.teardown_atexit at 0x7ab1f972ba60>
[36m(RayTrainWorker pid=3139)[0m Traceback (most recent call last):
[36m(RayTrainWorker pid=3139)[0m   File "/usr/local/lib/python3.11/dist-packages/wandb/sdk/lib/service/service_connection.py", line 54, in teardown_atexit
[36m(RayTrainWorker pid=3139)[0m     conn.teardown(hooks.exit_code)
[36m(RayTrainWorker pid=3139)[0m   File "/usr/local/lib/python3.11/dist-packages/wandb/sdk/lib/service/service_connection.py", line 190, in teardown
[36m(RayTrainWorker pid=3139)[0m     self._client.send_server_request(
[36m(RayTrainWorker pid=3139)[0m   File "/usr/local/lib/python3.11/dist-packages/wandb/sdk/lib/sock_client.py", line 150, in send_server_request
[36m(RayTrainWorker pid=3139)[0m     self._send_message(msg)
[36m(RayTrainWorker pid=3139)[0m   File "/usr/local/lib/python3.11/dist-packages/wandb/sdk/lib/sock_client.py", l


Trial TorchTrainer_853b7_00001 started with configuration:
+--------------------------------------------------+
| Trial TorchTrainer_853b7_00001 config            |
+--------------------------------------------------+
| train_loop_config/batch_size                  16 |
| train_loop_config/epochs                       4 |
| train_loop_config/learning_rate           0.0002 |
| train_loop_config/weight_decay                 0 |
+--------------------------------------------------+


[36m(TorchTrainer pid=3817)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=3817)[0m - (node_id=dfac881cdf0b95ec3c65ef7c281e0d5ea5b341dd9a6d97ecd807b1a3, ip=172.28.0.12, pid=3920) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=3920)[0m Setting up process group for: env:// [rank=0, world_size=1]
[36m(RayTrainWorker pid=3920)[0m 2025-07-14 05:30:59.252507: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(RayTrainWorker pid=3920)[0m 2025-07-14 05:30:59.269143: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[36m(RayTrainWorker pid=3920)[0m E0000 00:00:1752471059.290343    3920 cuda_dnn.cc:8310] 

[36m(RayTrainWorker pid=3920)[0m CUDA available: True

Trial status: 1 TERMINATED | 1 RUNNING | 14 PENDING
Current time: 2025-07-14 05:31:04. Total running time: 1min 32s
Logical resource usage: 0/1 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:A100)
Current best trial: 853b7_00000 with eval_loss=0.8454922437667847 and params={'train_loop_config': {'learning_rate': 2e-05, 'epochs': 6, 'batch_size': 128, 'weight_decay': 0.0}}
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                 status         ...oop_config/epochs     ...config/batch_size     ...fig/learning_rate     ...nfig/weight_decay     iter     total time (s) |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| TorchTrainer_853b7_00001   RUNNING                          

[36m(RayTrainWorker pid=3920)[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
[36m(RayTrainWorker pid=3920)[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[36m(RayTrainWorker pid=3920)[0m Beginning training...


[36m(RayTrainWorker pid=3920)[0m wandb: Currently logged in as: nicolepcx to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
[36m(RayTrainWorker pid=3920)[0m wandb: Tracking run with wandb version 0.21.0
[36m(RayTrainWorker pid=3920)[0m wandb: Run data is saved locally in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/artifacts/2025-07-14_05-29-32/tune_transformers/working_dirs/TorchTrainer_853b7_00001_1_batch_size=16,epochs=4,learning_rate=0.0002,weight_decay=0.0000_2025-07-14_05-29-32/wandb/run-20250714_053107-vesyelry
[36m(RayTrainWorker pid=3920)[0m wandb: Run `wandb offline` to turn off syncing.
[36m(RayTrainWorker pid=3920)[0m wandb: Syncing run superglue_cb
[36m(RayTrainWorker pid=3920)[0m wandb: ⭐️ View project at https://wandb.ai/nicolepcx/huggingface
[36m(RayTrainWorker pid=3920)[0m wandb: 🚀 View run at https://wandb.ai/nicolepcx/huggingface/runs/vesyelry
[36m(SplitCoordinator pid=3986)[0m Registered dataset logger for dataset train_5_0
[36m(SplitCoordinator pid=3986)[0m Starting execution of Dataset train_5_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=3986)[0m Execution plan of Dataset train_5_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=3986)[0m ✔️  Dataset train_5_0 execution finished in 0.93 seconds


[36m(RayTrainWorker pid=3920)[0m {'loss': 0.9918, 'grad_norm': 2.8805739879608154, 'learning_rate': 0.00015000000000000001, 'epoch': 0.27}



[36m(RayTrainWorker pid=3920)[0m {'eval_loss': 1.0110266208648682, 'eval_accuracy': 0.5, 'eval_f1': 0.2222222222222222, 'eval_runtime': 0.2131, 'eval_samples_per_second': 262.767, 'eval_steps_per_second': 18.769, 'epoch': 0.27}


You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.



Trial TorchTrainer_853b7_00001 completed after 1 iterations at 2025-07-14 05:31:14. Total running time: 1min 41s
+-------------------------------------------------------------+
| Trial TorchTrainer_853b7_00001 result                       |
+-------------------------------------------------------------+
| checkpoint_dir_name                       checkpoint_000000 |
| time_this_iter_s                                   20.13487 |
| time_total_s                                       20.13487 |
| training_iteration                                        1 |
| epoch                                               0.26667 |
| eval_accuracy                                           0.5 |
| eval_f1                                             0.22222 |
| eval_loss                                           1.01103 |
| eval_runtime                                         0.2131 |
| eval_samples_per_second                             262.767 |
| eval_steps_per_second                               

[36m(RayTrainWorker pid=3920)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00001_1_batch_size=16,epochs=4,learning_rate=0.0002,weight_decay=0.0000_2025-07-14_05-29-32/checkpoint_000000)
[36m(SplitCoordinator pid=4036)[0m Registered dataset logger for dataset eval_6_0
[36m(SplitCoordinator pid=4036)[0m Starting execution of Dataset eval_6_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=4036)[0m Execution plan of Dataset eval_6_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=4036)[0m ✔️  Dataset eval_6_0 execution finished in 0.12 seconds
[36m(TrainTrainable pid=4259)[0m 2025-07-14 05:31:20.233033: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To t


Trial TorchTrainer_853b7_00002 started with configuration:
+-------------------------------------------------+
| Trial TorchTrainer_853b7_00002 config           |
+-------------------------------------------------+
| train_loop_config/batch_size                 16 |
| train_loop_config/epochs                      4 |
| train_loop_config/learning_rate           0.002 |
| train_loop_config/weight_decay                0 |
+-------------------------------------------------+


[36m(RayTrainWorker pid=4361)[0m Setting up process group for: env:// [rank=0, world_size=1]
[36m(TorchTrainer pid=4259)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=4259)[0m - (node_id=dfac881cdf0b95ec3c65ef7c281e0d5ea5b341dd9a6d97ecd807b1a3, ip=172.28.0.12, pid=4361) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=4361)[0m 2025-07-14 05:31:30.176592: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(RayTrainWorker pid=4361)[0m 2025-07-14 05:31:30.193180: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[36m(RayTrainWorker pid=4361)[0m E0000 00:00:1752471090.214561    4361 cuda_dnn.cc:8310] 


Trial status: 2 TERMINATED | 1 RUNNING | 13 PENDING
Current time: 2025-07-14 05:31:34. Total running time: 2min 2s
Logical resource usage: 0/1 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:A100)
Current best trial: 853b7_00000 with eval_loss=0.8454922437667847 and params={'train_loop_config': {'learning_rate': 2e-05, 'epochs': 6, 'batch_size': 128, 'weight_decay': 0.0}}
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                 status         ...oop_config/epochs     ...config/batch_size     ...fig/learning_rate     ...nfig/weight_decay     iter     total time (s)     loss     grad_norm     learning_rate      epoch |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

[36m(RayTrainWorker pid=4361)[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
[36m(RayTrainWorker pid=4361)[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[36m(RayTrainWorker pid=4361)[0m Beginning training...


[36m(RayTrainWorker pid=4361)[0m wandb: Currently logged in as: nicolepcx to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
[36m(RayTrainWorker pid=4361)[0m wandb: Tracking run with wandb version 0.21.0
[36m(RayTrainWorker pid=4361)[0m wandb: Run data is saved locally in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/artifacts/2025-07-14_05-29-32/tune_transformers/working_dirs/TorchTrainer_853b7_00002_2_batch_size=16,epochs=4,learning_rate=0.0020,weight_decay=0.0000_2025-07-14_05-29-32/wandb/run-20250714_053137-yyp4u9j0
[36m(RayTrainWorker pid=4361)[0m wandb: Run `wandb offline` to turn off syncing.
[36m(RayTrainWorker pid=4361)[0m wandb: Syncing run superglue_cb
[36m(RayTrainWorker pid=4361)[0m wandb: ⭐️ View project at https://wandb.ai/nicolepcx/huggingface
[36m(RayTrainWorker pid=4361)[0m wandb: 🚀 View run at https://wandb.ai/nicolepcx/huggingface/runs/yyp4u9j0
[36m(SplitCoordinator pid=4427)[0m Registered 

[2m[36m(pid=4427) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=4427) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=4427)[0m ✔️  Dataset train_7_0 execution finished in 0.94 seconds


[36m(RayTrainWorker pid=4361)[0m {'loss': 1.5553, 'grad_norm': 14.364967346191406, 'learning_rate': 0.0015, 'epoch': 0.27}


[36m(RayTrainWorker pid=4361)[0m {'eval_loss': 1.4201806783676147, 'eval_accuracy': 0.5, 'eval_f1': 0.2222222222222222, 'eval_runtime': 0.2071, 'eval_samples_per_second': 270.416, 'eval_steps_per_second': 19.315, 'epoch': 0.27}


[36m(RayTrainWorker pid=4361)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00002_2_batch_size=16,epochs=4,learning_rate=0.0020,weight_decay=0.0000_2025-07-14_05-29-32/checkpoint_000000)
[36m(SplitCoordinator pid=4477)[0m Registered dataset logger for dataset eval_8_0
[36m(SplitCoordinator pid=4477)[0m Starting execution of Dataset eval_8_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=4477)[0m Execution plan of Dataset eval_8_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]



Trial TorchTrainer_853b7_00002 completed after 1 iterations at 2025-07-14 05:31:44. Total running time: 2min 12s
+-------------------------------------------------------------+
| Trial TorchTrainer_853b7_00002 result                       |
+-------------------------------------------------------------+
| checkpoint_dir_name                       checkpoint_000000 |
| time_this_iter_s                                    19.5109 |
| time_total_s                                        19.5109 |
| training_iteration                                        1 |
| epoch                                               0.26667 |
| eval_accuracy                                           0.5 |
| eval_f1                                             0.22222 |
| eval_loss                                           1.42018 |
| eval_runtime                                         0.2071 |
| eval_samples_per_second                             270.416 |
| eval_steps_per_second                               

You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.
[36m(TrainTrainable pid=4700)[0m 2025-07-14 05:31:51.261017: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(TrainTrainable pid=4700)[0m 2025-07-14 05:31:51.277258: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[36m(TrainTrainable pid=4700)[0m E0000 00:00:1752471111.298123    4700 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
[36m(TrainTrainable pid=4700)[0m E0000 00:00:1752471111.304456    4700 cuda_blas.cc:1418] U


Trial TorchTrainer_853b7_00003 started with configuration:
+------------------------------------------------+
| Trial TorchTrainer_853b7_00003 config          |
+------------------------------------------------+
| train_loop_config/batch_size                16 |
| train_loop_config/epochs                     6 |
| train_loop_config/learning_rate           0.02 |
| train_loop_config/weight_decay               0 |
+------------------------------------------------+


[36m(TorchTrainer pid=4700)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=4700)[0m - (node_id=dfac881cdf0b95ec3c65ef7c281e0d5ea5b341dd9a6d97ecd807b1a3, ip=172.28.0.12, pid=4802) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=4802)[0m Setting up process group for: env:// [rank=0, world_size=1]
[36m(RayTrainWorker pid=4802)[0m 2025-07-14 05:32:01.147977: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(RayTrainWorker pid=4802)[0m 2025-07-14 05:32:01.164406: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[36m(RayTrainWorker pid=4802)[0m E0000 00:00:1752471121.185627    4802 cuda_dnn.cc:8310] 


Trial status: 3 TERMINATED | 1 RUNNING | 12 PENDING
Current time: 2025-07-14 05:32:04. Total running time: 2min 32s
Logical resource usage: 0/1 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:A100)
Current best trial: 853b7_00000 with eval_loss=0.8454922437667847 and params={'train_loop_config': {'learning_rate': 2e-05, 'epochs': 6, 'batch_size': 128, 'weight_decay': 0.0}}
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                 status         ...oop_config/epochs     ...config/batch_size     ...fig/learning_rate     ...nfig/weight_decay     iter     total time (s)     loss     grad_norm     learning_rate      epoch |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

[36m(RayTrainWorker pid=4802)[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
[36m(RayTrainWorker pid=4802)[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[36m(RayTrainWorker pid=4802)[0m Beginning training...


[36m(RayTrainWorker pid=4802)[0m wandb: Currently logged in as: nicolepcx to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
[36m(RayTrainWorker pid=4802)[0m wandb: Tracking run with wandb version 0.21.0
[36m(RayTrainWorker pid=4802)[0m wandb: Run data is saved locally in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/artifacts/2025-07-14_05-29-32/tune_transformers/working_dirs/TorchTrainer_853b7_00003_3_batch_size=16,epochs=6,learning_rate=0.0200,weight_decay=0.0000_2025-07-14_05-29-32/wandb/run-20250714_053208-kcbusp3a
[36m(RayTrainWorker pid=4802)[0m wandb: Run `wandb offline` to turn off syncing.
[36m(RayTrainWorker pid=4802)[0m wandb: Syncing run superglue_cb
[36m(RayTrainWorker pid=4802)[0m wandb: ⭐️ View project at https://wandb.ai/nicolepcx/huggingface
[36m(RayTrainWorker pid=4802)[0m wandb: 🚀 View run at https://wandb.ai/nicolepcx/huggingface/runs/kcbusp3a
[36m(SplitCoordinator pid=4868)[0m Registered dataset logger for dataset train_9_0
[36m(SplitCoordinator pid=4868)[0m Starting execution of Dataset train_9_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=4868)[0m Execution plan of Dataset train_9_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=4868)[0m ✔️  Dataset train_9_0 execution finished in 0.96 seconds


[36m(RayTrainWorker pid=4802)[0m {'loss': 21.2988, 'grad_norm': 22.1986083984375, 'learning_rate': 0.016666666666666666, 'epoch': 0.18}


[36m(RayTrainWorker pid=4802)[0m {'eval_loss': 3.695937156677246, 'eval_accuracy': 0.5, 'eval_f1': 0.2222222222222222, 'eval_runtime': 0.2126, 'eval_samples_per_second': 263.408, 'eval_steps_per_second': 18.815, 'epoch': 0.18}





Trial TorchTrainer_853b7_00003 completed after 1 iterations at 2025-07-14 05:32:15. Total running time: 2min 43s
+-------------------------------------------------------------+
| Trial TorchTrainer_853b7_00003 result                       |
+-------------------------------------------------------------+
| checkpoint_dir_name                       checkpoint_000000 |
| time_this_iter_s                                   19.39422 |
| time_total_s                                       19.39422 |
| training_iteration                                        1 |
| epoch                                               0.17778 |
| eval_accuracy                                           0.5 |
| eval_f1                                             0.22222 |
| eval_loss                                           3.69594 |
| eval_runtime                                         0.2126 |
| eval_samples_per_second                             263.408 |
| eval_steps_per_second                               

[36m(RayTrainWorker pid=4802)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00003_3_batch_size=16,epochs=6,learning_rate=0.0200,weight_decay=0.0000_2025-07-14_05-29-32/checkpoint_000000)
[36m(SplitCoordinator pid=4918)[0m Registered dataset logger for dataset eval_10_0
[36m(SplitCoordinator pid=4918)[0m Starting execution of Dataset eval_10_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=4918)[0m Execution plan of Dataset eval_10_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.
[36m(TrainTrainable pid=5145)[0m 2025-07-14 05:32:23.145958: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from d


Trial TorchTrainer_853b7_00004 started with configuration:
+-------------------------------------------------+
| Trial TorchTrainer_853b7_00004 config           |
+-------------------------------------------------+
| train_loop_config/batch_size                128 |
| train_loop_config/epochs                      2 |
| train_loop_config/learning_rate           2e-05 |
| train_loop_config/weight_decay             0.01 |
+-------------------------------------------------+


[36m(RayTrainWorker pid=5246)[0m Setting up process group for: env:// [rank=0, world_size=1]
[36m(TorchTrainer pid=5145)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=5145)[0m - (node_id=dfac881cdf0b95ec3c65ef7c281e0d5ea5b341dd9a6d97ecd807b1a3, ip=172.28.0.12, pid=5246) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=5246)[0m 2025-07-14 05:32:33.132613: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(RayTrainWorker pid=5246)[0m 2025-07-14 05:32:33.148839: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[36m(RayTrainWorker pid=5246)[0m E0000 00:00:1752471153.169459    5246 cuda_dnn.cc:8310] 


Trial status: 4 TERMINATED | 1 RUNNING | 11 PENDING
Current time: 2025-07-14 05:32:35. Total running time: 3min 2s
Logical resource usage: 0/1 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:A100)
Current best trial: 853b7_00000 with eval_loss=0.8454922437667847 and params={'train_loop_config': {'learning_rate': 2e-05, 'epochs': 6, 'batch_size': 128, 'weight_decay': 0.0}}
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                 status         ...oop_config/epochs     ...config/batch_size     ...fig/learning_rate     ...nfig/weight_decay     iter     total time (s)      loss     grad_norm     learning_rate      epoch |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

[36m(RayTrainWorker pid=5246)[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
[36m(RayTrainWorker pid=5246)[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[36m(RayTrainWorker pid=5246)[0m Beginning training...


[36m(RayTrainWorker pid=5246)[0m wandb: Currently logged in as: nicolepcx to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
[36m(RayTrainWorker pid=5246)[0m wandb: Tracking run with wandb version 0.21.0
[36m(RayTrainWorker pid=5246)[0m wandb: Run data is saved locally in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/artifacts/2025-07-14_05-29-32/tune_transformers/working_dirs/TorchTrainer_853b7_00004_4_batch_size=128,epochs=2,learning_rate=0.0000,weight_decay=0.0100_2025-07-14_05-29-32/wandb/run-20250714_053240-k9o0rnlm
[36m(RayTrainWorker pid=5246)[0m wandb: Run `wandb offline` to turn off syncing.
[36m(RayTrainWorker pid=5246)[0m wandb: Syncing run superglue_cb
[36m(RayTrainWorker pid=5246)[0m wandb: ⭐️ View project at https://wandb.ai/nicolepcx/huggingface
[36m(RayTrainWorker pid=5246)[0m wandb: 🚀 View run at https://wandb.ai/nicolepcx/huggingface/runs/k9o0rnlm
  0%|          | 0/30 [00:00<?, ?it/s]
[36m(SplitCoordinator pid=5312)[0m Registered

[2m[36m(pid=5312) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=5312) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

  3%|▎         | 1/30 [00:00<00:16,  1.79it/s]
 10%|█         | 3/30 [00:00<00:05,  5.08it/s]
 17%|█▋        | 5/30 [00:00<00:03,  8.00it/s]
 23%|██▎       | 7/30 [00:00<00:02, 10.42it/s]
[36m(SplitCoordinator pid=5312)[0m ✔️  Dataset train_11_0 execution finished in 0.93 seconds
 30%|███       | 9/30 [00:01<00:01, 11.69it/s]
 37%|███▋      | 11/30 [00:01<00:01, 11.39it/s]
 43%|████▎     | 13/30 [00:01<00:01, 12.28it/s]


[36m(RayTrainWorker pid=5246)[0m {'loss': 1.0686, 'grad_norm': 2.408996343612671, 'learning_rate': 1e-05, 'epoch': 0.53}


[36m(RayTrainWorker pid=5246)[0m  50%|█████     | 15/30 [00:01<00:01, 13.55it/s]                                                53%|█████▎    | 16/30 [00:01<00:01, 13.55it/s]


[2m[36m(pid=5362) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=5362) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]


[36m(RayTrainWorker pid=5246)[0m {'eval_loss': 0.9569039344787598, 'eval_accuracy': 0.42857142857142855, 'eval_f1': 0.21956970232832304, 'eval_runtime': 0.2087, 'eval_samples_per_second': 268.341, 'eval_steps_per_second': 4.792, 'epoch': 0.53}


[36m(RayTrainWorker pid=5246)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00004_4_batch_size=128,epochs=2,learning_rate=0.0000,weight_decay=0.0100_2025-07-14_05-29-32/checkpoint_000000)
[36m(SplitCoordinator pid=5362)[0m Registered dataset logger for dataset eval_12_0
[36m(SplitCoordinator pid=5362)[0m Starting execution of Dataset eval_12_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=5362)[0m Execution plan of Dataset eval_12_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=5362)[0m ✔️  Dataset eval_12_0 execution finished in 0.11 seconds
You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.
[36m(SplitCoordinator pid=5312)[0m Registered dataset logger for dataset train_11_1
[36m(SplitCoordinator pid=5312)[0m Starting

[2m[36m(pid=5312) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=5312) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=5246)[0m  57%|█████▋    | 17/30 [00:06<00:10,  1.24it/s]
 63%|██████▎   | 19/30 [00:06<00:06,  1.73it/s]
 70%|███████   | 21/30 [00:06<00:03,  2.39it/s]
 77%|███████▋  | 23/30 [00:06<00:02,  3.25it/s]
[36m(SplitCoordinator pid=5312)[0m ✔️  Dataset train_11_1 execution finished in 0.56 seconds
 83%|████████▎ | 25/30 [00:06<00:01,  4.25it/s]
 90%|█████████ | 27/30 [00:06<00:00,  5.33it/s]
 97%|█████████▋| 29/30 [00:07<00:00,  6.57it/s]


[36m(RayTrainWorker pid=5246)[0m {'loss': 0.9009, 'grad_norm': 3.373149871826172, 'learning_rate': 6.666666666666667e-07, 'epoch': 1.47}


[36m(RayTrainWorker pid=5246)[0m                                                100%|██████████| 30/30 [00:07<00:00,  6.57it/s]


[2m[36m(pid=5362) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=5362) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]


[36m(RayTrainWorker pid=5246)[0m {'eval_loss': 0.9111846685409546, 'eval_accuracy': 0.5892857142857143, 'eval_f1': 0.39552238805970147, 'eval_runtime': 0.185, 'eval_samples_per_second': 302.643, 'eval_steps_per_second': 5.404, 'epoch': 1.47}
[36m(RayTrainWorker pid=5246)[0m {'train_runtime': 14.2063, 'train_samples_per_second': 270.302, 'train_steps_per_second': 2.112, 'train_loss': 0.9903110186258952, 'epoch': 1.47}


[36m(RayTrainWorker pid=5246)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00004_4_batch_size=128,epochs=2,learning_rate=0.0000,weight_decay=0.0100_2025-07-14_05-29-32/checkpoint_000001)
[36m(SplitCoordinator pid=5362)[0m Registered dataset logger for dataset eval_12_1
[36m(SplitCoordinator pid=5362)[0m Starting execution of Dataset eval_12_1. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=5362)[0m Execution plan of Dataset eval_12_1: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=5362)[0m ✔️  Dataset eval_12_1 execution finished in 0.11 seconds
[36m(RayTrainWorker pid=5246)[0m                                                100%|██████████| 30/30 [00:13<00:00,  6.57it/s]100%|██████████| 30/30 [00:13<00:00,  2.27it/s]


[2m[36m(pid=5362) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=5362) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=5362)[0m Registered dataset logger for dataset eval_12_2
[36m(SplitCoordinator pid=5362)[0m Starting execution of Dataset eval_12_2. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=5362)[0m Execution plan of Dataset eval_12_2: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]


Trial TorchTrainer_853b7_00004 completed after 2 iterations at 2025-07-14 05:32:54. Total running time: 3min 22s
+-------------------------------------------------------------+
| Trial TorchTrainer_853b7_00004 result                       |
+-------------------------------------------------------------+
| checkpoint_dir_name                       checkpoint_000001 |
| time_this_iter_s                                    7.09243 |
| time_total_s                                       26.99847 |
| training_iteration                                        2 |
| epoch                                               1.46667 |
| eval_accuracy                                       0.58929 |
| eval_f1                                             0.39552 |
| eval_loss                                           0.91118 |
| eval_runtime                                          0.185 |
| eval_samples_per_second                             302.643 |
| eval_steps_per_second                                

[36m(TrainTrainable pid=5635)[0m 2025-07-14 05:33:01.344625: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(TrainTrainable pid=5635)[0m 2025-07-14 05:33:01.361257: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[36m(TrainTrainable pid=5635)[0m E0000 00:00:1752471181.383211    5635 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
[36m(TrainTrainable pid=5635)[0m E0000 00:00:1752471181.389691    5635 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[3


Trial status: 5 TERMINATED | 11 PENDING
Current time: 2025-07-14 05:33:05. Total running time: 3min 33s
Logical resource usage: 0/1 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:A100)
Current best trial: 853b7_00000 with eval_loss=0.8454922437667847 and params={'train_loop_config': {'learning_rate': 2e-05, 'epochs': 6, 'batch_size': 128, 'weight_decay': 0.0}}
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                 status         ...oop_config/epochs     ...config/batch_size     ...fig/learning_rate     ...nfig/weight_decay     iter     total time (s)      loss     grad_norm     learning_rate      epoch |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

[36m(RayTrainWorker pid=5735)[0m Setting up process group for: env:// [rank=0, world_size=1]
[36m(TorchTrainer pid=5635)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=5635)[0m - (node_id=dfac881cdf0b95ec3c65ef7c281e0d5ea5b341dd9a6d97ecd807b1a3, ip=172.28.0.12, pid=5735) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=5735)[0m 2025-07-14 05:33:11.343223: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(RayTrainWorker pid=5735)[0m 2025-07-14 05:33:11.359660: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[36m(RayTrainWorker pid=5735)[0m E0000 00:00:1752471191.380485    5735 cuda_dnn.cc:8310] 

[36m(RayTrainWorker pid=5735)[0m CUDA available: True


[36m(RayTrainWorker pid=5735)[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
[36m(RayTrainWorker pid=5735)[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[36m(RayTrainWorker pid=5735)[0m Beginning training...


[36m(RayTrainWorker pid=5735)[0m wandb: Currently logged in as: nicolepcx to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
[36m(RayTrainWorker pid=5735)[0m wandb: Tracking run with wandb version 0.21.0
[36m(RayTrainWorker pid=5735)[0m wandb: Run data is saved locally in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/artifacts/2025-07-14_05-29-32/tune_transformers/working_dirs/TorchTrainer_853b7_00005_5_batch_size=32,epochs=4,learning_rate=0.0002,weight_decay=0.0100_2025-07-14_05-29-32/wandb/run-20250714_053318-7858brom
[36m(RayTrainWorker pid=5735)[0m wandb: Run `wandb offline` to turn off syncing.
[36m(RayTrainWorker pid=5735)[0m wandb: Syncing run superglue_cb
[36m(RayTrainWorker pid=5735)[0m wandb: ⭐️ View project at https://wandb.ai/nicolepcx/huggingface
[36m(RayTrainWorker pid=5735)[0m wandb: 🚀 View run at https://wandb.ai/nicolepcx/huggingface/runs/7858brom
  0%|          | 0/60 [00:00<?, ?it/s]


[2m[36m(pid=5801) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=5801) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=5801)[0m Registered dataset logger for dataset train_13_0
[36m(SplitCoordinator pid=5801)[0m Starting execution of Dataset train_13_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=5801)[0m Execution plan of Dataset train_13_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
  2%|▏         | 1/60 [00:00<00:33,  1.76it/s]
  5%|▌         | 3/60 [00:00<00:11,  4.94it/s]
  8%|▊         | 5/60 [00:00<00:07,  7.57it/s]
 12%|█▏        | 7/60 [00:00<00:05, 10.00it/s]
[36m(SplitCoordinator pid=5801)[0m ✔️  Dataset train_13_0 execution finished in 0.97 seconds
 15%|█▌        | 9/60 [00:01<00:04, 11.31it/s]
 18%|█▊        | 11/60 [00:01<00:04, 11.09it/s]
 22%|██▏       | 13/60 [00:01<00:03, 12.03it/s]


[36m(RayTrainWorker pid=5735)[0m {'loss': 0.9997, 'grad_norm': 5.086143970489502, 'learning_rate': 0.00015000000000000001, 'epoch': 0.27}


[36m(RayTrainWorker pid=5735)[0m  25%|██▌       | 15/60 [00:01<00:03, 13.35it/s]
[36m(RayTrainWorker pid=5735)[0m                                                 27%|██▋       | 16/60 [00:01<00:03, 13.35it/s]


[2m[36m(pid=5851) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=5851) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=5735)[0m {'eval_loss': 1.0697396993637085, 'eval_accuracy': 0.5, 'eval_f1': 0.2222222222222222, 'eval_runtime': 0.2119, 'eval_samples_per_second': 264.221, 'eval_steps_per_second': 9.436, 'epoch': 0.27}


Trial TorchTrainer_853b7_00005 completed after 1 iterations at 2025-07-14 05:33:25. Total running time: 3min 53s
+-------------------------------------------------------------+
| Trial TorchTrainer_853b7_00005 result                       |
+-------------------------------------------------------------+
| checkpoint_dir_name                       checkpoint_000000 |
| time_this_iter_s                                   19.22991 |
| time_total_s                                       19.22991 |
| training_iteration                                        1 |
| epoch                                               0.26667 |
| eval_accuracy                                           0.5 |
| eval_f1                                             0.22222 |
| eval_loss                                           1.06974 |
| eval_runtime                                         0.2119 |
| eval_samples_per_second                             264.221 |
| eval_steps_per_second                               

[36m(RayTrainWorker pid=5735)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00005_5_batch_size=32,epochs=4,learning_rate=0.0002,weight_decay=0.0100_2025-07-14_05-29-32/checkpoint_000000)
[36m(SplitCoordinator pid=5851)[0m Registered dataset logger for dataset eval_14_0
[36m(SplitCoordinator pid=5851)[0m Starting execution of Dataset eval_14_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=5851)[0m Execution plan of Dataset eval_14_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.
[36m(TrainTrainable pid=6080)[0m 2025-07-14 05:33:33.290642: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from d


Trial status: 6 TERMINATED | 10 PENDING
Current time: 2025-07-14 05:33:35. Total running time: 4min 3s
Logical resource usage: 0/1 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:A100)
Current best trial: 853b7_00000 with eval_loss=0.8454922437667847 and params={'train_loop_config': {'learning_rate': 2e-05, 'epochs': 6, 'batch_size': 128, 'weight_decay': 0.0}}
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                 status         ...oop_config/epochs     ...config/batch_size     ...fig/learning_rate     ...nfig/weight_decay     iter     total time (s)      loss     grad_norm     learning_rate      epoch |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

[36m(TorchTrainer pid=6080)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=6080)[0m - (node_id=dfac881cdf0b95ec3c65ef7c281e0d5ea5b341dd9a6d97ecd807b1a3, ip=172.28.0.12, pid=6179) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=6179)[0m Setting up process group for: env:// [rank=0, world_size=1]
[36m(RayTrainWorker pid=6179)[0m 2025-07-14 05:33:43.149231: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(RayTrainWorker pid=6179)[0m 2025-07-14 05:33:43.165644: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[36m(RayTrainWorker pid=6179)[0m E0000 00:00:1752471223.186556    6179 cuda_dnn.cc:8310] 

[36m(RayTrainWorker pid=6179)[0m CUDA available: True


[36m(RayTrainWorker pid=6179)[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
[36m(RayTrainWorker pid=6179)[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[36m(RayTrainWorker pid=6179)[0m Beginning training...


[36m(RayTrainWorker pid=6179)[0m wandb: Currently logged in as: nicolepcx to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
[36m(RayTrainWorker pid=6179)[0m wandb: Tracking run with wandb version 0.21.0
[36m(RayTrainWorker pid=6179)[0m wandb: Run data is saved locally in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/artifacts/2025-07-14_05-29-32/tune_transformers/working_dirs/TorchTrainer_853b7_00006_6_batch_size=16,epochs=8,learning_rate=0.0020,weight_decay=0.0100_2025-07-14_05-29-32/wandb/run-20250714_053350-b7u2drcv
[36m(RayTrainWorker pid=6179)[0m wandb: Run `wandb offline` to turn off syncing.
[36m(RayTrainWorker pid=6179)[0m wandb: Syncing run superglue_cb
[36m(RayTrainWorker pid=6179)[0m wandb: ⭐️ View project at https://wandb.ai/nicolepcx/huggingface
[36m(RayTrainWorker pid=6179)[0m wandb: 🚀 View run at https://wandb.ai/nicolepcx/huggingface/runs/b7u2drcv
  0%|          | 0/120 [00:00<?, ?it/s]


[2m[36m(pid=6245) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=6245) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=6245)[0m Registered dataset logger for dataset train_15_0
[36m(SplitCoordinator pid=6245)[0m Starting execution of Dataset train_15_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=6245)[0m Execution plan of Dataset train_15_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
  1%|          | 1/120 [00:00<01:05,  1.80it/s]
  2%|▎         | 3/120 [00:00<00:22,  5.16it/s]
  4%|▍         | 5/120 [00:00<00:14,  8.03it/s]
  6%|▌         | 7/120 [00:00<00:10, 10.33it/s]
[36m(SplitCoordinator pid=6245)[0m ✔️  Dataset train_15_0 execution finished in 0.93 seconds
  8%|▊         | 9/120 [00:01<00:09, 11.58it/s]
  9%|▉         | 11/120 [00:01<00:09, 10.97it/s]


[36m(RayTrainWorker pid=6179)[0m {'loss': 2.0052, 'grad_norm': 24.031822204589844, 'learning_rate': 0.00175, 'epoch': 0.13}


[36m(RayTrainWorker pid=6179)[0m  11%|█         | 13/120 [00:01<00:08, 11.93it/s]
[36m(RayTrainWorker pid=6179)[0m  12%|█▎        | 15/120 [00:01<00:07, 13.26it/s]                                                 13%|█▎        | 16/120 [00:01<00:07, 13.26it/s]


[2m[36m(pid=6295) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=6295) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]


[36m(RayTrainWorker pid=6179)[0m {'eval_loss': 0.999843955039978, 'eval_accuracy': 0.4107142857142857, 'eval_f1': 0.1940928270042194, 'eval_runtime': 0.2189, 'eval_samples_per_second': 255.873, 'eval_steps_per_second': 18.277, 'epoch': 0.13}


[36m(RayTrainWorker pid=6179)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00006_6_batch_size=16,epochs=8,learning_rate=0.0020,weight_decay=0.0100_2025-07-14_05-29-32/checkpoint_000000)
[36m(SplitCoordinator pid=6295)[0m Registered dataset logger for dataset eval_16_0
[36m(SplitCoordinator pid=6295)[0m Starting execution of Dataset eval_16_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=6295)[0m Execution plan of Dataset eval_16_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]


[2m[36m(pid=6245) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=6245) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=6245)[0m Registered dataset logger for dataset train_15_1
[36m(SplitCoordinator pid=6245)[0m Starting execution of Dataset train_15_1. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=6245)[0m Execution plan of Dataset train_15_1: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
 14%|█▍        | 17/120 [00:05<01:14,  1.37it/s]
 16%|█▌        | 19/120 [00:05<00:52,  1.92it/s]
[36m(SplitCoordinator pid=6295)[0m ✔️  Dataset eval_16_0 execution finished in 0.12 seconds
 18%|█▊        | 21/120 [00:06<00:37,  2.63it/s]
 19%|█▉        | 23/120 [00:06<00:27,  3.54it/s]
[36m(SplitCoordinator pid=6245)[0m ✔️  Dataset train_15_1 execution finished in 0.55 seconds
 21%|██        | 25/120 [00:06<00:20,  4.58it/s]
 22%|██▎       | 27/120 [00:06<00:16,  5.70it/s]
 24%|██▍       | 29/120 [00:06<00:13,  6.98it/s]
 26%|██▌       | 31/120 [00:06<00:10,  8.51it/s]


[36m(RayTrainWorker pid=6179)[0m {'loss': 1.2263, 'grad_norm': 10.16295337677002, 'learning_rate': 0.0014833333333333335, 'epoch': 1.13}


[36m(RayTrainWorker pid=6179)[0m                                                  27%|██▋       | 32/120 [00:06<00:10,  8.51it/s]


[2m[36m(pid=6295) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=6295) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]


[36m(RayTrainWorker pid=6179)[0m {'eval_loss': 1.4689449071884155, 'eval_accuracy': 0.5, 'eval_f1': 0.2222222222222222, 'eval_runtime': 0.1913, 'eval_samples_per_second': 292.723, 'eval_steps_per_second': 20.909, 'epoch': 1.13}

Trial TorchTrainer_853b7_00006 completed after 2 iterations at 2025-07-14 05:34:04. Total running time: 4min 32s
+-------------------------------------------------------------+
| Trial TorchTrainer_853b7_00006 result                       |
+-------------------------------------------------------------+
| checkpoint_dir_name                       checkpoint_000001 |
| time_this_iter_s                                    7.79086 |
| time_total_s                                         26.962 |
| training_iteration                                        2 |
| epoch                                               1.13333 |
| eval_accuracy                                           0.5 |
| eval_f1                                             0.22222 |
| eval_loss     

[36m(RayTrainWorker pid=6179)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00006_6_batch_size=16,epochs=8,learning_rate=0.0020,weight_decay=0.0100_2025-07-14_05-29-32/checkpoint_000001)
[36m(SplitCoordinator pid=6295)[0m Registered dataset logger for dataset eval_16_1
[36m(SplitCoordinator pid=6295)[0m Starting execution of Dataset eval_16_1. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=6295)[0m Execution plan of Dataset eval_16_1: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=6295)[0m ✔️  Dataset eval_16_1 execution finished in 0.11 seconds



Trial status: 7 TERMINATED | 9 PENDING
Current time: 2025-07-14 05:34:05. Total running time: 4min 33s
Logical resource usage: 0/1 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:A100)
Current best trial: 853b7_00000 with eval_loss=0.8454922437667847 and params={'train_loop_config': {'learning_rate': 2e-05, 'epochs': 6, 'batch_size': 128, 'weight_decay': 0.0}}
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                 status         ...oop_config/epochs     ...config/batch_size     ...fig/learning_rate     ...nfig/weight_decay     iter     total time (s)      loss     grad_norm     learning_rate      epoch |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

[36m(TrainTrainable pid=6559)[0m 2025-07-14 05:34:11.315784: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(TrainTrainable pid=6559)[0m 2025-07-14 05:34:11.332249: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[36m(TrainTrainable pid=6559)[0m E0000 00:00:1752471251.355209    6559 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
[36m(TrainTrainable pid=6559)[0m E0000 00:00:1752471251.361696    6559 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[3


Trial TorchTrainer_853b7_00007 started with configuration:
+------------------------------------------------+
| Trial TorchTrainer_853b7_00007 config          |
+------------------------------------------------+
| train_loop_config/batch_size               128 |
| train_loop_config/epochs                     8 |
| train_loop_config/learning_rate           0.02 |
| train_loop_config/weight_decay            0.01 |
+------------------------------------------------+


[36m(RayTrainWorker pid=6659)[0m Setting up process group for: env:// [rank=0, world_size=1]
[36m(TorchTrainer pid=6559)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=6559)[0m - (node_id=dfac881cdf0b95ec3c65ef7c281e0d5ea5b341dd9a6d97ecd807b1a3, ip=172.28.0.12, pid=6659) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=6659)[0m 2025-07-14 05:34:21.248197: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(RayTrainWorker pid=6659)[0m 2025-07-14 05:34:21.265184: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[36m(RayTrainWorker pid=6659)[0m E0000 00:00:1752471261.286248    6659 cuda_dnn.cc:8310] 

[36m(RayTrainWorker pid=6659)[0m CUDA available: True


[36m(RayTrainWorker pid=6659)[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
[36m(RayTrainWorker pid=6659)[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[36m(RayTrainWorker pid=6659)[0m Beginning training...


[36m(RayTrainWorker pid=6659)[0m wandb: Currently logged in as: nicolepcx to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
[36m(RayTrainWorker pid=6659)[0m wandb: Tracking run with wandb version 0.21.0
[36m(RayTrainWorker pid=6659)[0m wandb: Run data is saved locally in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/artifacts/2025-07-14_05-29-32/tune_transformers/working_dirs/TorchTrainer_853b7_00007_7_batch_size=128,epochs=8,learning_rate=0.0200,weight_decay=0.0100_2025-07-14_05-29-32/wandb/run-20250714_053428-5lcg995z
[36m(RayTrainWorker pid=6659)[0m wandb: Run `wandb offline` to turn off syncing.
[36m(RayTrainWorker pid=6659)[0m wandb: Syncing run superglue_cb
[36m(RayTrainWorker pid=6659)[0m wandb: ⭐️ View project at https://wandb.ai/nicolepcx/huggingface
[36m(RayTrainWorker pid=6659)[0m wandb: 🚀 View run at https://wandb.ai/nicolepcx/huggingface/runs/5lcg995z
  0%|          | 0/120 [00:00<?, ?it/s]
[36m(SplitCoordinator pid=6727)[0m Registere

[2m[36m(pid=6727) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=6727) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

  1%|          | 1/120 [00:00<01:06,  1.80it/s]
  2%|▎         | 3/120 [00:00<00:22,  5.11it/s]
  4%|▍         | 5/120 [00:00<00:14,  8.02it/s]
  6%|▌         | 7/120 [00:00<00:10, 10.40it/s]
[36m(SplitCoordinator pid=6727)[0m ✔️  Dataset train_17_0 execution finished in 0.93 seconds
  8%|▊         | 9/120 [00:01<00:09, 11.68it/s]
  9%|▉         | 11/120 [00:01<00:09, 11.32it/s]


[36m(RayTrainWorker pid=6659)[0m {'loss': 12.3149, 'grad_norm': 47.12200164794922, 'learning_rate': 0.0175, 'epoch': 0.13}


[36m(RayTrainWorker pid=6659)[0m  11%|█         | 13/120 [00:01<00:08, 12.22it/s]
[36m(RayTrainWorker pid=6659)[0m  12%|█▎        | 15/120 [00:01<00:07, 13.48it/s]                                                 13%|█▎        | 16/120 [00:01<00:07, 13.48it/s]


[2m[36m(pid=6777) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=6777) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=6659)[0m                                                  13%|█▎        | 16/120 [00:01<00:07, 13.48it/s]


[36m(RayTrainWorker pid=6659)[0m {'eval_loss': 3.042269229888916, 'eval_accuracy': 0.5, 'eval_f1': 0.2222222222222222, 'eval_runtime': 0.216, 'eval_samples_per_second': 259.264, 'eval_steps_per_second': 4.63, 'epoch': 0.13}


[36m(RayTrainWorker pid=6659)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00007_7_batch_size=128,epochs=8,learning_rate=0.0200,weight_decay=0.0100_2025-07-14_05-29-32/checkpoint_000000)
[36m(SplitCoordinator pid=6777)[0m Registered dataset logger for dataset eval_18_0
[36m(SplitCoordinator pid=6777)[0m Starting execution of Dataset eval_18_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=6777)[0m Execution plan of Dataset eval_18_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=6727)[0m Registered dataset logger for dataset train_17_1



Trial status: 7 TERMINATED | 1 RUNNING | 8 PENDING
Current time: 2025-07-14 05:34:37. Total running time: 5min 5s
Logical resource usage: 0/1 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:A100)
Current best trial: 853b7_00000 with eval_loss=0.8454922437667847 and params={'train_loop_config': {'learning_rate': 2e-05, 'epochs': 6, 'batch_size': 128, 'weight_decay': 0.0}}
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                 status         ...oop_config/epochs     ...config/batch_size     ...fig/learning_rate     ...nfig/weight_decay     iter     total time (s)      loss     grad_norm     learning_rate      epoch |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

[36m(SplitCoordinator pid=6727)[0m Starting execution of Dataset train_17_1. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=6727)[0m Execution plan of Dataset train_17_1: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=6777)[0m ✔️  Dataset eval_18_0 execution finished in 0.12 seconds


[2m[36m(pid=6727) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=6727) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.
[36m(TrainTrainable pid=7012)[0m 2025-07-14 05:34:43.942519: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(TrainTrainable pid=7012)[0m 2025-07-14 05:34:43.959437: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[36m(TrainTrainable pid=7012)[0m E0000 00:00:1752471283.980918    7012 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
[36m(TrainTrainable pid=7012)[0m E0000 00:00:1752471283.987398    7012 cuda_blas.cc:1418] U


Trial TorchTrainer_853b7_00008 started with configuration:
+-------------------------------------------------+
| Trial TorchTrainer_853b7_00008 config           |
+-------------------------------------------------+
| train_loop_config/batch_size                 32 |
| train_loop_config/epochs                      2 |
| train_loop_config/learning_rate           2e-05 |
| train_loop_config/weight_decay              0.1 |
+-------------------------------------------------+


[36m(RayTrainWorker pid=7112)[0m Setting up process group for: env:// [rank=0, world_size=1]
[36m(TorchTrainer pid=7012)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=7012)[0m - (node_id=dfac881cdf0b95ec3c65ef7c281e0d5ea5b341dd9a6d97ecd807b1a3, ip=172.28.0.12, pid=7112) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=7112)[0m 2025-07-14 05:34:53.868598: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(RayTrainWorker pid=7112)[0m 2025-07-14 05:34:53.884792: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[36m(RayTrainWorker pid=7112)[0m E0000 00:00:1752471293.905422    7112 cuda_dnn.cc:8310] 

[36m(RayTrainWorker pid=7112)[0m CUDA available: True


[36m(RayTrainWorker pid=7112)[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
[36m(RayTrainWorker pid=7112)[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[36m(RayTrainWorker pid=7112)[0m Beginning training...


[36m(RayTrainWorker pid=7112)[0m wandb: Currently logged in as: nicolepcx to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
[36m(RayTrainWorker pid=7112)[0m wandb: Tracking run with wandb version 0.21.0
[36m(RayTrainWorker pid=7112)[0m wandb: Run data is saved locally in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/artifacts/2025-07-14_05-29-32/tune_transformers/working_dirs/TorchTrainer_853b7_00008_8_batch_size=32,epochs=2,learning_rate=0.0000,weight_decay=0.1000_2025-07-14_05-29-32/wandb/run-20250714_053501-q30ia151
[36m(RayTrainWorker pid=7112)[0m wandb: Run `wandb offline` to turn off syncing.
[36m(RayTrainWorker pid=7112)[0m wandb: Syncing run superglue_cb
[36m(RayTrainWorker pid=7112)[0m wandb: ⭐️ View project at https://wandb.ai/nicolepcx/huggingface
[36m(RayTrainWorker pid=7112)[0m wandb: 🚀 View run at https://wandb.ai/nicolepcx/huggingface/runs/q30ia151
  0%|          | 0/30 [00:00<?, ?it/s]
[36m(SplitCoordinator pid=7180)[0m Registered 

[2m[36m(pid=7180) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=7180) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

  3%|▎         | 1/30 [00:00<00:16,  1.77it/s]
 10%|█         | 3/30 [00:00<00:05,  5.06it/s]
 17%|█▋        | 5/30 [00:00<00:03,  7.91it/s]
 23%|██▎       | 7/30 [00:00<00:02, 10.33it/s]
[36m(SplitCoordinator pid=7180)[0m ✔️  Dataset train_19_0 execution finished in 0.94 seconds
 30%|███       | 9/30 [00:01<00:01, 11.58it/s]
 37%|███▋      | 11/30 [00:01<00:01, 11.14it/s]


[36m(RayTrainWorker pid=7112)[0m {'loss': 1.0731, 'grad_norm': 2.8056464195251465, 'learning_rate': 1e-05, 'epoch': 0.53}


[36m(RayTrainWorker pid=7112)[0m  43%|████▎     | 13/30 [00:01<00:01, 12.08it/s]
[36m(RayTrainWorker pid=7112)[0m  50%|█████     | 15/30 [00:01<00:01, 13.39it/s]                                                53%|█████▎    | 16/30 [00:01<00:01, 13.39it/s]


[2m[36m(pid=7230) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=7230) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=7112)[0m {'eval_loss': 0.9497998952865601, 'eval_accuracy': 0.4107142857142857, 'eval_f1': 0.1940928270042194, 'eval_runtime': 0.2099, 'eval_samples_per_second': 266.754, 'eval_steps_per_second': 9.527, 'epoch': 0.53}


[36m(RayTrainWorker pid=7112)[0m                                                 53%|█████▎    | 16/30 [00:01<00:01, 13.39it/s]



Trial status: 8 TERMINATED | 1 RUNNING | 7 PENDING
Current time: 2025-07-14 05:35:07. Total running time: 5min 35s
Logical resource usage: 0/1 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:A100)
Current best trial: 853b7_00000 with eval_loss=0.8454922437667847 and params={'train_loop_config': {'learning_rate': 2e-05, 'epochs': 6, 'batch_size': 128, 'weight_decay': 0.0}}
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                 status         ...oop_config/epochs     ...config/batch_size     ...fig/learning_rate     ...nfig/weight_decay     iter     total time (s)      loss     grad_norm     learning_rate      epoch |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

[36m(RayTrainWorker pid=7112)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00008_8_batch_size=32,epochs=2,learning_rate=0.0000,weight_decay=0.1000_2025-07-14_05-29-32/checkpoint_000000)
[36m(SplitCoordinator pid=7230)[0m Registered dataset logger for dataset eval_20_0
[36m(SplitCoordinator pid=7230)[0m Starting execution of Dataset eval_20_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=7230)[0m Execution plan of Dataset eval_20_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=7180)[0m Registered dataset logger for dataset train_19_1
[36m(SplitCoordinator pid=7180)[0m Starting execution of Dataset train_19_1. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=7180)[0m Execution plan of Dataset train_19_1: InputDataBuffer[Input] -> Out

[2m[36m(pid=7180) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=7180) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=7112)[0m  57%|█████▋    | 17/30 [00:05<00:09,  1.41it/s]
 63%|██████▎   | 19/30 [00:05<00:05,  1.96it/s]
 70%|███████   | 21/30 [00:05<00:03,  2.68it/s]
[36m(SplitCoordinator pid=7230)[0m ✔️  Dataset eval_20_0 execution finished in 0.11 seconds
 77%|███████▋  | 23/30 [00:06<00:01,  3.62it/s]
[36m(SplitCoordinator pid=7180)[0m ✔️  Dataset train_19_1 execution finished in 0.57 seconds
 83%|████████▎ | 25/30 [00:06<00:01,  4.62it/s]
 90%|█████████ | 27/30 [00:06<00:00,  5.74it/s]
 97%|█████████▋| 29/30 [00:06<00:00,  7.01it/s]


[36m(RayTrainWorker pid=7112)[0m {'loss': 0.8661, 'grad_norm': 4.483766555786133, 'learning_rate': 6.666666666666667e-07, 'epoch': 1.47}


[36m(RayTrainWorker pid=7112)[0m                                                100%|██████████| 30/30 [00:06<00:00,  7.01it/s]


[2m[36m(pid=7230) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=7230) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=7112)[0m {'eval_loss': 0.906518816947937, 'eval_accuracy': 0.5357142857142857, 'eval_f1': 0.3524384112619407, 'eval_runtime': 0.1837, 'eval_samples_per_second': 304.849, 'eval_steps_per_second': 10.887, 'epoch': 1.47}


[36m(RayTrainWorker pid=7112)[0m                                                100%|██████████| 30/30 [00:06<00:00,  7.01it/s]


[36m(RayTrainWorker pid=7112)[0m {'train_runtime': 13.5814, 'train_samples_per_second': 70.685, 'train_steps_per_second': 2.209, 'train_loss': 0.9764814058939616, 'epoch': 1.47}


[36m(RayTrainWorker pid=7112)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00008_8_batch_size=32,epochs=2,learning_rate=0.0000,weight_decay=0.1000_2025-07-14_05-29-32/checkpoint_000001)
[36m(SplitCoordinator pid=7230)[0m Registered dataset logger for dataset eval_20_1
[36m(SplitCoordinator pid=7230)[0m Starting execution of Dataset eval_20_1. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=7230)[0m Execution plan of Dataset eval_20_1: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=7230)[0m ✔️  Dataset eval_20_1 execution finished in 0.11 seconds
[36m(RayTrainWorker pid=7112)[0m                                                100%|██████████| 30/30 [00:12<00:00,  7.01it/s]100%|██████████| 30/30 [00:12<00:00,  2.38it/s]
[36m(SplitCoordinator pid=7230)[0m Registered dataset logger for dataset

[2m[36m(pid=7230) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=7230) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=7230)[0m ✔️  Dataset eval_20_2 execution finished in 0.12 seconds


[36m(RayTrainWorker pid=7112)[0m Final evaluation: {'eval_loss': 0.906518816947937, 'eval_accuracy': 0.5357142857142857, 'eval_f1': 0.3524384112619407, 'eval_runtime': 0.2914, 'eval_samples_per_second': 192.157, 'eval_steps_per_second': 6.863, 'epoch': 1.4666666666666668}


[36m(RayTrainWorker pid=7112)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00008_8_batch_size=32,epochs=2,learning_rate=0.0000,weight_decay=0.1000_2025-07-14_05-29-32/checkpoint_000002)
You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.



Trial TorchTrainer_853b7_00008 completed after 3 iterations at 2025-07-14 05:35:18. Total running time: 5min 45s
+-------------------------------------------------------------+
| Trial TorchTrainer_853b7_00008 result                       |
+-------------------------------------------------------------+
| checkpoint_dir_name                       checkpoint_000002 |
| time_this_iter_s                                    2.02193 |
| time_total_s                                       28.17588 |
| training_iteration                                        3 |
| eval_accuracy                                       0.53571 |
| eval_loss                                           0.90652 |
+-------------------------------------------------------------+


[36m(RayTrainWorker pid=7112)[0m Exception ignored in atexit callback: <function _start_and_connect_service.<locals>.teardown_atexit at 0x7d5019306160>
[36m(RayTrainWorker pid=7112)[0m Traceback (most recent call last):
[36m(RayTrainWorker pid=7112)[0m   File "/usr/local/lib/python3.11/dist-packages/wandb/sdk/lib/service/service_connection.py", line 54, in teardown_atexit
[36m(RayTrainWorker pid=7112)[0m     conn.teardown(hooks.exit_code)
[36m(RayTrainWorker pid=7112)[0m   File "/usr/local/lib/python3.11/dist-packages/wandb/sdk/lib/service/service_connection.py", line 190, in teardown
[36m(RayTrainWorker pid=7112)[0m     self._client.send_server_request(
[36m(RayTrainWorker pid=7112)[0m   File "/usr/local/lib/python3.11/dist-packages/wandb/sdk/lib/sock_client.py", line 150, in send_server_request
[36m(RayTrainWorker pid=7112)[0m     self._send_message(msg)
[36m(RayTrainWorker pid=7112)[0m   File "/usr/local/lib/python3.11/dist-packages/wandb/sdk/lib/sock_client.py", l


Trial TorchTrainer_853b7_00009 started with configuration:
+--------------------------------------------------+
| Trial TorchTrainer_853b7_00009 config            |
+--------------------------------------------------+
| train_loop_config/batch_size                  16 |
| train_loop_config/epochs                       2 |
| train_loop_config/learning_rate           0.0002 |
| train_loop_config/weight_decay               0.1 |
+--------------------------------------------------+


[36m(TorchTrainer pid=7513)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=7513)[0m - (node_id=dfac881cdf0b95ec3c65ef7c281e0d5ea5b341dd9a6d97ecd807b1a3, ip=172.28.0.12, pid=7620) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=7620)[0m Setting up process group for: env:// [rank=0, world_size=1]
[36m(RayTrainWorker pid=7620)[0m 2025-07-14 05:35:34.142698: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(RayTrainWorker pid=7620)[0m 2025-07-14 05:35:34.159118: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[36m(RayTrainWorker pid=7620)[0m E0000 00:00:1752471334.180025    7620 cuda_dnn.cc:8310] 


Trial status: 9 TERMINATED | 1 RUNNING | 6 PENDING
Current time: 2025-07-14 05:35:37. Total running time: 6min 5s
Logical resource usage: 0/1 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:A100)
Current best trial: 853b7_00000 with eval_loss=0.8454922437667847 and params={'train_loop_config': {'learning_rate': 2e-05, 'epochs': 6, 'batch_size': 128, 'weight_decay': 0.0}}
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                 status         ...oop_config/epochs     ...config/batch_size     ...fig/learning_rate     ...nfig/weight_decay     iter     total time (s)      loss     grad_norm     learning_rate      epoch |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

[36m(RayTrainWorker pid=7620)[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
[36m(RayTrainWorker pid=7620)[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[36m(RayTrainWorker pid=7620)[0m Beginning training...


[36m(RayTrainWorker pid=7620)[0m wandb: Currently logged in as: nicolepcx to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
[36m(RayTrainWorker pid=7620)[0m wandb: Tracking run with wandb version 0.21.0
[36m(RayTrainWorker pid=7620)[0m wandb: Run data is saved locally in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/artifacts/2025-07-14_05-29-32/tune_transformers/working_dirs/TorchTrainer_853b7_00009_9_batch_size=16,epochs=2,learning_rate=0.0002,weight_decay=0.1000_2025-07-14_05-29-32/wandb/run-20250714_053541-bbn8nlkd
[36m(RayTrainWorker pid=7620)[0m wandb: Run `wandb offline` to turn off syncing.
[36m(RayTrainWorker pid=7620)[0m wandb: Syncing run superglue_cb
[36m(RayTrainWorker pid=7620)[0m wandb: ⭐️ View project at https://wandb.ai/nicolepcx/huggingface
[36m(RayTrainWorker pid=7620)[0m wandb: 🚀 View run at https://wandb.ai/nicolepcx/huggingface/runs/bbn8nlkd
  0%|          | 0/30 [00:00<?, ?it/s]


[2m[36m(pid=7682) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=7682) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=7682)[0m Registered dataset logger for dataset train_21_0
[36m(SplitCoordinator pid=7682)[0m Starting execution of Dataset train_21_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=7682)[0m Execution plan of Dataset train_21_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
  3%|▎         | 1/30 [00:00<00:16,  1.77it/s]
 10%|█         | 3/30 [00:00<00:05,  5.04it/s]
 17%|█▋        | 5/30 [00:00<00:03,  7.81it/s]
 23%|██▎       | 7/30 [00:00<00:02, 10.20it/s]
[36m(SplitCoordinator pid=7682)[0m ✔️  Dataset train_21_0 execution finished in 0.95 seconds
 30%|███       | 9/30 [00:01<00:01, 11.47it/s]
 37%|███▋      | 11/30 [00:01<00:01, 11.16it/s]
 43%|████▎     | 13/30 [00:01<00:01, 12.10it/s]


[36m(RayTrainWorker pid=7620)[0m {'loss': 0.983, 'grad_norm': 1.9636911153793335, 'learning_rate': 0.0001, 'epoch': 0.53}


[36m(RayTrainWorker pid=7620)[0m  50%|█████     | 15/30 [00:01<00:01, 13.41it/s]                                                53%|█████▎    | 16/30 [00:01<00:01, 13.41it/s]


[2m[36m(pid=7732) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=7732) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=7620)[0m                                                 53%|█████▎    | 16/30 [00:01<00:01, 13.41it/s]


[36m(RayTrainWorker pid=7620)[0m {'eval_loss': 0.8639593720436096, 'eval_accuracy': 0.6964285714285714, 'eval_f1': 0.48016752894801673, 'eval_runtime': 0.2066, 'eval_samples_per_second': 271.033, 'eval_steps_per_second': 19.359, 'epoch': 0.53}


[36m(RayTrainWorker pid=7620)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00009_9_batch_size=16,epochs=2,learning_rate=0.0002,weight_decay=0.1000_2025-07-14_05-29-32/checkpoint_000000)
[36m(SplitCoordinator pid=7732)[0m Registered dataset logger for dataset eval_22_0
[36m(SplitCoordinator pid=7732)[0m Starting execution of Dataset eval_22_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=7732)[0m Execution plan of Dataset eval_22_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=7732)[0m ✔️  Dataset eval_22_0 execution finished in 0.11 seconds
[36m(SplitCoordinator pid=7682)[0m Registered dataset logger for dataset train_21_1


[2m[36m(pid=7682) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=7682) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=7682)[0m Starting execution of Dataset train_21_1. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=7682)[0m Execution plan of Dataset train_21_1: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
 57%|█████▋    | 17/30 [00:07<00:12,  1.03it/s]
You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.
 63%|██████▎   | 19/30 [00:07<00:07,  1.44it/s]
 70%|███████   | 21/30 [00:07<00:04,  2.00it/s]
 77%|███████▋  | 23/30 [00:07<00:02,  2.72it/s]
[36m(SplitCoordinator pid=7682)[0m ✔️  Dataset train_21_1 execution finished in 0.59 seconds
 83%|████████▎ | 25/30 [00:07<00:01,  3.60it/s]


[36m(RayTrainWorker pid=7620)[0m {'loss': 0.6838, 'grad_norm': 6.151392936706543, 'learning_rate': 6.666666666666667e-06, 'epoch': 1.47}


[36m(RayTrainWorker pid=7620)[0m  90%|█████████ | 27/30 [00:08<00:00,  4.61it/s]
[36m(RayTrainWorker pid=7620)[0m  97%|█████████▋| 29/30 [00:08<00:00,  5.78it/s]                                               100%|██████████| 30/30 [00:08<00:00,  5.78it/s]


[2m[36m(pid=7732) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=7732) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=7620)[0m                                                100%|██████████| 30/30 [00:08<00:00,  5.78it/s]


[36m(RayTrainWorker pid=7620)[0m {'eval_loss': 0.7809239625930786, 'eval_accuracy': 0.6785714285714286, 'eval_f1': 0.4706589706589706, 'eval_runtime': 0.1887, 'eval_samples_per_second': 296.758, 'eval_steps_per_second': 21.197, 'epoch': 1.47}
[36m(RayTrainWorker pid=7620)[0m {'train_runtime': 16.8639, 'train_samples_per_second': 28.463, 'train_steps_per_second': 1.779, 'train_loss': 0.8433420181274414, 'epoch': 1.47}


[36m(RayTrainWorker pid=7620)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00009_9_batch_size=16,epochs=2,learning_rate=0.0002,weight_decay=0.1000_2025-07-14_05-29-32/checkpoint_000001)
[36m(SplitCoordinator pid=7732)[0m Registered dataset logger for dataset eval_22_1
[36m(SplitCoordinator pid=7732)[0m Starting execution of Dataset eval_22_1. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=7732)[0m Execution plan of Dataset eval_22_1: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=7732)[0m ✔️  Dataset eval_22_1 execution finished in 0.11 seconds
[36m(RayTrainWorker pid=7620)[0m                                                100%|██████████| 30/30 [00:15<00:00,  5.78it/s]100%|██████████| 30/30 [00:15<00:00,  1.88it/s]
[36m(SplitCoordinator pid=7732)[0m Registered dataset logger for dataset

[2m[36m(pid=7732) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=7732) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=7732)[0m ✔️  Dataset eval_22_2 execution finished in 0.12 seconds


[36m(RayTrainWorker pid=7620)[0m Final evaluation: {'eval_loss': 0.7809239625930786, 'eval_accuracy': 0.6785714285714286, 'eval_f1': 0.4706589706589706, 'eval_runtime': 0.2187, 'eval_samples_per_second': 256.078, 'eval_steps_per_second': 18.291, 'epoch': 1.4666666666666668}


[36m(RayTrainWorker pid=7620)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00009_9_batch_size=16,epochs=2,learning_rate=0.0002,weight_decay=0.1000_2025-07-14_05-29-32/checkpoint_000002)
You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.



Trial TorchTrainer_853b7_00009 completed after 3 iterations at 2025-07-14 05:36:01. Total running time: 6min 29s
+-------------------------------------------------------------+
| Trial TorchTrainer_853b7_00009 result                       |
+-------------------------------------------------------------+
| checkpoint_dir_name                       checkpoint_000002 |
| time_this_iter_s                                    1.83484 |
| time_total_s                                       31.51704 |
| training_iteration                                        3 |
| eval_accuracy                                       0.67857 |
| eval_loss                                           0.78092 |
+-------------------------------------------------------------+


[36m(RayTrainWorker pid=7620)[0m Exception ignored in atexit callback: <function _start_and_connect_service.<locals>.teardown_atexit at 0x7f1874a1e160>
[36m(RayTrainWorker pid=7620)[0m Traceback (most recent call last):
[36m(RayTrainWorker pid=7620)[0m   File "/usr/local/lib/python3.11/dist-packages/wandb/sdk/lib/service/service_connection.py", line 54, in teardown_atexit
[36m(RayTrainWorker pid=7620)[0m     conn.teardown(hooks.exit_code)
[36m(RayTrainWorker pid=7620)[0m   File "/usr/local/lib/python3.11/dist-packages/wandb/sdk/lib/service/service_connection.py", line 190, in teardown
[36m(RayTrainWorker pid=7620)[0m     self._client.send_server_request(
[36m(RayTrainWorker pid=7620)[0m   File "/usr/local/lib/python3.11/dist-packages/wandb/sdk/lib/sock_client.py", line 150, in send_server_request
[36m(RayTrainWorker pid=7620)[0m     self._send_message(msg)
[36m(RayTrainWorker pid=7620)[0m   File "/usr/local/lib/python3.11/dist-packages/wandb/sdk/lib/sock_client.py", l


Trial status: 10 TERMINATED | 6 PENDING
Current time: 2025-07-14 05:36:07. Total running time: 6min 35s
Logical resource usage: 0/1 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:A100)
Current best trial: 853b7_00009 with eval_loss=0.7809239625930786 and params={'train_loop_config': {'learning_rate': 0.0002, 'epochs': 2, 'batch_size': 16, 'weight_decay': 0.1}}
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                 status         ...oop_config/epochs     ...config/batch_size     ...fig/learning_rate     ...nfig/weight_decay     iter     total time (s)      loss     grad_norm     learning_rate      epoch |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

[36m(TrainTrainable pid=8033)[0m 2025-07-14 05:36:07.984025: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(TrainTrainable pid=8033)[0m 2025-07-14 05:36:08.000794: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[36m(TrainTrainable pid=8033)[0m E0000 00:00:1752471368.021513    8033 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
[36m(TrainTrainable pid=8033)[0m E0000 00:00:1752471368.027819    8033 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[3


Trial TorchTrainer_853b7_00010 started with configuration:
+-------------------------------------------------+
| Trial TorchTrainer_853b7_00010 config           |
+-------------------------------------------------+
| train_loop_config/batch_size                 32 |
| train_loop_config/epochs                      8 |
| train_loop_config/learning_rate           0.002 |
| train_loop_config/weight_decay              0.1 |
+-------------------------------------------------+


[36m(RayTrainWorker pid=8134)[0m Setting up process group for: env:// [rank=0, world_size=1]
[36m(TorchTrainer pid=8033)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=8033)[0m - (node_id=dfac881cdf0b95ec3c65ef7c281e0d5ea5b341dd9a6d97ecd807b1a3, ip=172.28.0.12, pid=8134) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=8134)[0m 2025-07-14 05:36:17.972620: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(RayTrainWorker pid=8134)[0m 2025-07-14 05:36:17.988956: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[36m(RayTrainWorker pid=8134)[0m E0000 00:00:1752471378.009669    8134 cuda_dnn.cc:8310] 

[36m(RayTrainWorker pid=8134)[0m CUDA available: True


[36m(RayTrainWorker pid=8134)[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
[36m(RayTrainWorker pid=8134)[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[36m(RayTrainWorker pid=8134)[0m Beginning training...


[36m(RayTrainWorker pid=8134)[0m wandb: Currently logged in as: nicolepcx to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
[36m(RayTrainWorker pid=8134)[0m wandb: Tracking run with wandb version 0.21.0
[36m(RayTrainWorker pid=8134)[0m wandb: Run data is saved locally in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/artifacts/2025-07-14_05-29-32/tune_transformers/working_dirs/TorchTrainer_853b7_00010_10_batch_size=32,epochs=8,learning_rate=0.0020,weight_decay=0.1000_2025-07-14_05-29-32/wandb/run-20250714_053625-eh5hu3qn
[36m(RayTrainWorker pid=8134)[0m wandb: Run `wandb offline` to turn off syncing.
[36m(RayTrainWorker pid=8134)[0m wandb: Syncing run superglue_cb
[36m(RayTrainWorker pid=8134)[0m wandb: ⭐️ View project at https://wandb.ai/nicolepcx/huggingface
[36m(RayTrainWorker pid=8134)[0m wandb: 🚀 View run at https://wandb.ai/nicolepcx/huggingface/runs/eh5hu3qn
  0%|          | 0/120 [00:00<?, ?it/s]


[2m[36m(pid=8200) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=8200) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=8200)[0m Registered dataset logger for dataset train_23_0
[36m(SplitCoordinator pid=8200)[0m Starting execution of Dataset train_23_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=8200)[0m Execution plan of Dataset train_23_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
  1%|          | 1/120 [00:00<01:08,  1.74it/s]
  2%|▎         | 3/120 [00:00<00:23,  4.97it/s]
  4%|▍         | 5/120 [00:00<00:14,  7.91it/s]
  6%|▌         | 7/120 [00:00<00:10, 10.37it/s]
[36m(SplitCoordinator pid=8200)[0m ✔️  Dataset train_23_0 execution finished in 0.95 seconds
  8%|▊         | 9/120 [00:01<00:09, 11.67it/s]
  9%|▉         | 11/120 [00:01<00:09, 11.35it/s]
 11%|█         | 13/120 [00:01<00:08, 12.25it/s]


[36m(RayTrainWorker pid=8134)[0m {'loss': 1.5048, 'grad_norm': 11.928565979003906, 'learning_rate': 0.00175, 'epoch': 0.13}


[36m(RayTrainWorker pid=8134)[0m  12%|█▎        | 15/120 [00:01<00:07, 13.54it/s]
[36m(RayTrainWorker pid=8134)[0m                                                  13%|█▎        | 16/120 [00:01<00:07, 13.54it/s]


[2m[36m(pid=8250) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=8250) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=8134)[0m                                                  13%|█▎        | 16/120 [00:01<00:07, 13.54it/s]


[36m(RayTrainWorker pid=8134)[0m {'eval_loss': 1.6822737455368042, 'eval_accuracy': 0.5, 'eval_f1': 0.2222222222222222, 'eval_runtime': 0.211, 'eval_samples_per_second': 265.352, 'eval_steps_per_second': 9.477, 'epoch': 0.13}


[36m(RayTrainWorker pid=8134)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00010_10_batch_size=32,epochs=8,learning_rate=0.0020,weight_decay=0.1000_2025-07-14_05-29-32/checkpoint_000000)
[36m(SplitCoordinator pid=8250)[0m Registered dataset logger for dataset eval_24_0
[36m(SplitCoordinator pid=8250)[0m Starting execution of Dataset eval_24_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=8250)[0m Execution plan of Dataset eval_24_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=8250)[0m ✔️  Dataset eval_24_0 execution finished in 0.11 seconds


[2m[36m(pid=8200) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=8200) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=8200)[0m Registered dataset logger for dataset train_23_1
[36m(SplitCoordinator pid=8200)[0m Starting execution of Dataset train_23_1. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=8200)[0m Execution plan of Dataset train_23_1: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
 14%|█▍        | 17/120 [00:06<01:25,  1.20it/s]
 16%|█▌        | 19/120 [00:06<00:59,  1.68it/s]
 18%|█▊        | 21/120 [00:06<00:42,  2.34it/s]
 19%|█▉        | 23/120 [00:06<00:30,  3.18it/s]
[36m(SplitCoordinator pid=8200)[0m ✔️  Dataset train_23_1 execution finished in 0.52 seconds
 21%|██        | 25/120 [00:06<00:22,  4.16it/s]
 22%|██▎       | 27/120 [00:07<00:17,  5.23it/s]


[36m(RayTrainWorker pid=8134)[0m {'loss': 1.335, 'grad_norm': 12.714150428771973, 'learning_rate': 0.0014833333333333335, 'epoch': 1.13}


[36m(RayTrainWorker pid=8134)[0m  24%|██▍       | 29/120 [00:07<00:14,  6.48it/s]
[36m(RayTrainWorker pid=8134)[0m  26%|██▌       | 31/120 [00:07<00:11,  7.98it/s]                                                 27%|██▋       | 32/120 [00:07<00:11,  7.98it/s]


[2m[36m(pid=8250) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=8250) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=8134)[0m                                                  27%|██▋       | 32/120 [00:07<00:11,  7.98it/s]


[36m(RayTrainWorker pid=8134)[0m {'eval_loss': 2.046280860900879, 'eval_accuracy': 0.5, 'eval_f1': 0.2222222222222222, 'eval_runtime': 0.1864, 'eval_samples_per_second': 300.465, 'eval_steps_per_second': 10.731, 'epoch': 1.13}

Trial TorchTrainer_853b7_00010 completed after 1 iterations at 2025-07-14 05:36:34. Total running time: 7min 2s
+-------------------------------------------------------------+
| Trial TorchTrainer_853b7_00010 result                       |
+-------------------------------------------------------------+
| checkpoint_dir_name                       checkpoint_000000 |
| time_this_iter_s                                   20.03545 |
| time_total_s                                       20.03545 |
| training_iteration                                        1 |
| epoch                                               0.13333 |
| eval_accuracy                                           0.5 |
| eval_f1                                             0.22222 |
| eval_loss       

You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.



Trial status: 11 TERMINATED | 5 PENDING
Current time: 2025-07-14 05:36:37. Total running time: 7min 5s
Logical resource usage: 0/1 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:A100)
Current best trial: 853b7_00009 with eval_loss=0.7809239625930786 and params={'train_loop_config': {'learning_rate': 0.0002, 'epochs': 2, 'batch_size': 16, 'weight_decay': 0.1}}
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                 status         ...oop_config/epochs     ...config/batch_size     ...fig/learning_rate     ...nfig/weight_decay     iter     total time (s)      loss     grad_norm     learning_rate      epoch |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

[36m(TrainTrainable pid=8502)[0m 2025-07-14 05:36:42.043760: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(SplitCoordinator pid=8250)[0m Registered dataset logger for dataset eval_24_1
[36m(SplitCoordinator pid=8250)[0m Starting execution of Dataset eval_24_1. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=8250)[0m Execution plan of Dataset eval_24_1: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=8250)[0m ✔️  Dataset eval_24_1 execution finished in 0.11 seconds
[36m(TrainTrainable pid=8502)[0m 2025-07-14 05:36:42.062262: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for pl


Trial TorchTrainer_853b7_00011 started with configuration:
+------------------------------------------------+
| Trial TorchTrainer_853b7_00011 config          |
+------------------------------------------------+
| train_loop_config/batch_size                32 |
| train_loop_config/epochs                     6 |
| train_loop_config/learning_rate           0.02 |
| train_loop_config/weight_decay             0.1 |
+------------------------------------------------+


[36m(RayTrainWorker pid=8601)[0m Setting up process group for: env:// [rank=0, world_size=1]
[36m(TorchTrainer pid=8502)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=8502)[0m - (node_id=dfac881cdf0b95ec3c65ef7c281e0d5ea5b341dd9a6d97ecd807b1a3, ip=172.28.0.12, pid=8601) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=8601)[0m 2025-07-14 05:36:51.922314: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(RayTrainWorker pid=8601)[0m 2025-07-14 05:36:51.938624: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[36m(RayTrainWorker pid=8601)[0m E0000 00:00:1752471411.959404    8601 cuda_dnn.cc:8310] 

[36m(RayTrainWorker pid=8601)[0m CUDA available: True


[36m(RayTrainWorker pid=8601)[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
[36m(RayTrainWorker pid=8601)[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[36m(RayTrainWorker pid=8601)[0m Beginning training...


[36m(RayTrainWorker pid=8601)[0m wandb: Currently logged in as: nicolepcx to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
[36m(RayTrainWorker pid=8601)[0m wandb: Tracking run with wandb version 0.21.0
[36m(RayTrainWorker pid=8601)[0m wandb: Run data is saved locally in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/artifacts/2025-07-14_05-29-32/tune_transformers/working_dirs/TorchTrainer_853b7_00011_11_batch_size=32,epochs=6,learning_rate=0.0200,weight_decay=0.1000_2025-07-14_05-29-32/wandb/run-20250714_053659-ql8vfe6u
[36m(RayTrainWorker pid=8601)[0m wandb: Run `wandb offline` to turn off syncing.
[36m(RayTrainWorker pid=8601)[0m wandb: Syncing run superglue_cb
[36m(RayTrainWorker pid=8601)[0m wandb: ⭐️ View project at https://wandb.ai/nicolepcx/huggingface
[36m(RayTrainWorker pid=8601)[0m wandb: 🚀 View run at https://wandb.ai/nicolepcx/huggingface/runs/ql8vfe6u
  0%|          | 0/90 [00:00<?, ?it/s]


[2m[36m(pid=8669) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=8669) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=8669)[0m Registered dataset logger for dataset train_25_0
[36m(SplitCoordinator pid=8669)[0m Starting execution of Dataset train_25_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=8669)[0m Execution plan of Dataset train_25_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
  1%|          | 1/90 [00:00<00:50,  1.77it/s]
  3%|▎         | 3/90 [00:00<00:17,  5.03it/s]
  6%|▌         | 5/90 [00:00<00:10,  7.81it/s]
  8%|▊         | 7/90 [00:00<00:08, 10.22it/s]
[36m(SplitCoordinator pid=8669)[0m ✔️  Dataset train_25_0 execution finished in 0.95 seconds
 10%|█         | 9/90 [00:01<00:07, 11.47it/s]
 12%|█▏        | 11/90 [00:01<00:07, 11.27it/s]


[36m(RayTrainWorker pid=8601)[0m {'loss': 7.7034, 'grad_norm': 13.369223594665527, 'learning_rate': 0.016666666666666666, 'epoch': 0.18}


[36m(RayTrainWorker pid=8601)[0m  14%|█▍        | 13/90 [00:01<00:06, 12.14it/s]
[36m(RayTrainWorker pid=8601)[0m  17%|█▋        | 15/90 [00:01<00:05, 13.43it/s]                                                18%|█▊        | 16/90 [00:01<00:05, 13.43it/s]


[2m[36m(pid=8719) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=8719) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=8601)[0m {'eval_loss': 2.518627643585205, 'eval_accuracy': 0.5, 'eval_f1': 0.2222222222222222, 'eval_runtime': 0.2179, 'eval_samples_per_second': 257.052, 'eval_steps_per_second': 9.18, 'epoch': 0.18}


[36m(RayTrainWorker pid=8601)[0m                                                 18%|█▊        | 16/90 [00:01<00:05, 13.43it/s]



Trial status: 11 TERMINATED | 1 RUNNING | 4 PENDING
Current time: 2025-07-14 05:37:07. Total running time: 7min 35s
Logical resource usage: 0/1 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:A100)
Current best trial: 853b7_00009 with eval_loss=0.7809239625930786 and params={'train_loop_config': {'learning_rate': 0.0002, 'epochs': 2, 'batch_size': 16, 'weight_decay': 0.1}}
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                 status         ...oop_config/epochs     ...config/batch_size     ...fig/learning_rate     ...nfig/weight_decay     iter     total time (s)      loss     grad_norm     learning_rate      epoch |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

[36m(RayTrainWorker pid=8601)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00011_11_batch_size=32,epochs=6,learning_rate=0.0200,weight_decay=0.1000_2025-07-14_05-29-32/checkpoint_000000)
[36m(SplitCoordinator pid=8719)[0m Registered dataset logger for dataset eval_26_0
[36m(SplitCoordinator pid=8719)[0m Starting execution of Dataset eval_26_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=8719)[0m Execution plan of Dataset eval_26_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=8719)[0m ✔️  Dataset eval_26_0 execution finished in 0.12 seconds
[36m(TrainTrainable pid=8946)[0m 2025-07-14 05:37:14.247025: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders.


Trial TorchTrainer_853b7_00012 started with configuration:
+-------------------------------------------------+
| Trial TorchTrainer_853b7_00012 config           |
+-------------------------------------------------+
| train_loop_config/batch_size                 64 |
| train_loop_config/epochs                      4 |
| train_loop_config/learning_rate           2e-05 |
| train_loop_config/weight_decay            0.001 |
+-------------------------------------------------+


[36m(TorchTrainer pid=8946)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=8946)[0m - (node_id=dfac881cdf0b95ec3c65ef7c281e0d5ea5b341dd9a6d97ecd807b1a3, ip=172.28.0.12, pid=9046) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=9046)[0m Setting up process group for: env:// [rank=0, world_size=1]
[36m(RayTrainWorker pid=9046)[0m 2025-07-14 05:37:24.226470: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(RayTrainWorker pid=9046)[0m 2025-07-14 05:37:24.244746: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[36m(RayTrainWorker pid=9046)[0m E0000 00:00:1752471444.266200    9046 cuda_dnn.cc:8310] 

[36m(RayTrainWorker pid=9046)[0m CUDA available: True


[36m(RayTrainWorker pid=9046)[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
[36m(RayTrainWorker pid=9046)[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[36m(RayTrainWorker pid=9046)[0m Beginning training...


[36m(RayTrainWorker pid=9046)[0m wandb: Currently logged in as: nicolepcx to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
[36m(RayTrainWorker pid=9046)[0m wandb: Tracking run with wandb version 0.21.0
[36m(RayTrainWorker pid=9046)[0m wandb: Run data is saved locally in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/artifacts/2025-07-14_05-29-32/tune_transformers/working_dirs/TorchTrainer_853b7_00012_12_batch_size=64,epochs=4,learning_rate=0.0000,weight_decay=0.0010_2025-07-14_05-29-32/wandb/run-20250714_053732-ksq5php3
[36m(RayTrainWorker pid=9046)[0m wandb: Run `wandb offline` to turn off syncing.
[36m(RayTrainWorker pid=9046)[0m wandb: Syncing run superglue_cb
[36m(RayTrainWorker pid=9046)[0m wandb: ⭐️ View project at https://wandb.ai/nicolepcx/huggingface
[36m(RayTrainWorker pid=9046)[0m wandb: 🚀 View run at https://wandb.ai/nicolepcx/huggingface/runs/ksq5php3
  0%|          | 0/60 [00:00<?, ?it/s]


[2m[36m(pid=9114) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=9114) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=9114)[0m Registered dataset logger for dataset train_27_0
[36m(SplitCoordinator pid=9114)[0m Starting execution of Dataset train_27_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=9114)[0m Execution plan of Dataset train_27_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
  2%|▏         | 1/60 [00:00<00:33,  1.79it/s]
  5%|▌         | 3/60 [00:00<00:11,  5.07it/s]
  8%|▊         | 5/60 [00:00<00:07,  7.81it/s]
 12%|█▏        | 7/60 [00:00<00:05, 10.20it/s]
[36m(SplitCoordinator pid=9114)[0m ✔️  Dataset train_27_0 execution finished in 0.95 seconds
 15%|█▌        | 9/60 [00:01<00:04, 11.50it/s]
 18%|█▊        | 11/60 [00:01<00:04, 11.30it/s]
 22%|██▏       | 13/60 [00:01<00:03, 12.19it/s]


[36m(RayTrainWorker pid=9046)[0m {'loss': 1.0549, 'grad_norm': 2.1848177909851074, 'learning_rate': 1.5000000000000002e-05, 'epoch': 0.27}


[36m(RayTrainWorker pid=9046)[0m  25%|██▌       | 15/60 [00:01<00:03, 13.47it/s]
[36m(RayTrainWorker pid=9046)[0m                                                 27%|██▋       | 16/60 [00:01<00:03, 13.47it/s]


[2m[36m(pid=9166) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=9166) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=9046)[0m {'eval_loss': 0.9507132172584534, 'eval_accuracy': 0.42857142857142855, 'eval_f1': 0.21956970232832304, 'eval_runtime': 0.2114, 'eval_samples_per_second': 264.952, 'eval_steps_per_second': 4.731, 'epoch': 0.27}


[36m(RayTrainWorker pid=9046)[0m                                                 27%|██▋       | 16/60 [00:01<00:03, 13.47it/s]



Trial status: 12 TERMINATED | 1 RUNNING | 3 PENDING
Current time: 2025-07-14 05:37:37. Total running time: 8min 5s
Logical resource usage: 0/1 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:A100)
Current best trial: 853b7_00009 with eval_loss=0.7809239625930786 and params={'train_loop_config': {'learning_rate': 0.0002, 'epochs': 2, 'batch_size': 16, 'weight_decay': 0.1}}
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                 status         ...oop_config/epochs     ...config/batch_size     ...fig/learning_rate     ...nfig/weight_decay     iter     total time (s)      loss     grad_norm     learning_rate      epoch |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

[36m(RayTrainWorker pid=9046)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00012_12_batch_size=64,epochs=4,learning_rate=0.0000,weight_decay=0.0010_2025-07-14_05-29-32/checkpoint_000000)
[36m(SplitCoordinator pid=9166)[0m Registered dataset logger for dataset eval_28_0
[36m(SplitCoordinator pid=9166)[0m Starting execution of Dataset eval_28_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=9166)[0m Execution plan of Dataset eval_28_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]


[2m[36m(pid=9114) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=9114) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=9114)[0m Registered dataset logger for dataset train_27_1
[36m(SplitCoordinator pid=9114)[0m Starting execution of Dataset train_27_1. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=9114)[0m Execution plan of Dataset train_27_1: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
 28%|██▊       | 17/60 [00:06<00:33,  1.30it/s]
[36m(SplitCoordinator pid=9166)[0m ✔️  Dataset eval_28_0 execution finished in 0.11 seconds
 32%|███▏      | 19/60 [00:06<00:22,  1.80it/s]
 35%|███▌      | 21/60 [00:06<00:15,  2.49it/s]
 38%|███▊      | 23/60 [00:06<00:11,  3.36it/s]
[36m(SplitCoordinator pid=9114)[0m ✔️  Dataset train_27_1 execution finished in 0.57 seconds
 42%|████▏     | 25/60 [00:06<00:08,  4.36it/s]
 45%|████▌     | 27/60 [00:06<00:06,  5.45it/s]
 48%|████▊     | 29/60 [00:06<00:04,  6.70it/s]


[36m(RayTrainWorker pid=9046)[0m {'loss': 0.8561, 'grad_norm': 2.5184457302093506, 'learning_rate': 9.666666666666667e-06, 'epoch': 1.27}


[36m(RayTrainWorker pid=9046)[0m  52%|█████▏    | 31/60 [00:07<00:03,  8.21it/s]
[36m(RayTrainWorker pid=9046)[0m                                                 53%|█████▎    | 32/60 [00:07<00:03,  8.21it/s]


[2m[36m(pid=9166) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=9166) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=9046)[0m {'eval_loss': 0.8541810512542725, 'eval_accuracy': 0.6785714285714286, 'eval_f1': 0.47391812865497074, 'eval_runtime': 0.1901, 'eval_samples_per_second': 294.637, 'eval_steps_per_second': 5.261, 'epoch': 1.27}


[36m(RayTrainWorker pid=9046)[0m                                                 53%|█████▎    | 32/60 [00:07<00:03,  8.21it/s]
[36m(RayTrainWorker pid=9046)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00012_12_batch_size=64,epochs=4,learning_rate=0.0000,weight_decay=0.0010_2025-07-14_05-29-32/checkpoint_000001)
[36m(SplitCoordinator pid=9166)[0m Registered dataset logger for dataset eval_28_1
[36m(SplitCoordinator pid=9166)[0m Starting execution of Dataset eval_28_1. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=9166)[0m Execution plan of Dataset eval_28_1: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=9166)[0m ✔️  Dataset eval_28_1 execution finished in 0.11 second

[2m[36m(pid=9114) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=9114) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=9114)[0m Registered dataset logger for dataset train_27_2
[36m(SplitCoordinator pid=9114)[0m Starting execution of Dataset train_27_2. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=9114)[0m Execution plan of Dataset train_27_2: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
 55%|█████▌    | 33/60 [00:16<00:40,  1.49s/it]
 58%|█████▊    | 35/60 [00:16<00:26,  1.07s/it]
 62%|██████▏   | 37/60 [00:16<00:17,  1.31it/s]
 65%|██████▌   | 39/60 [00:16<00:11,  1.81it/s]
[36m(SplitCoordinator pid=9114)[0m ✔️  Dataset train_27_2 execution finished in 0.57 seconds
 68%|██████▊   | 41/60 [00:16<00:07,  2.45it/s]
 72%|███████▏  | 43/60 [00:17<00:05,  3.23it/s]


[36m(RayTrainWorker pid=9046)[0m {'loss': 0.7414, 'grad_norm': 2.6513924598693848, 'learning_rate': 4.333333333333334e-06, 'epoch': 2.27}


[36m(RayTrainWorker pid=9046)[0m  75%|███████▌  | 45/60 [00:17<00:03,  4.21it/s]
[36m(RayTrainWorker pid=9046)[0m  78%|███████▊  | 47/60 [00:17<00:02,  5.44it/s]                                                80%|████████  | 48/60 [00:17<00:02,  5.44it/s]


[2m[36m(pid=9166) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=9166) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=9046)[0m {'eval_loss': 0.8248645067214966, 'eval_accuracy': 0.6964285714285714, 'eval_f1': 0.4862772695285011, 'eval_runtime': 0.1919, 'eval_samples_per_second': 291.833, 'eval_steps_per_second': 5.211, 'epoch': 2.27}


[36m(RayTrainWorker pid=9046)[0m                                                 80%|████████  | 48/60 [00:17<00:02,  5.44it/s]
[36m(RayTrainWorker pid=9046)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00012_12_batch_size=64,epochs=4,learning_rate=0.0000,weight_decay=0.0010_2025-07-14_05-29-32/checkpoint_000002)
[36m(SplitCoordinator pid=9166)[0m Registered dataset logger for dataset eval_28_2
[36m(SplitCoordinator pid=9166)[0m Starting execution of Dataset eval_28_2. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=9166)[0m Execution plan of Dataset eval_28_2: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=9166)[0m ✔️  Dataset eval_28_2 execution finished in 0.12 seconds


[2m[36m(pid=9114) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=9114) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=9114)[0m Registered dataset logger for dataset train_27_3
[36m(SplitCoordinator pid=9114)[0m Starting execution of Dataset train_27_3. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=9114)[0m Execution plan of Dataset train_27_3: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
 82%|████████▏ | 49/60 [00:24<00:13,  1.27s/it]
 85%|████████▌ | 51/60 [00:25<00:08,  1.10it/s]
 88%|████████▊ | 53/60 [00:25<00:04,  1.53it/s]
 92%|█████████▏| 55/60 [00:25<00:02,  2.10it/s]
[36m(SplitCoordinator pid=9114)[0m ✔️  Dataset train_27_3 execution finished in 0.55 seconds
 95%|█████████▌| 57/60 [00:25<00:01,  2.83it/s]


[36m(RayTrainWorker pid=9046)[0m {'loss': 0.6304, 'grad_norm': 2.5897536277770996, 'learning_rate': 3.3333333333333335e-07, 'epoch': 3.2}


[36m(RayTrainWorker pid=9046)[0m  98%|█████████▊| 59/60 [00:25<00:00,  3.69it/s]                                               100%|██████████| 60/60 [00:25<00:00,  3.69it/s]


[2m[36m(pid=9166) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=9166) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=9046)[0m {'eval_loss': 0.8204472661018372, 'eval_accuracy': 0.6964285714285714, 'eval_f1': 0.4862772695285011, 'eval_runtime': 1.5618, 'eval_samples_per_second': 35.856, 'eval_steps_per_second': 0.64, 'epoch': 3.2}


[36m(RayTrainWorker pid=9046)[0m                                                100%|██████████| 60/60 [00:27<00:00,  3.69it/s]
[36m(RayTrainWorker pid=9046)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00012_12_batch_size=64,epochs=4,learning_rate=0.0000,weight_decay=0.0010_2025-07-14_05-29-32/checkpoint_000003)
[36m(SplitCoordinator pid=9166)[0m Registered dataset logger for dataset eval_28_3
[36m(SplitCoordinator pid=9166)[0m Starting execution of Dataset eval_28_3. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=9166)[0m Execution plan of Dataset eval_28_3: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=9166)[0m ✔️  Dataset eval_28_3 execution finished in 1.48 seconds


Trial status: 12 TERMINATED | 1 RUNNING | 3 PENDING
Current time: 2025-07-14 05:38:08. Total running time: 8min 35s
Logical resource usage: 0/1 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:A100)
Current best trial: 853b7_00009 with eval_loss=0.7809239625930786 and params={'train_loop_config': {'learning_rate': 0.0002, 'epochs': 2, 'batch_size': 16, 'weight_decay': 0.1}}
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                 status         ...oop_config/epochs     ...config/batch_size     ...fig/learning_rate     ...nfig/weight_decay     iter     total time (s)      loss     grad_norm     learning_rate      epoch |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

[36m(RayTrainWorker pid=9046)[0m                                                100%|██████████| 60/60 [00:35<00:00,  3.69it/s]100%|██████████| 60/60 [00:35<00:00,  1.71it/s]


[2m[36m(pid=9166) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=9166) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=9166)[0m Registered dataset logger for dataset eval_28_4
[36m(SplitCoordinator pid=9166)[0m Starting execution of Dataset eval_28_4. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=9166)[0m Execution plan of Dataset eval_28_4: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=9166)[0m ✔️  Dataset eval_28_4 execution finished in 0.14 seconds


[36m(RayTrainWorker pid=9046)[0m Final evaluation: {'eval_loss': 0.8204472661018372, 'eval_accuracy': 0.6964285714285714, 'eval_f1': 0.4862772695285011, 'eval_runtime': 0.2305, 'eval_samples_per_second': 242.948, 'eval_steps_per_second': 4.338, 'epoch': 3.2}


[36m(RayTrainWorker pid=9046)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00012_12_batch_size=64,epochs=4,learning_rate=0.0000,weight_decay=0.0010_2025-07-14_05-29-32/checkpoint_000004)



Trial TorchTrainer_853b7_00012 completed after 5 iterations at 2025-07-14 05:38:11. Total running time: 8min 39s
+-------------------------------------------------------------+
| Trial TorchTrainer_853b7_00012 result                       |
+-------------------------------------------------------------+
| checkpoint_dir_name                       checkpoint_000004 |
| time_this_iter_s                                     2.7336 |
| time_total_s                                       50.80622 |
| training_iteration                                        5 |
| eval_accuracy                                       0.69643 |
| eval_loss                                           0.82045 |
+-------------------------------------------------------------+


[36m(RayTrainWorker pid=9046)[0m Exception ignored in atexit callback: <function _start_and_connect_service.<locals>.teardown_atexit at 0x7e683274a0c0>
[36m(RayTrainWorker pid=9046)[0m Traceback (most recent call last):
[36m(RayTrainWorker pid=9046)[0m   File "/usr/local/lib/python3.11/dist-packages/wandb/sdk/lib/service/service_connection.py", line 54, in teardown_atexit
[36m(RayTrainWorker pid=9046)[0m     conn.teardown(hooks.exit_code)
[36m(RayTrainWorker pid=9046)[0m   File "/usr/local/lib/python3.11/dist-packages/wandb/sdk/lib/service/service_connection.py", line 190, in teardown
[36m(RayTrainWorker pid=9046)[0m     
[36m(RayTrainWorker pid=9046)[0m self._client.send_server_request(
[36m(RayTrainWorker pid=9046)[0m   File "/usr/local/lib/python3.11/dist-packages/wandb/sdk/lib/sock_client.py", line 150, in send_server_request
[36m(RayTrainWorker pid=9046)[0m     
[36m(RayTrainWorker pid=9046)[0m self._send_message(msg)
[36m(RayTrainWorker pid=9046)[0m   File "


Trial TorchTrainer_853b7_00013 started with configuration:
+--------------------------------------------------+
| Trial TorchTrainer_853b7_00013 config            |
+--------------------------------------------------+
| train_loop_config/batch_size                  16 |
| train_loop_config/epochs                       4 |
| train_loop_config/learning_rate           0.0002 |
| train_loop_config/weight_decay             0.001 |
+--------------------------------------------------+


[36m(RayTrainWorker pid=9686)[0m Setting up process group for: env:// [rank=0, world_size=1]
[36m(TorchTrainer pid=9575)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=9575)[0m - (node_id=dfac881cdf0b95ec3c65ef7c281e0d5ea5b341dd9a6d97ecd807b1a3, ip=172.28.0.12, pid=9686) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=9686)[0m 2025-07-14 05:38:28.837258: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(RayTrainWorker pid=9686)[0m 2025-07-14 05:38:28.854998: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[36m(RayTrainWorker pid=9686)[0m E0000 00:00:1752471508.876110    9686 cuda_dnn.cc:8310] 

[36m(RayTrainWorker pid=9686)[0m CUDA available: True


[36m(RayTrainWorker pid=9686)[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
[36m(RayTrainWorker pid=9686)[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[36m(RayTrainWorker pid=9686)[0m Beginning training...


[36m(RayTrainWorker pid=9686)[0m wandb: Currently logged in as: nicolepcx to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
[36m(RayTrainWorker pid=9686)[0m wandb: Tracking run with wandb version 0.21.0
[36m(RayTrainWorker pid=9686)[0m wandb: Run data is saved locally in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/artifacts/2025-07-14_05-29-32/tune_transformers/working_dirs/TorchTrainer_853b7_00013_13_batch_size=16,epochs=4,learning_rate=0.0002,weight_decay=0.0010_2025-07-14_05-29-32/wandb/run-20250714_053837-yazdqj07
[36m(RayTrainWorker pid=9686)[0m wandb: Run `wandb offline` to turn off syncing.
[36m(RayTrainWorker pid=9686)[0m wandb: Syncing run superglue_cb
[36m(RayTrainWorker pid=9686)[0m wandb: ⭐️ View project at https://wandb.ai/nicolepcx/huggingface
[36m(RayTrainWorker pid=9686)[0m wandb: 🚀 View run at https://wandb.ai/nicolepcx/huggingface/runs/yazdqj07
  0%|          | 0/60 [00:00<?, ?it/s]


[2m[36m(pid=9752) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=9752) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=9752)[0m Registered dataset logger for dataset train_29_0
[36m(SplitCoordinator pid=9752)[0m Starting execution of Dataset train_29_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=9752)[0m Execution plan of Dataset train_29_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]



Trial status: 13 TERMINATED | 1 RUNNING | 2 PENDING
Current time: 2025-07-14 05:38:38. Total running time: 9min 5s
Logical resource usage: 0/1 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:A100)
Current best trial: 853b7_00009 with eval_loss=0.7809239625930786 and params={'train_loop_config': {'learning_rate': 0.0002, 'epochs': 2, 'batch_size': 16, 'weight_decay': 0.1}}

  2%|▏         | 1/60 [00:00<00:33,  1.74it/s]
  5%|▌         | 3/60 [00:00<00:11,  4.97it/s]
  8%|▊         | 5/60 [00:00<00:07,  7.82it/s]
 12%|█▏        | 7/60 [00:00<00:05, 10.25it/s]
[36m(SplitCoordinator pid=9752)[0m ✔️  Dataset train_29_0 execution finished in 0.95 seconds
 15%|█▌        | 9/60 [00:01<00:04, 11.57it/s]
 18%|█▊        | 11/60 [00:01<00:04, 11.30it/s]
 22%|██▏       | 13/60 [00:01<00:03, 12.21it/s]


[36m(RayTrainWorker pid=9686)[0m {'loss': 0.9917, 'grad_norm': 4.576046943664551, 'learning_rate': 0.00015000000000000001, 'epoch': 0.27}


[36m(RayTrainWorker pid=9686)[0m  25%|██▌       | 15/60 [00:01<00:03, 13.49it/s]
[36m(RayTrainWorker pid=9686)[0m                                                 27%|██▋       | 16/60 [00:01<00:03, 13.49it/s]


[2m[36m(pid=9798) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=9798) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=9686)[0m                                                 27%|██▋       | 16/60 [00:01<00:03, 13.49it/s]


[36m(RayTrainWorker pid=9686)[0m {'eval_loss': 0.9868366718292236, 'eval_accuracy': 0.5, 'eval_f1': 0.2222222222222222, 'eval_runtime': 0.2163, 'eval_samples_per_second': 258.934, 'eval_steps_per_second': 18.495, 'epoch': 0.27}


[36m(RayTrainWorker pid=9686)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00013_13_batch_size=16,epochs=4,learning_rate=0.0002,weight_decay=0.0010_2025-07-14_05-29-32/checkpoint_000000)
[36m(SplitCoordinator pid=9798)[0m Registered dataset logger for dataset eval_30_0
[36m(SplitCoordinator pid=9798)[0m Starting execution of Dataset eval_30_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=9798)[0m Execution plan of Dataset eval_30_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=9798)[0m ✔️  Dataset eval_30_0 execution finished in 0.11 seconds


[2m[36m(pid=9752) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=9752) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=9752)[0m Registered dataset logger for dataset train_29_1
[36m(SplitCoordinator pid=9752)[0m Starting execution of Dataset train_29_1. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=9752)[0m Execution plan of Dataset train_29_1: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
 28%|██▊       | 17/60 [00:09<00:54,  1.26s/it]
 32%|███▏      | 19/60 [00:09<00:36,  1.13it/s]
 35%|███▌      | 21/60 [00:09<00:24,  1.58it/s]
 38%|███▊      | 23/60 [00:09<00:16,  2.18it/s]
[36m(SplitCoordinator pid=9752)[0m ✔️  Dataset train_29_1 execution finished in 0.54 seconds
 42%|████▏     | 25/60 [00:09<00:11,  2.94it/s]
 45%|████▌     | 27/60 [00:09<00:08,  3.84it/s]


[36m(RayTrainWorker pid=9686)[0m {'loss': 1.0283, 'grad_norm': 1.396965503692627, 'learning_rate': 9.666666666666667e-05, 'epoch': 1.27}


[36m(RayTrainWorker pid=9686)[0m  48%|████▊     | 29/60 [00:09<00:06,  4.93it/s]
[36m(RayTrainWorker pid=9686)[0m  52%|█████▏    | 31/60 [00:10<00:04,  6.26it/s]                                                53%|█████▎    | 32/60 [00:10<00:04,  6.26it/s]


[2m[36m(pid=9798) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=9798) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=9686)[0m                                                 53%|█████▎    | 32/60 [00:10<00:04,  6.26it/s]


[36m(RayTrainWorker pid=9686)[0m {'eval_loss': 0.9530501365661621, 'eval_accuracy': 0.5, 'eval_f1': 0.2222222222222222, 'eval_runtime': 0.1939, 'eval_samples_per_second': 288.878, 'eval_steps_per_second': 20.634, 'epoch': 1.27}

Trial TorchTrainer_853b7_00013 completed after 2 iterations at 2025-07-14 05:38:55. Total running time: 9min 23s
+-------------------------------------------------------------+
| Trial TorchTrainer_853b7_00013 result                       |
+-------------------------------------------------------------+
| checkpoint_dir_name                       checkpoint_000001 |
| time_this_iter_s                                    8.28605 |
| time_total_s                                       31.52741 |
| training_iteration                                        2 |
| epoch                                               1.26667 |
| eval_accuracy                                           0.5 |
| eval_f1                                             0.22222 |
| eval_loss     

[36m(RayTrainWorker pid=9686)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00013_13_batch_size=16,epochs=4,learning_rate=0.0002,weight_decay=0.0010_2025-07-14_05-29-32/checkpoint_000001)
[36m(SplitCoordinator pid=9798)[0m Registered dataset logger for dataset eval_30_1
[36m(SplitCoordinator pid=9798)[0m Starting execution of Dataset eval_30_1. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=9798)[0m Execution plan of Dataset eval_30_1: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=9798)[0m ✔️  Dataset eval_30_1 execution finished in 0.11 seconds


[2m[36m(pid=9752) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=9752) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=9752)[0m Registered dataset logger for dataset train_29_2
[36m(SplitCoordinator pid=9752)[0m Starting execution of Dataset train_29_2. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=9752)[0m Execution plan of Dataset train_29_2: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(TrainTrainable pid=10093)[0m 2025-07-14 05:39:01.784576: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(TrainTrainable pid=10093)[0m 2025-07-14 05:39:01.801071: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[36m(TrainTrainable pid=10093)[0m E0000


Trial TorchTrainer_853b7_00014 started with configuration:
+-------------------------------------------------+
| Trial TorchTrainer_853b7_00014 config           |
+-------------------------------------------------+
| train_loop_config/batch_size                128 |
| train_loop_config/epochs                      8 |
| train_loop_config/learning_rate           0.002 |
| train_loop_config/weight_decay            0.001 |
+-------------------------------------------------+

Trial status: 14 TERMINATED | 1 RUNNING | 1 PENDING
Current time: 2025-07-14 05:39:08. Total running time: 9min 35s
Logical resource usage: 0/1 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:A100)
Current best trial: 853b7_00009 with eval_loss=0.7809239625930786 and params={'train_loop_config': {'learning_rate': 0.0002, 'epochs': 2, 'batch_size': 16, 'weight_decay': 0.1}}

[36m(TorchTrainer pid=10093)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=10093)[0m - (node_id=dfac881cdf0b95ec3c65ef7c281e0d5ea5b341dd9a6d97ecd807b1a3, ip=172.28.0.12, pid=10193) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=10193)[0m Setting up process group for: env:// [rank=0, world_size=1]
[36m(RayTrainWorker pid=10193)[0m 2025-07-14 05:39:11.886803: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(RayTrainWorker pid=10193)[0m 2025-07-14 05:39:11.903331: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[36m(RayTrainWorker pid=10193)[0m E0000 00:00:1752471551.924330   10193 cuda_dnn.cc

[36m(RayTrainWorker pid=10193)[0m CUDA available: True


[36m(RayTrainWorker pid=10193)[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
[36m(RayTrainWorker pid=10193)[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[36m(RayTrainWorker pid=10193)[0m Beginning training...


[36m(RayTrainWorker pid=10193)[0m wandb: Currently logged in as: nicolepcx to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
[36m(RayTrainWorker pid=10193)[0m wandb: Tracking run with wandb version 0.21.0
[36m(RayTrainWorker pid=10193)[0m wandb: Run data is saved locally in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/artifacts/2025-07-14_05-29-32/tune_transformers/working_dirs/TorchTrainer_853b7_00014_14_batch_size=128,epochs=8,learning_rate=0.0020,weight_decay=0.0010_2025-07-14_05-29-32/wandb/run-20250714_053919-xul5lojq
[36m(RayTrainWorker pid=10193)[0m wandb: Run `wandb offline` to turn off syncing.
[36m(RayTrainWorker pid=10193)[0m wandb: Syncing run superglue_cb
[36m(RayTrainWorker pid=10193)[0m wandb: ⭐️ View project at https://wandb.ai/nicolepcx/huggingface
[36m(RayTrainWorker pid=10193)[0m wandb: 🚀 View run at https://wandb.ai/nicolepcx/huggingface/runs/xul5lojq
  0%|          | 0/120 [00:00<?, ?it/s]


[2m[36m(pid=10259) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=10259) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=10259)[0m Registered dataset logger for dataset train_31_0
[36m(SplitCoordinator pid=10259)[0m Starting execution of Dataset train_31_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=10259)[0m Execution plan of Dataset train_31_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
  1%|          | 1/120 [00:00<01:10,  1.69it/s]
  2%|▎         | 3/120 [00:00<00:24,  4.84it/s]
  4%|▍         | 5/120 [00:00<00:15,  7.45it/s]
  6%|▌         | 7/120 [00:00<00:11,  9.68it/s]
[36m(SplitCoordinator pid=10259)[0m ✔️  Dataset train_31_0 execution finished in 1.00 seconds
  8%|▊         | 9/120 [00:01<00:10, 10.99it/s]
  9%|▉         | 11/120 [00:01<00:09, 10.91it/s]
 11%|█         | 13/120 [00:01<00:09, 11.88it/s]


[36m(RayTrainWorker pid=10193)[0m {'loss': 2.0457, 'grad_norm': 8.499955177307129, 'learning_rate': 0.00175, 'epoch': 0.13}


[36m(RayTrainWorker pid=10193)[0m  12%|█▎        | 15/120 [00:01<00:07, 13.21it/s]
[36m(RayTrainWorker pid=10193)[0m                                                  13%|█▎        | 16/120 [00:01<00:07, 13.21it/s]


[2m[36m(pid=10309) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=10309) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=10193)[0m                                                  13%|█▎        | 16/120 [00:01<00:07, 13.21it/s]


[36m(RayTrainWorker pid=10193)[0m {'eval_loss': 1.4230986833572388, 'eval_accuracy': 0.5, 'eval_f1': 0.2222222222222222, 'eval_runtime': 0.226, 'eval_samples_per_second': 247.799, 'eval_steps_per_second': 4.425, 'epoch': 0.13}


[36m(RayTrainWorker pid=10193)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00014_14_batch_size=128,epochs=8,learning_rate=0.0020,weight_decay=0.0010_2025-07-14_05-29-32/checkpoint_000000)
[36m(SplitCoordinator pid=10309)[0m Registered dataset logger for dataset eval_32_0
[36m(SplitCoordinator pid=10309)[0m Starting execution of Dataset eval_32_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=10309)[0m Execution plan of Dataset eval_32_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=10309)[0m ✔️  Dataset eval_32_0 execution finished in 0.12 seconds


[2m[36m(pid=10259) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=10259) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=10259)[0m Registered dataset logger for dataset train_31_1
[36m(SplitCoordinator pid=10259)[0m Starting execution of Dataset train_31_1. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=10259)[0m Execution plan of Dataset train_31_1: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
 14%|█▍        | 17/120 [00:06<01:23,  1.24it/s]
 16%|█▌        | 19/120 [00:06<00:58,  1.73it/s]
 18%|█▊        | 21/120 [00:06<00:41,  2.39it/s]
 19%|█▉        | 23/120 [00:06<00:29,  3.25it/s]
[36m(SplitCoordinator pid=10259)[0m ✔️  Dataset train_31_1 execution finished in 0.55 seconds
 21%|██        | 25/120 [00:06<00:22,  4.26it/s]
 22%|██▎       | 27/120 [00:07<00:17,  5.34it/s]
 24%|██▍       | 29/120 [00:07<00:13,  6.59it/s]


[36m(RayTrainWorker pid=10193)[0m {'loss': 1.3298, 'grad_norm': 4.709969997406006, 'learning_rate': 0.0014833333333333335, 'epoch': 1.13}


[36m(RayTrainWorker pid=10193)[0m  26%|██▌       | 31/120 [00:07<00:10,  8.10it/s]
[36m(RayTrainWorker pid=10193)[0m                                                  27%|██▋       | 32/120 [00:07<00:10,  8.10it/s]


[2m[36m(pid=10309) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=10309) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]


Trial TorchTrainer_853b7_00014 completed after 1 iterations at 2025-07-14 05:39:28. Total running time: 9min 55s
+-------------------------------------------------------------+
| Trial TorchTrainer_853b7_00014 result                       |
+-------------------------------------------------------------+
| checkpoint_dir_name                       checkpoint_000000 |
| time_this_iter_s                                   20.19151 |
| time_total_s                                       20.19151 |
| training_iteration                                        1 |
| epoch                                               0.13333 |
| eval_accuracy                                           0.5 |
| eval_f1                                             0.22222 |
| eval_loss                                            1.4231 |
| eval_runtime                                          0.226 |
| eval_samples_per_second                             247.799 |
| eval_steps_per_second                               

You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.



Trial status: 15 TERMINATED | 1 PENDING
Current time: 2025-07-14 05:39:38. Total running time: 10min 6s
Logical resource usage: 0/1 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:A100)
Current best trial: 853b7_00009 with eval_loss=0.7809239625930786 and params={'train_loop_config': {'learning_rate': 0.0002, 'epochs': 2, 'batch_size': 16, 'weight_decay': 0.1}}

[36m(TrainTrainable pid=10562)[0m 2025-07-14 05:39:38.962956: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(SplitCoordinator pid=10309)[0m Registered dataset logger for dataset eval_32_1
[36m(SplitCoordinator pid=10309)[0m Starting execution of Dataset eval_32_1. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=10309)[0m Execution plan of Dataset eval_32_1: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
[36m(SplitCoordinator pid=10309)[0m ✔️  Dataset eval_32_1 execution finished in 0.11 seconds
[36m(TrainTrainable pid=10562)[0m 2025-07-14 05:39:38.980118: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory 


Trial TorchTrainer_853b7_00015 started with configuration:
+-------------------------------------------------+
| Trial TorchTrainer_853b7_00015 config           |
+-------------------------------------------------+
| train_loop_config/batch_size                 32 |
| train_loop_config/epochs                      8 |
| train_loop_config/learning_rate            0.02 |
| train_loop_config/weight_decay            0.001 |
+-------------------------------------------------+


[36m(RayTrainWorker pid=10669)[0m Setting up process group for: env:// [rank=0, world_size=1]
[36m(TorchTrainer pid=10562)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=10562)[0m - (node_id=dfac881cdf0b95ec3c65ef7c281e0d5ea5b341dd9a6d97ecd807b1a3, ip=172.28.0.12, pid=10669) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=10669)[0m 2025-07-14 05:39:48.891568: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(RayTrainWorker pid=10669)[0m 2025-07-14 05:39:48.907934: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[36m(RayTrainWorker pid=10669)[0m E0000 00:00:1752471588.929326   10669 cuda_dnn.cc

[36m(RayTrainWorker pid=10669)[0m CUDA available: True


[36m(RayTrainWorker pid=10669)[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
[36m(RayTrainWorker pid=10669)[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[36m(RayTrainWorker pid=10669)[0m Beginning training...


[36m(RayTrainWorker pid=10669)[0m wandb: Currently logged in as: nicolepcx to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
[36m(RayTrainWorker pid=10669)[0m wandb: Tracking run with wandb version 0.21.0
[36m(RayTrainWorker pid=10669)[0m wandb: Run data is saved locally in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/artifacts/2025-07-14_05-29-32/tune_transformers/working_dirs/TorchTrainer_853b7_00015_15_batch_size=32,epochs=8,learning_rate=0.0200,weight_decay=0.0010_2025-07-14_05-29-32/wandb/run-20250714_053956-gnxfkw4x
[36m(RayTrainWorker pid=10669)[0m wandb: Run `wandb offline` to turn off syncing.
[36m(RayTrainWorker pid=10669)[0m wandb: Syncing run superglue_cb
[36m(RayTrainWorker pid=10669)[0m wandb: ⭐️ View project at https://wandb.ai/nicolepcx/huggingface
[36m(RayTrainWorker pid=10669)[0m wandb: 🚀 View run at https://wandb.ai/nicolepcx/huggingface/runs/gnxfkw4x
  0%|          | 0/120 [00:00<?, ?it/s]


[2m[36m(pid=10737) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=10737) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(SplitCoordinator pid=10737)[0m Registered dataset logger for dataset train_33_0
[36m(SplitCoordinator pid=10737)[0m Starting execution of Dataset train_33_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=10737)[0m Execution plan of Dataset train_33_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
  1%|          | 1/120 [00:00<01:06,  1.78it/s]
  2%|▎         | 3/120 [00:00<00:23,  5.02it/s]
  4%|▍         | 5/120 [00:00<00:14,  7.89it/s]
  6%|▌         | 7/120 [00:00<00:10, 10.28it/s]
[36m(SplitCoordinator pid=10737)[0m ✔️  Dataset train_33_0 execution finished in 0.95 seconds
  8%|▊         | 9/120 [00:01<00:09, 11.53it/s]
  9%|▉         | 11/120 [00:01<00:09, 11.27it/s]


[36m(RayTrainWorker pid=10669)[0m {'loss': 28.9163, 'grad_norm': 120.58164978027344, 'learning_rate': 0.0175, 'epoch': 0.13}


[36m(RayTrainWorker pid=10669)[0m  11%|█         | 13/120 [00:01<00:08, 12.16it/s]
[36m(RayTrainWorker pid=10669)[0m  12%|█▎        | 15/120 [00:01<00:07, 13.45it/s]                                                 13%|█▎        | 16/120 [00:01<00:07, 13.45it/s]


[2m[36m(pid=10787) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=10787) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=10669)[0m                                                  13%|█▎        | 16/120 [00:01<00:07, 13.45it/s]


[36m(RayTrainWorker pid=10669)[0m {'eval_loss': 23.915176391601562, 'eval_accuracy': 0.4107142857142857, 'eval_f1': 0.1940928270042194, 'eval_runtime': 0.22, 'eval_samples_per_second': 254.49, 'eval_steps_per_second': 9.089, 'epoch': 0.13}

Trial TorchTrainer_853b7_00015 completed after 1 iterations at 2025-07-14 05:40:02. Total running time: 10min 30s
+-------------------------------------------------------------+
| Trial TorchTrainer_853b7_00015 result                       |
+-------------------------------------------------------------+
| checkpoint_dir_name                       checkpoint_000000 |
| time_this_iter_s                                   19.07696 |
| time_total_s                                       19.07696 |
| training_iteration                                        1 |
| epoch                                               0.13333 |
| eval_accuracy                                       0.41071 |
| eval_f1                                             0.19409 |
| e

[36m(RayTrainWorker pid=10669)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/tune_transformers/TorchTrainer_853b7_00015_15_batch_size=32,epochs=8,learning_rate=0.0200,weight_decay=0.0010_2025-07-14_05-29-32/checkpoint_000000)
[36m(SplitCoordinator pid=10787)[0m Registered dataset logger for dataset eval_34_0
[36m(SplitCoordinator pid=10787)[0m Starting execution of Dataset eval_34_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=10787)[0m Execution plan of Dataset eval_34_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
2025-07-14 05:40:10,442	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/root/ray_results/tune_transformers' in 7.7316s.



Trial status: 16 TERMINATED
Current time: 2025-07-14 05:40:10. Total running time: 10min 38s
Logical resource usage: 0/1 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:A100)
Current best trial: 853b7_00009 with eval_loss=0.7809239625930786 and params={'train_loop_config': {'learning_rate': 0.0002, 'epochs': 2, 'batch_size': 16, 'weight_decay': 0.1}}

## View training results

View the results of the tuning run as a pandas DataFrame, sorted by `eval_loss` so that the best trial comes first.

In [None]:
tune_results.get_dataframe().sort_values("eval_loss")

Unnamed: 0,loss,grad_norm,learning_rate,epoch,step,eval_loss,eval_accuracy,eval_f1,eval_runtime,eval_samples_per_second,...,pid,hostname,node_ip,time_since_restore,iterations_since_restore,config/train_loop_config/learning_rate,config/train_loop_config/epochs,config/train_loop_config/batch_size,config/train_loop_config/weight_decay,logdir
9,,,,,,0.780924,0.678571,,,,...,7513,fe77ef01b3ab,172.28.0.12,31.517041,3,0.0002,2,16,0.1,853b7_00009
12,,,,,,0.820447,0.696429,,,,...,8946,fe77ef01b3ab,172.28.0.12,50.806218,5,2e-05,4,64,0.001,853b7_00012
0,,,,,,0.845492,0.678571,,,,...,3042,fe77ef01b3ab,172.28.0.12,57.555266,7,2e-05,6,128,0.0,853b7_00000
8,,,,,,0.906519,0.535714,,,,...,7012,fe77ef01b3ab,172.28.0.12,28.17588,3,2e-05,2,32,0.1,853b7_00008
4,0.9009,3.37315,6.666667e-07,1.466667,30.0,0.911185,0.589286,0.395522,0.185,302.643,...,5145,fe77ef01b3ab,172.28.0.12,26.998468,2,2e-05,2,128,0.01,853b7_00004
13,1.0283,1.396966,9.666667e-05,1.266667,32.0,0.95305,0.5,0.222222,0.1939,288.878,...,9575,fe77ef01b3ab,172.28.0.12,31.527412,2,0.0002,4,16,0.001,853b7_00013
1,0.9918,2.880574,0.00015,0.266667,16.0,1.011027,0.5,0.222222,0.2131,262.767,...,3817,fe77ef01b3ab,172.28.0.12,20.134871,1,0.0002,4,16,0.0,853b7_00001
5,0.9997,5.086144,0.00015,0.266667,16.0,1.06974,0.5,0.222222,0.2119,264.221,...,5635,fe77ef01b3ab,172.28.0.12,19.229907,1,0.0002,4,32,0.01,853b7_00005
2,1.5553,14.364967,0.0015,0.266667,16.0,1.420181,0.5,0.222222,0.2071,270.416,...,4259,fe77ef01b3ab,172.28.0.12,19.510899,1,0.002,4,16,0.0,853b7_00002
14,2.0457,8.499955,0.00175,0.133333,16.0,1.423099,0.5,0.222222,0.226,247.799,...,10093,fe77ef01b3ab,172.28.0.12,20.191509,1,0.002,8,128,0.001,853b7_00014
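Several trials in the table report `eval_accuracy` of 0.5 together with `eval_f1` of 0.222222. That pair is what you would see if the model predicted a single class for every validation example. The sketch below checks the arithmetic under stated assumptions that are not confirmed by the notebook: three classes, macro-averaged F1, and 56 validation examples (0.2096 s × 267.191 samples/s ≈ 56) of which half belong to the predicted class. The 23/5 split of the remaining examples is arbitrary and does not affect the result.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Assumed validation labels: 56 examples, 28 of them in class 1.
y_true = [0] * 23 + [1] * 28 + [2] * 5
# Degenerate model: always predicts class 1.
y_pred = [1] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = macro_f1(y_true, y_pred, labels=[0, 1, 2])
print(accuracy, round(f1, 6))
```

If that reading is right, those trials never moved past a single-class prediction, which would also explain why their accuracy and F1 are identical across different hyperparameters.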


## Get best hyperparameters

In [None]:
best_hyperparameters = tune_results.get_best_result()

In [None]:
best_trial = tune_results.get_best_result(metric="eval_loss", mode="min")
train_loop_config = best_trial.config['train_loop_config']
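The `metric` and `mode` arguments decide which trial counts as best, and the choice matters here: the trial with the lowest `eval_loss` is not the one with the highest `eval_accuracy`. The following is a minimal pure-Python sketch of that selection idea, using three rows from the results table above. The `best_trial` helper is hypothetical and is not Ray's implementation.

```python
def best_trial(rows, metric, mode):
    """Pick the row with the smallest (mode='min') or largest (mode='max') metric."""
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    pick = min if mode == "min" else max
    return pick(rows, key=lambda row: row[metric])


# Values copied from the first three rows of the results table.
trials = [
    {"trial": "853b7_00009", "eval_loss": 0.780924, "eval_accuracy": 0.678571},
    {"trial": "853b7_00012", "eval_loss": 0.820447, "eval_accuracy": 0.696429},
    {"trial": "853b7_00000", "eval_loss": 0.845492, "eval_accuracy": 0.678571},
]

by_loss = best_trial(trials, metric="eval_loss", mode="min")
by_accuracy = best_trial(trials, metric="eval_accuracy", mode="max")
print(by_loss["trial"], by_accuracy["trial"])
```

Selecting on `eval_loss` picks `853b7_00009`, while selecting on `eval_accuracy` would pick `853b7_00012`, so it is worth stating the metric explicitly rather than relying on a default.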

## Train the model with best hyperparameters

In [None]:
# Preparing the configuration with the best hyperparameters
trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    train_loop_config=train_loop_config,
    scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    datasets={
        "train": ray_datasets["train"],
        "eval": ray_datasets["validation"],
    },
    run_config=RunConfig(
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
            checkpoint_score_attribute="eval_loss",
            checkpoint_score_order="min",
        ),
    ),
)

# Start the training process with the best hyperparameters
result = trainer.fit()




View detailed results here: /root/ray_results/TorchTrainer_2025-07-14_05-40-10




Training started with configuration:
+------------------------------------------+
| Training config                          |
+------------------------------------------+
| train_loop_config/batch_size          16 |
| train_loop_config/epochs               2 |
| train_loop_config/learning_rate   0.0002 |
| train_loop_config/weight_decay       0.1 |
+------------------------------------------+


[36m(TorchTrainer pid=11035)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=11035)[0m - (node_id=dfac881cdf0b95ec3c65ef7c281e0d5ea5b341dd9a6d97ecd807b1a3, ip=172.28.0.12, pid=11136) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=11136)[0m Setting up process group for: env:// [rank=0, world_size=1]
[36m(RayTrainWorker pid=11136)[0m 2025-07-14 05:40:26.043580: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[36m(RayTrainWorker pid=11136)[0m 2025-07-14 05:40:26.059935: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[36m(RayTrainWorker pid=11136)[0m E0000 00:00:1752471626.083752   11136 cuda_dnn.cc

[36m(RayTrainWorker pid=11136)[0m CUDA available: True


[36m(RayTrainWorker pid=11136)[0m Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
[36m(RayTrainWorker pid=11136)[0m You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[36m(RayTrainWorker pid=11136)[0m Beginning training...


[36m(RayTrainWorker pid=11136)[0m wandb: Currently logged in as: nicolepcx to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
[36m(RayTrainWorker pid=11136)[0m wandb: Tracking run with wandb version 0.21.0
[36m(RayTrainWorker pid=11136)[0m wandb: Run data is saved locally in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/artifacts/2025-07-14_05-40-10/TorchTrainer_2025-07-14_05-40-10/working_dirs/TorchTrainer_01ca6_00000_0_2025-07-14_05-40-10/wandb/run-20250714_054033-nt76batq
[36m(RayTrainWorker pid=11136)[0m wandb: Run `wandb offline` to turn off syncing.
[36m(RayTrainWorker pid=11136)[0m wandb: Syncing run superglue_cb
[36m(RayTrainWorker pid=11136)[0m wandb: ⭐️ View project at https://wandb.ai/nicolepcx/huggingface
[36m(RayTrainWorker pid=11136)[0m wandb: 🚀 View run at https://wandb.ai/nicolepcx/huggingface/runs/nt76batq
(SplitCoordinator pid=11202) Registered dataset logger for dataset train_35_0
(SplitCoordinator pid=11202) ✔️  Dataset train_35_0 execution finished in 0.95 seconds

(RayTrainWorker pid=11136) {'loss': 1.0042, 'grad_norm': 2.1445605754852295, 'learning_rate': 0.0001, 'epoch': 0.53}

(RayTrainWorker pid=11136) {'eval_loss': 0.9425932765007019, 'eval_accuracy': 0.5, 'eval_f1': 0.2222222222222222, 'eval_runtime': 0.2096, 'eval_samples_per_second': 267.191, 'eval_steps_per_second': 19.085, 'epoch': 0.53}

(RayTrainWorker pid=11136) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/TorchTrainer_2025-07-14_05-40-10/TorchTrainer_01ca6_00000_0_2025-07-14_05-40-10/checkpoint_000000)
(SplitCoordinator pid=11256) Registered dataset logger for dataset eval_36_0
(SplitCoordinator pid=11256) Starting execution of Dataset eval_36_0. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
(SplitCoordinator pid=11256) Execution plan of Dataset eval_36_0: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=11256) ✔️  Dataset eval_36_0 execution finished in 0.11 seconds
(SplitCoordinator pid=11202) Registered dataset logger for dataset train_35_1



Training finished iteration 1 at 2025-07-14 05:40:42. Total running time: 31s
+---------------------------------------------+
| Training result                             |
+---------------------------------------------+
| checkpoint_dir_name       checkpoint_000000 |
| time_this_iter_s                   20.30288 |
| time_total_s                       20.30288 |
| training_iteration                        1 |
| epoch                               0.53333 |
| eval_accuracy                           0.5 |
| eval_f1                             0.22222 |
| eval_loss                           0.94259 |
| eval_runtime                         0.2096 |
| eval_samples_per_second             267.191 |
| eval_steps_per_second                19.085 |
| grad_norm                           2.14456 |
| learning_rate                        0.0001 |
| loss                                 1.0042 |
| step                                     16 |
+---------------------------------------------+


[36m(SplitCoordinator pid=11202)[0m Starting execution of Dataset train_35_1. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
[36m(SplitCoordinator pid=11202)[0m Execution plan of Dataset train_35_1: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]


[2m[36m(pid=11202) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=11202) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=11136)[0m  57%|█████▋    | 17/30 [00:07<00:13,  1.03s/it]
You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.
 63%|██████▎   | 19/30 [00:07<00:08,  1.37it/s]


Training saved a checkpoint for iteration 1 at: (local)/root/ray_results/TorchTrainer_2025-07-14_05-40-10/TorchTrainer_01ca6_00000_0_2025-07-14_05-40-10/checkpoint_000000


 70%|███████   | 21/30 [00:07<00:04,  1.91it/s]
 77%|███████▋  | 23/30 [00:08<00:02,  2.62it/s]
[36m(SplitCoordinator pid=11202)[0m ✔️  Dataset train_35_1 execution finished in 0.55 seconds
 83%|████████▎ | 25/30 [00:08<00:01,  3.47it/s]


[36m(RayTrainWorker pid=11136)[0m {'loss': 0.9108, 'grad_norm': 4.80816650390625, 'learning_rate': 6.666666666666667e-06, 'epoch': 1.47}


[36m(RayTrainWorker pid=11136)[0m  90%|█████████ | 27/30 [00:08<00:00,  4.46it/s]
[36m(RayTrainWorker pid=11136)[0m  97%|█████████▋| 29/30 [00:08<00:00,  5.65it/s]                                               100%|██████████| 30/30 [00:08<00:00,  5.65it/s]


[2m[36m(pid=11256) [0mRunning 0: 0.00 row [00:00, ? row/s]

[2m[36m(pid=11256) [0m- split(1, equal=True) 1: 0.00 row [00:00, ? row/s]

[36m(RayTrainWorker pid=11136)[0m {'eval_loss': 0.8720429539680481, 'eval_accuracy': 0.6607142857142857, 'eval_f1': 0.4612159329140461, 'eval_runtime': 0.1819, 'eval_samples_per_second': 307.842, 'eval_steps_per_second': 21.989, 'epoch': 1.47}


[36m(RayTrainWorker pid=11136)[0m                                                100%|██████████| 30/30 [00:08<00:00,  5.65it/s]



Training finished iteration 2 at 2025-07-14 05:40:47. Total running time: 37s
+---------------------------------------------+
| Training result                             |
+---------------------------------------------+
| checkpoint_dir_name       checkpoint_000001 |
| time_this_iter_s                    5.19791 |
| time_total_s                       25.50079 |
| training_iteration                        2 |
| epoch                               1.46667 |
| eval_accuracy                       0.66071 |
| eval_f1                             0.46122 |
| eval_loss                           0.87204 |
| eval_runtime                         0.1819 |
| eval_samples_per_second             307.842 |
| eval_steps_per_second                21.989 |
| grad_norm                           4.80817 |
| learning_rate                       0.00001 |
| loss                                 0.9108 |
| step                                     30 |
+---------------------------------------------+

(RayTrainWorker pid=11136) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/TorchTrainer_2025-07-14_05-40-10/TorchTrainer_01ca6_00000_0_2025-07-14_05-40-10/checkpoint_000001)
(SplitCoordinator pid=11256) Registered dataset logger for dataset eval_36_1
(SplitCoordinator pid=11256) Starting execution of Dataset eval_36_1. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
(SplitCoordinator pid=11256) Execution plan of Dataset eval_36_1: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=11256) ✔️  Dataset eval_36_1 execution finished in 0.11 seconds
(SplitCoordinator pid=11256) Registered dataset logger for dataset eval_36_2
(SplitCoordinator pid=11256) Starting execution of Dataset eval_36_2. Full logs are in /tmp/ray/session_2025-07-14_05-29-01_733730_2160/logs/ray-data
(SplitCoordinator pid=11256) Execution plan of Dataset eval_36_2: InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=11256) ✔️  Dataset eval_36_2 execution finished in 0.11 seconds

(RayTrainWorker pid=11136) Final evaluation: {'eval_loss': 0.8720429539680481, 'eval_accuracy': 0.6607142857142857, 'eval_f1': 0.4612159329140461, 'eval_runtime': 0.2018, 'eval_samples_per_second': 277.539, 'eval_steps_per_second': 19.824, 'epoch': 1.4666666666666668}

(RayTrainWorker pid=11136) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/TorchTrainer_2025-07-14_05-40-10/TorchTrainer_01ca6_00000_0_2025-07-14_05-40-10/checkpoint_000002)
You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.



Training finished iteration 3 at 2025-07-14 05:40:48. Total running time: 38s
+-----------------------------------------+
| Training result                         |
+-----------------------------------------+
| checkpoint_dir_name   checkpoint_000002 |
| time_this_iter_s                1.15634 |
| time_total_s                   26.65713 |
| training_iteration                    3 |
| eval_accuracy                   0.66071 |
| eval_loss                       0.87204 |
+-----------------------------------------+
Training saved a checkpoint for iteration 3 at: (local)/root/ray_results/TorchTrainer_2025-07-14_05-40-10/TorchTrainer_01ca6_00000_0_2025-07-14_05-40-10/checkpoint_000002

Training completed after 3 iterations at 2025-07-14 05:40:50. Total running time: 39s


2025-07-14 05:40:50,358	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/root/ray_results/TorchTrainer_2025-07-14_05-40-10' in 0.3288s.
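The learning rates logged during this run (0.0001 at step 16 and about 6.67e-06 at step 30) appear consistent with a linear decay from the configured 0.0002 down to zero over the 30 training steps, with no warmup. The sketch below reproduces those two values under that assumption. It further assumes that the logged value is the rate in effect before the scheduler update of the logged step, so step 16 corresponds to 15 completed scheduler updates. Neither assumption is confirmed by the notebook, and `linear_decay_lr` is a hypothetical helper.

```python
def linear_decay_lr(base_lr, total_steps, steps_taken):
    """Learning rate after `steps_taken` scheduler updates of a linear decay to zero."""
    remaining = max(0, total_steps - steps_taken)
    return base_lr * remaining / total_steps


base_lr = 0.0002   # learning_rate of the best trial
total_steps = 30   # length of the progress bar in the training log

# Assumption: the value logged at step N reflects N - 1 scheduler updates.
lr_logged_at_step_16 = linear_decay_lr(base_lr, total_steps, steps_taken=15)
lr_logged_at_step_30 = linear_decay_lr(base_lr, total_steps, steps_taken=29)
print(lr_logged_at_step_16, lr_logged_at_step_30)
```

Under this reading, the model spends the last steps of such a short run at a very small learning rate, which is worth keeping in mind when comparing trials with different epoch counts.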





## Load model from checkpoint

In [None]:
wandb.finish()

In [None]:
result.checkpoint

Checkpoint(filesystem=local, path=/root/ray_results/TorchTrainer_2025-07-14_05-40-10/TorchTrainer_01ca6_00000_0_2025-07-14_05-40-10/checkpoint_000002)
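The `CheckpointConfig` used for the final run asks for one retained checkpoint, ranked by `eval_loss` in `min` order. The following is a simplified sketch of that kind of top-k retention, fed with the three `eval_loss` values reported in the training log. It is not Ray's implementation: `retain_top_k` is a hypothetical helper, and breaking ties in favor of the newer checkpoint is an assumption of this sketch.

```python
def retain_top_k(checkpoints, k, order="min"):
    """Return the names of the k best-scoring checkpoints.

    `checkpoints` is a list of (name, score) pairs in the order they were saved.
    Ties are broken in favor of the more recently saved checkpoint (assumption).
    """
    if order not in ("min", "max"):
        raise ValueError("order must be 'min' or 'max'")
    sign = 1 if order == "min" else -1
    ranked = sorted(
        enumerate(checkpoints),
        key=lambda item: (sign * item[1][1], -item[0]),
    )
    return [name for _, (name, _) in ranked[:k]]


# eval_loss values taken from the three training result tables in the log.
reported = [
    ("checkpoint_000000", 0.94259),
    ("checkpoint_000001", 0.87204),
    ("checkpoint_000002", 0.87204),
]

kept = retain_top_k(reported, k=1, order="min")
print(kept)
```

With `num_to_keep=1`, every newly saved checkpoint displaces an older one, which is likely why the log suggests increasing `num_to_keep` or saving checkpoints less often.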

In [None]:
with result.checkpoint.as_directory() as checkpoint_dir:
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint_dir)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
    print(f"Model loaded from {checkpoint_dir}")


Model loaded from /root/ray_results/TorchTrainer_2025-07-14_05-40-10/TorchTrainer_01ca6_00000_0_2025-07-14_05-40-10/checkpoint_000002



## Share the model

To share the model via Hugging Face, additional steps are necessary.

You have to get your authentication token from the Hugging Face website. If you're not registered yet, create an account [here](https://huggingface.co/join). Then run the cell below and enter your access token when prompted:

In [None]:
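# Uncomment to log in and upload; needs `from huggingface_hub import notebook_login`.
# notebook_login()
# model.push_to_hub("<your-username>/<repo-name>")  # placeholder repo ID, replace with your own
# tokenizer.push_to_hub("<your-username>/<repo-name>")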