# LLM Classification Finetuning

# Approach & rationale (why I pivoted to “no-training” with big pretrained models)

**TL;DR:** I started from the Keras starter (JAX backend, DeBERTa-v3-XS, L=512, \~1.036 public LB). The plan was to “do everything right”: bigger model, longer context, 5-fold CV, augmentation, more epochs, tuned LR. In practice, full fine-tuning at that scale was too slow on Kaggle GPUs. After removing the most obvious bottlenecks, I pivoted to **inference only with a strong pretrained LLM classifier** (Llama-3 8B, long context). This runs in **\~11 minutes on 2×T4** and scores much better, with no training.

## What I tried first (and why it wasn’t fast enough)

* Moved from DeBERTa-XS → **DeBERTa-Small/Base** and **L > 512** to keep more of each prompt/response.
* Kept the Keras design but optimized hard:

  * **Pre-tokenized** per fold on CPU.
  * **Single backbone call** per step (stack A/B along batch dim).
  * **Static shapes** and **offline A/B swap** (avoid JAX recompiles).
  * Mixed precision where possible; tried **JAX on P100** and **TF on 2×T4**.

* Reality check:

  * **DeBERTa-Base OOM** on P100.
  * DeBERTa-Small (L≈640), **5 folds × 6 epochs** → \~**48–49 hours** (both JAX/P100 and TF/2×T4).
    At that point the job is compute-bound; I’d need stronger hardware or a smaller training plan (fewer folds or epochs).
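
To put the 48–49 hours in perspective, here is a rough budget check. It is only arithmetic on the numbers above; the 12-hour cap (`SESSION_LIMIT_HOURS`) is my assumption about the length of a single Kaggle GPU session, not something measured in this run.

```python
import math

# Numbers taken from the DeBERTa-Small run described above.
FOLDS = 5
EPOCHS = 6
TOTAL_HOURS = 48.5  # midpoint of the observed ~48-49 h

# Assumption (not from this notebook): one GPU session lasts at most 12 h.
SESSION_LIMIT_HOURS = 12.0

fold_epochs = FOLDS * EPOCHS                        # 30 training passes in total
hours_per_fold_epoch = TOTAL_HOURS / fold_epochs    # time for one epoch of one fold
hours_per_fold = hours_per_fold_epoch * EPOCHS      # time for one complete fold
sessions_needed = math.ceil(TOTAL_HOURS / SESSION_LIMIT_HOURS)

print(f"{hours_per_fold_epoch:.2f} h per fold-epoch")
print(f"{hours_per_fold:.1f} h per fold")
print(f"{sessions_needed} sessions at {SESSION_LIMIT_HOURS:.0f} h each")
```

Under that assumption a single fold already takes most of a session, so the full plan would have to be split across several sessions with checkpointing in between.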

## What I shipped instead (fast + strong)

* **No training.** I run a **pretrained preference-classification model**:

  * **Llama3-8B** with **long context (4096)**.
  * Use **pipeline parallelism across 2 GPUs** for fast batched inference.
* A **lightweight semantic-similarity model** (SentenceTransformer + FAISS) is **not** used in this notebook; the submission comes from the Llama classifier alone.
* Net result: strong leaderboard score in **\~11 minutes on 2×T4**, zero fine-tune time.
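
The two-stage pipelining mentioned above follows a prime / steady-state / flush pattern. Below is a minimal sketch of that schedule in plain Python. `run_pipeline`, `stage0`, and `stage1` are hypothetical names for illustration; the stages here are ordinary functions, so the calls run one after another. On real GPUs the overlap would come from asynchronous kernel launches on two different devices.

```python
from typing import Callable, Iterable, List


def run_pipeline(
    micro_batches: Iterable[int],
    stage0: Callable[[int], int],
    stage1: Callable[[int], int],
) -> List[int]:
    """Feeds micro-batches through two stages with a one-step offset."""
    outputs: List[int] = []
    pending = None   # stage-0 result waiting for stage-1
    primed = False   # False until the first micro-batch has passed stage-0

    for mb in micro_batches:
        if not primed:
            # Prime: only stage-0 has work for the very first micro-batch.
            pending = stage0(mb)
            primed = True
            continue
        # Steady state: stage-1 finishes the previous micro-batch,
        # stage-0 starts the current one.
        outputs.append(stage1(pending))
        pending = stage0(mb)

    if primed:
        # Flush: the last micro-batch still has to go through stage-1.
        outputs.append(stage1(pending))
    return outputs


# Toy stages that record the order in which they are called.
events: List[str] = []


def toy_stage0(x: int) -> int:
    events.append(f"s0:{x}")
    return x * 10


def toy_stage1(h: int) -> int:
    events.append(f"s1:{h}")
    return h + 1


outputs = run_pipeline([1, 2, 3], toy_stage0, toy_stage1)
print(outputs)
print(events)
```

Results come out in input order, which is why the script can concatenate them and line them up with the rows of the parquet file.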

## Why this direction

* With the available GPUs, **full fine-tuning at scale isn’t time-feasible** (even after heavy optimization).
* **Pretrained LLM classifiers** already encode strong preference signals and **benefit from long context**.
* **Inference-only** delivers a **much better score in minutes**, not days.

## If I had more compute (future work)

* Revisit **TF multi-GPU** fine-tuning of DeBERTa-Small/Base (or an encoder LLM) with L≥640, 5-fold CV, more epochs.
* Try **LoRA/QLoRA** adapters on the LLM classifiers for a small, targeted fine-tune.

# Environment

In [9]:
%%capture
# Remove packages that would force Torch back to 2.6.0 during dependency resolution
%pip uninstall -y torchvision torchaudio || true

# Point to your attached wheels dataset under /kaggle/input
PACKAGES_DIR = "/kaggle/input/offline-pytorch280-xformers032/wheels"  # ← change to your dataset path

# Install strictly from local wheels (no internet)
%pip install --no-index --find-links=$PACKAGES_DIR \
    torch==2.8.0 xformers==0.0.32.post2 triton==3.4.0

In [10]:
!cp -r /kaggle/input/lmsys-modules-0805 human_pref

In [None]:
import glob
import os
import sys
import textwrap

import torch
import transformers

print("Torch:", torch.__version__)
print("Transformers:", transformers.__version__)
print("CUDA devices:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}:", torch.cuda.get_device_name(i))

# Small speed hints
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Offline inference only (no hub calls)
os.environ["TRANSFORMERS_OFFLINE"] = "1"
os.environ["HF_HUB_OFFLINE"] = "1"

# Paths to your attached datasets (update if your names differ)
MODEL_DIR = "/kaggle/input/lmsys-checkpoints-3-0805"
MODULES_DIR = "/kaggle/input/lmsys-modules-0805"

print("\nAttached inputs under /kaggle/input:")
for p in sorted(glob.glob("/kaggle/input/*")):
    print(" •", p)

if not os.path.isdir(MODEL_DIR):
    raise FileNotFoundError(
        textwrap.dedent(f"""
        MODEL_DIR not found: {MODEL_DIR}
        → Use the right sidebar → Add data and attach the Llama‑3 8B classifier checkpoint dataset.
    """)
    )

# Make `human_pref` importable
if os.path.isdir(MODULES_DIR) and MODULES_DIR not in sys.path:
    sys.path.insert(0, MODULES_DIR)
    print("Using helper modules from:", MODULES_DIR)

Torch: 2.8.0+cu128
Transformers: 4.52.4
CUDA devices: 2
GPU 0: Tesla T4
GPU 1: Tesla T4

Attached inputs under /kaggle/input:
 • /kaggle/input/deberta_v3
 • /kaggle/input/llm-classification-finetuning
 • /kaggle/input/lmsys-checkpoints-3-0805
 • /kaggle/input/lmsys-modules-0805
 • /kaggle/input/offline-pytorch280-xformers032


# Prepare test files (original + swapped A/B)

In [4]:
%%writefile prepare_test_file.py
import pandas as pd

df = pd.read_csv("/kaggle/input/llm-classification-finetuning/test.csv")
# (Not scored here, but some helpers expect these columns.)
for k in ("winner_model_a", "winner_model_b", "winner_tie"):
    if k not in df.columns:
        df[k] = 0

df.to_parquet("test.parquet", index=False)

# Create swapped version (A<->B) for symmetry
sw = df.copy()
sw["response_a"], sw["response_b"] = sw["response_b"], sw["response_a"]
sw.to_parquet("test_swap.parquet", index=False)

Overwriting prepare_test_file.py


In [5]:
!python prepare_test_file.py

# Llama‑3 8B classifier inference (4k context, offline)

In [None]:
%%writefile predict_llama.py
import os
import numpy as np
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer
from xformers.ops.fmha.attn_bias import BlockDiagonalCausalMask

from human_pref.models.modeling_llama import LlamaForSequenceClassification
from human_pref.data.processors import ProcessorPAB
from human_pref.data.dataset import LMSYSDataset
from human_pref.data.collators import VarlenCollator, ShardedMaxTokensCollator
from human_pref.utils import to_device

MODEL_DIR = os.getenv("MODEL_DIR", "/kaggle/input/lmsys-checkpoints-3-0805")
MAX_LENGTH = int(os.getenv("MAX_LENGTH", "4096"))  # longer context
BATCH_SIZE = int(
    os.getenv("BATCH_SIZE", "80")
)  # loader batch (split into micro-batches)
MAX_TOKENS = int(os.getenv("MAX_TOKENS", "8192"))  # sharded tokens budget
NUM_WORKERS = int(os.getenv("NUM_WORKERS", "4"))
DTYPE = torch.float16

assert torch.cuda.is_available(), "GPU is required. Enable 2×T4 in Kaggle settings."

# ----------------------------------
# Data pipeline (var‑len friendly)
# ----------------------------------


def make_loader(parquet_path: str):
    """Builds a var-length DataLoader for the LMSYS test parquet.

    This constructs a tokenizer + `ProcessorPAB` that formats each example
    (prompt, response_a, response_b) for Llama-style classification with long
    context. It then creates an `LMSYSDataset` and a
    `ShardedMaxTokensCollator`, which packs variable-length sequences into
    micro-batches under a fixed token budget (`MAX_TOKENS`) for better GPU
    utilization.

    Args:
      parquet_path: Path to a `.parquet` file (e.g., "test.parquet"
        or "test_swap.parquet") containing the competition columns:
        ["prompt", "response_a", "response_b", ...].

    Returns:
      A `torch.utils.data.DataLoader` whose iterator yields **lists** of
      micro-batches. Each micro-batch is a dict with keys required by the
      var-length attention path:
        - "input_ids": LongTensor of concatenated token ids
        - "cu_seqlens": prefix sums of sequence lengths
        - "position_ids": per-token rotary positions
        - "max_seq_len": int (maximum sequence length in the micro-batch)
        - "seq_lens": lengths of each sequence in the micro-batch
    """
    # Load tokenizer from the local checkpoint directory.
    tok = AutoTokenizer.from_pretrained(MODEL_DIR)

    # Suppress the benign warning when sequences exceed the model's original max.
    tok.deprecation_warnings["sequence-length-is-longer-than-the-specified-maximum"] = (
        True
    )

    # Pack (prompt, A, B) into a single sequence suitable for sequence classification.
    proc = ProcessorPAB(tokenizer=tok, max_length=MAX_LENGTH, support_system_role=True)

    # Dataset reads rows from parquet and defers heavy work to the processor.
    ds = LMSYSDataset(
        csv_file=parquet_path,
        query=None,
        processor=proc,
        include_swap=False,
        is_parquet=True,
    )

    # Dynamic batching by tokens (not by examples) to keep memory usage stable.
    coll = ShardedMaxTokensCollator(
        max_tokens=MAX_TOKENS, base_collator=VarlenCollator()
    )

    # The DataLoader yields a list of micro-batches per outer batch, ready for pipeline run.
    return DataLoader(
        ds, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, collate_fn=coll
    )


# ----------------------------------
# Model across 2 GPUs (pipeline split)
# ----------------------------------

n_gpus = torch.cuda.device_count()
if n_gpus == 0:
    raise RuntimeError("No GPU available. Please enable 2×T4.")

# 32 transformer layers for Llama‑3 8B
NUM_LAYERS = 32
if n_gpus >= 2:
    device_map = {
        "model.embed_tokens": "cuda:0",
        "model.norm": "cuda:1",
        "score": "cuda:1",
    }
    for i in range(NUM_LAYERS // 2):
        device_map[f"model.layers.{i}"] = "cuda:0"
    for i in range(NUM_LAYERS // 2, NUM_LAYERS):
        device_map[f"model.layers.{i}"] = "cuda:1"
else:
    device_map = {"": "cuda:0"}

model = LlamaForSequenceClassification.from_pretrained(
    MODEL_DIR, torch_dtype=DTYPE, device_map=device_map
)
model.eval()

# Build RoPE inv_freq per device (one per pipeline stage)
cfg = model.config
head_dim = cfg.hidden_size // cfg.num_attention_heads
inv = 1.0 / (
    cfg.rope_theta ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
)
inv0 = inv.to("cuda:0")
inv1 = inv.to("cuda:1" if n_gpus >= 2 else "cuda:0")

# ----------------------------------
# Pipeline run (micro‑batches)
# ----------------------------------


def run_one(parquet_path: str) -> torch.Tensor:
    """Runs pipelined two-GPU inference over one parquet file and returns probs.

    This implements a simple two-stage pipeline across 2×T4 when available:
    stage-0 runs `forward_part1` (lower layers) on GPU0 while stage-1 runs
    `forward_part2` (upper layers + classifier) on GPU1. The first micro-batch
    "primes" the pipeline; the last micro-batch is "flushed" at the end.

    Args:
      parquet_path: Path to the `.parquet` test file to score.

    Returns:
      A float32 `torch.Tensor` of shape `(N, 3)` on CPU containing softmax
      probabilities for `[winner_model_a, winner_model_b, winner_tie]` in the
      same order as rows in `parquet_path`.
    """
    loader = make_loader(parquet_path)
    outs = []
    is_first = True  # True until I prime stage-1 with the first micro-batch
    prev_hidden = None  # Activations handed from stage-0 -> stage-1
    prev_info = None  # Attention/position metadata for the prior micro-batch

    with torch.no_grad(), torch.autocast(device_type="cuda", dtype=DTYPE):
        for batch in loader:
            # Each `batch` is a list of micro-batches produced by the sharded collator.
            for micro in batch:
                # Stage-0 always runs on cuda:0.
                input_ids = to_device(micro["input_ids"], "cuda:0")
                info = dict(
                    cu_seqlens=micro["cu_seqlens"],
                    position_ids=micro["position_ids"],
                    max_seq_len=micro["max_seq_len"],
                    # Block-diagonal mask for packed variable-length sequences.
                    attn_bias=BlockDiagonalCausalMask.from_seqlens(micro["seq_lens"]),
                )
                info = to_device(info, "cuda:0")

                if is_first:
                    # Prime the pipeline: produce hidden states on stage-0,
                    # then move them (and metadata) to stage-1's device.
                    prev_hidden = model.forward_part1(input_ids, info, inv0)
                    prev_info, prev_hidden = to_device(
                        [info, prev_hidden], "cuda:1" if n_gpus >= 2 else "cuda:0"
                    )
                    is_first = False
                    continue

                # While stage-1 finishes the previous micro-batch...
                logits = model.forward_part2(prev_hidden, prev_info, inv1)
                # ...stage-0 concurrently starts the current micro-batch.
                hidden = model.forward_part1(input_ids, info, inv0)

                # Slide the pipeline window forward and stash logits.
                prev_info, prev_hidden = to_device(
                    [info, hidden], "cuda:1" if n_gpus >= 2 else "cuda:0"
                )
                outs.append(logits.cpu())

        # Flush the final micro-batch through stage-1 after the loop.
        if prev_hidden is not None:
            logits = model.forward_part2(prev_hidden, prev_info, inv1)
            outs.append(logits.cpu())

    if not outs:
        # Empty input file or all examples filtered out.
        return torch.empty((0, 3))

    pred = torch.cat(outs, dim=0)
    return pred.softmax(-1)  # (N, 3)


# Original & swapped, then flip A/B back and average
prob_a = run_one("test.parquet")
prob_b = run_one("test_swap.parquet")
prob = (prob_a + prob_b[:, [1, 0, 2]]) / 2.0

np.save("prob_llama.npy", prob.cpu().numpy())
print("Saved prob_llama.npy with shape:", prob.shape)

Overwriting predict_llama.py


In [7]:
# Run inference
!python predict_llama.py

Loading checkpoint shards: 100%|██████████████████| 4/4 [01:48<00:00, 27.09s/it]
Saved prob_llama.npy with shape: torch.Size([3, 3])


# Make submission

In [None]:
import numpy as np
import pandas as pd

df = pd.read_parquet("test.parquet")  # ids from non‑swapped
prob = np.load("prob_llama.npy")

sub = pd.DataFrame(
    {
        "id": df["id"],
        "winner_model_a": prob[:, 0],
        "winner_model_b": prob[:, 1],
        "winner_tie": prob[:, 2],
    }
)
sub.to_csv("submission.csv", index=False)
sub.head()

|   | id | winner_model_a | winner_model_b | winner_tie |
|---|---|---|---|---|
| 0 | 136060 | 0.002493 | 0.983723 | 0.013783 |
| 1 | 211333 | 0.565638 | 0.136214 | 0.298148 |
| 2 | 1233961 | 0.107226 | 0.728414 | 0.16436 |
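
A quick sanity check on the rows shown above: each row comes from a softmax, so its three probabilities should lie in [0, 1] and sum to 1 up to rounding. The sketch below hard-codes the three displayed rows; `check_rows` and its tolerance are hypothetical and not part of the notebook.

```python
# Rows copied from the displayed head of the submission:
# (id, winner_model_a, winner_model_b, winner_tie)
rows = [
    (136060, 0.002493, 0.983723, 0.013783),
    (211333, 0.565638, 0.136214, 0.298148),
    (1233961, 0.107226, 0.728414, 0.16436),
]


def check_rows(rows, tol=1e-4):
    """Raises ValueError if a row is not a valid probability triple."""
    for row_id, *probs in rows:
        if any(p < 0.0 or p > 1.0 for p in probs):
            raise ValueError(f"id {row_id}: probability outside [0, 1]")
        if abs(sum(probs) - 1.0) > tol:
            raise ValueError(f"id {row_id}: probabilities sum to {sum(probs)}")
    return True


rows_ok = check_rows(rows)
print("all rows valid:", rows_ok)
```

The tolerance is loose on purpose because the displayed values are rounded to six decimals.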


# Notes

Single model only: Llama3‑8B sequence classifier. No Gemma, no SentenceTransformer/FAISS.

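Offline wheels and helper modules: The notebook installs torch 2.8.0, xformers 0.0.32.post2, and triton 3.4.0 from an attached wheels dataset (no internet access needed), and `predict_llama.py` imports both xformers and the `human_pref` helper package from an attached dataset.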

Longer context: MAX_LENGTH=4096 (≫ 512). If you OOM, first lower MAX_TOKENS (the token budget per micro-batch); only then reduce MAX_LENGTH.

2×T4 pipeline split: With two GPUs, an explicit device_map places the embeddings and layers 0–15 on cuda:0, and layers 16–31, the final norm, and the score head on cuda:1; with a single GPU, the whole model stays on cuda:0.
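
The `predict_llama.py` script above builds its device map by hand. Here is the same construction as a small standalone function, mostly to make the split point easy to check. `build_device_map` is a hypothetical helper name; the module names (`model.embed_tokens`, `model.layers.N`, `model.norm`, `score`) are the ones used in the script.

```python
from typing import Dict


def build_device_map(num_layers: int, n_gpus: int) -> Dict[str, str]:
    """Splits the decoder layers in half across two GPUs when available."""
    if n_gpus < 2:
        # Single GPU: the empty key places the whole model on one device.
        return {"": "cuda:0"}

    split = num_layers // 2
    device_map = {
        "model.embed_tokens": "cuda:0",  # input embeddings with the first half
        "model.norm": "cuda:1",          # final norm with the second half
        "score": "cuda:1",               # classification head with the second half
    }
    for i in range(num_layers):
        device_map[f"model.layers.{i}"] = "cuda:0" if i < split else "cuda:1"
    return device_map


device_map_2gpu = build_device_map(num_layers=32, n_gpus=2)
device_map_1gpu = build_device_map(num_layers=32, n_gpus=1)
print(device_map_2gpu["model.layers.15"], device_map_2gpu["model.layers.16"])
```

An even split is a simple starting point; whether it balances memory well depends on the embedding size and on the activations each stage holds, which I have not measured here.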

Symmetry trick: Score original and A/B‑swapped inputs, flip the swapped probabilities back (exchange the A and B columns, keep the tie column), and average to stabilize A/B preferences.
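
A worked example of the symmetry trick for a single row, in plain Python. The probability values are made up for illustration, and `symmetric_average` is a hypothetical helper that mirrors the `prob_b[:, [1, 0, 2]]` indexing in the script.

```python
from typing import List, Sequence


def symmetric_average(
    prob_orig: Sequence[float], prob_swap: Sequence[float]
) -> List[float]:
    """Averages the original prediction with the un-swapped swapped prediction.

    Both inputs are [p_a, p_b, p_tie]. In the swapped run, "A" is really the
    original B, so columns 0 and 1 are exchanged before averaging.
    """
    unswapped = [prob_swap[1], prob_swap[0], prob_swap[2]]
    return [(p + q) / 2.0 for p, q in zip(prob_orig, unswapped)]


# Made-up numbers: the model favors A on the original order...
prob_orig = [0.6, 0.3, 0.1]
# ...and favors "B" (the original A) even more strongly after the swap.
prob_swap = [0.2, 0.7, 0.1]

avg = symmetric_average(prob_orig, prob_swap)
print([round(p, 4) for p in avg])
```

Because each input row sums to 1, the averaged row also sums to 1, so no renormalization is needed.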

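Speed knobs: MAX_TOKENS (token budget per micro-batch, default 8192), BATCH_SIZE (loader batch that gets split into micro-batches, default 80), and NUM_WORKERS (default 4) are all read from environment variables. Keep the variable-length packing on for throughput.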
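
To make the token budget concrete, here is a minimal sketch of packing sequences into micro-batches under a `max_tokens` limit, plus the `cu_seqlens` prefix sums that a packed micro-batch carries. This is a greedy, in-order illustration under my own assumptions; the actual `ShardedMaxTokensCollator` in `human_pref` may order or split sequences differently.

```python
from itertools import accumulate
from typing import List, Sequence


def shard_by_token_budget(seq_lens: Sequence[int], max_tokens: int) -> List[List[int]]:
    """Greedily groups sequence lengths so each group stays within max_tokens.

    A sequence longer than max_tokens still gets a group of its own.
    """
    shards: List[List[int]] = []
    current: List[int] = []
    used = 0
    for n in seq_lens:
        if current and used + n > max_tokens:
            shards.append(current)   # close the group before it overflows
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        shards.append(current)
    return shards


def cu_seqlens(seq_lens: Sequence[int]) -> List[int]:
    """Prefix sums marking where each packed sequence starts and ends."""
    return [0, *accumulate(seq_lens)]


lengths = [3000, 4000, 2500, 4096, 100]  # made-up token counts
shards = shard_by_token_budget(lengths, max_tokens=8192)
print(shards)
print([cu_seqlens(s) for s in shards])
```

Packing by tokens rather than by example count keeps the work per micro-batch roughly constant even when prompt lengths vary a lot.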

# 📌 | Reference

* [[HCMUS][2025][24C15034] Ensemble Inference](https://www.kaggle.com/code/hoangvu132/hcmus-2025-24c15034-ensemble-inference)
* [LMSYS: KerasNLP Starter](https://www.kaggle.com/code/addisonhoward/lmsys-kerasnlp-starter)