# Kaggle: LLM Classification Finetuning


## 0. Environment and Dependencies

Primary libraries used in this notebook:

- `pandas`, `numpy`: data processing
- `matplotlib`, `seaborn`: visualization
- `scikit-learn`: dataset splitting and metrics (accuracy, log_loss)
- `transformers`, `datasets`, `torch`: LLM fine-tuning (e.g., DistilBERT)

Local environment (GPU setup):

- PyTorch: 2.7.1+cu118
- CUDA available: True
- CUDA version: 11.8
- GPU model: NVIDIA GeForce RTX 3050 Laptop GPU
- Current device: 0

Make sure all dependencies are installed before running the notebook.
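
One quick way to confirm this is to ask Python whether each module can be located before importing anything heavy. The sketch below uses only the standard library; `REQUIRED` and `find_missing` are hypothetical names introduced for illustration, and the list holds import names rather than pip package names.

```python
from importlib.util import find_spec

# Import names, not pip names: scikit-learn is imported as "sklearn".
REQUIRED = [
    "pandas", "numpy", "matplotlib", "seaborn",
    "sklearn", "torch", "transformers", "datasets",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if find_spec(name) is None]


missing = find_missing(REQUIRED)
print("Missing packages:", missing if missing else "none")
```

Anything reported as missing needs to be installed before the import cell below will run.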

In [None]:
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss

import torch
from datasets import Dataset
import transformers
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding,
)

sns.set_theme(style="whitegrid")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
if torch.cuda.is_available():
    print(f"GPU model: {torch.cuda.get_device_name(0)}")
    print(f"Current device: {torch.cuda.current_device()}")
print(f"Transformers version: {transformers.__version__}")
print(f"Python executable: {sys.executable}")

## 1. Data Loading

- Train set: `Dataset/train.csv`
- Test set: `Dataset/test.csv`

Train set columns:

- `id, model_a, model_b, prompt, response_a, response_b, winner_model_a, winner_model_b, winner_tie`

Test set columns:

- `id, prompt, response_a, response_b`

Goal: given the prompt and two responses, predict which response users prefer (A / B / tie).

In [None]:
train_path = os.path.join("Dataset", "train.csv")
test_path = os.path.join("Dataset", "test.csv")

df_train = pd.read_csv(train_path)
df_test = pd.read_csv(test_path)

print("Train shape:", df_train.shape)
print("Test shape:", df_test.shape)

df_train.head()

## 2. Exploratory Data Analysis (EDA)

Before training, visualize the dataset to understand its structure. Focus on:

- Label distribution (winner)
- Basic statistics and distributions of text lengths (prompt / response)
- A few sample rows to understand the task format

In [None]:
# Merge winner columns into a single label
def get_winner(row):
    if row["winner_model_a"] == 1:
        return "Model A"
    if row["winner_model_b"] == 1:
        return "Model B"
    return "Tie"

df_train["winner"] = df_train.apply(get_winner, axis=1)

df_train["winner"].value_counts()

In [None]:
df_train["len_prompt"] = df_train["prompt"].astype(str).apply(len)
df_train["len_resp_a"] = df_train["response_a"].astype(str).apply(len)
df_train["len_resp_b"] = df_train["response_b"].astype(str).apply(len)

q_prompt = df_train["len_prompt"].quantile(0.95)
q_resp_a = df_train["len_resp_a"].quantile(0.95)
q_resp_b = df_train["len_resp_b"].quantile(0.95)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Plot only values up to the 95% quantile so the 50 bins cover the visible range
sns.histplot(df_train.loc[df_train["len_prompt"] <= q_prompt, "len_prompt"], bins=50, ax=axes[0])
axes[0].set_title("Prompt Length (0-95% quantile)")
axes[0].set_xlim(0, q_prompt)

sns.histplot(df_train.loc[df_train["len_resp_a"] <= q_resp_a, "len_resp_a"], bins=50, ax=axes[1])
axes[1].set_title("Response A Length (0-95% quantile)")
axes[1].set_xlim(0, q_resp_a)

sns.histplot(df_train.loc[df_train["len_resp_b"] <= q_resp_b, "len_resp_b"], bins=50, ax=axes[2])
axes[2].set_title("Response B Length (0-95% quantile)")
axes[2].set_xlim(0, q_resp_b)

plt.tight_layout()
plt.show()

In [None]:
df_train[["prompt", "response_a", "response_b", "winner"]].head(3)

## 3. Model Choice and Principles

This is a **text multi-class classification** task: given `(prompt, response_a, response_b)`, predict which response is preferred (or tie).

This notebook uses **DistilBERT** as the base model:

- DistilBERT is a distilled, smaller BERT. The paper linked below reports a model about 40% smaller and 60% faster than BERT while retaining about 97% of its language-understanding performance.
- Pretraining learns general language representations; fine-tuning maps them to the preference classification task.
- Model card and paper:
  - https://huggingface.co/distilbert-base-uncased
  - https://arxiv.org/abs/1910.01108

### Input Construction Strategy

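We concatenate `(prompt, response_a, response_b)` into a single sequence so the model can compare both responses within one context window. Because the tokenizer truncates to 512 tokens, a long prompt or Response A can push part or all of Response B out of that window.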

```text
Prompt: <prompt> \n Response A: <response_a> \n Response B: <response_b>
```

Why this works:
- Self-attention aligns key information across segments, learning prompt-response alignment and A/B differences.
- The classifier reads a single global representation (e.g., [CLS] or pooled embedding), effectively comparing all three segments.
- A fixed template (Prompt / Response A / Response B) makes the input structure explicit and stable.

The output layer is a 3-class classifier:
- Class 0: Model A wins
- Class 1: Model B wins
- Class 2: Tie
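
The template and class mapping above can be sketched as two small helpers. The names `build_input`, `winner_to_label`, and `LABEL2ID` are hypothetical and exist only for this illustration; the template string and class IDs follow the ones listed in this section.

```python
# Hypothetical helper names; the class IDs match the list above.
LABEL2ID = {"Model A": 0, "Model B": 1, "Tie": 2}


def build_input(prompt: str, response_a: str, response_b: str) -> str:
    """Join the three text fields into one sequence using the fixed template."""
    return f"Prompt: {prompt} \n Response A: {response_a} \n Response B: {response_b}"


def winner_to_label(winner_model_a: int, winner_model_b: int) -> int:
    """Map the one-hot winner columns to a class ID (anything else is a tie)."""
    if winner_model_a == 1:
        return LABEL2ID["Model A"]
    if winner_model_b == 1:
        return LABEL2ID["Model B"]
    return LABEL2ID["Tie"]


example = build_input("What is 2+2?", "4", "5")
print(example)
print(winner_to_label(1, 0))
```

Keeping the template in one function means training and inference inputs are built the same way.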

In [None]:
MODEL_NAME = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

label2id = {"Model A": 0, "Model B": 1, "Tie": 2}
id2label = {v: k for k, v in label2id.items()}

df_train["label"] = df_train["winner"].map(label2id)
df_train["label"].value_counts()

## 4. Metrics and Loss Function

### Loss Function

- For multi-class classification, the standard choice is **cross-entropy loss**.
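- In `transformers`, a model loaded with `AutoModelForSequenceClassification` computes cross-entropy loss internally when integer class `labels` are passed and `num_labels > 1`, so no separate loss function needs to be defined.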

### Metrics

- **Accuracy**: correct predictions / total samples.
- **Log Loss**: the mean negative log-probability assigned to the true class (lower is better). It is widely used on Kaggle for probabilistic predictions and penalizes confident wrong answers heavily.

In the Trainer, we compute both metrics via a custom `compute_metrics` function.
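
To make the two metrics concrete, here is a minimal pure-Python sketch that computes them by hand on made-up logits. It is not the notebook's `compute_metrics` (which uses `scikit-learn`), and unlike `sklearn.metrics.log_loss` it does not clip probabilities; the function names are assumptions for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multiclass_log_loss(labels, probs):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p[y]) for y, p in zip(labels, probs)) / len(labels)


def accuracy(labels, probs):
    """Fraction of rows whose highest-probability class equals the label."""
    preds = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return sum(int(a == b) for a, b in zip(preds, labels)) / len(labels)


# Toy logits for three samples (made-up numbers, not model output).
logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.1, 0.2, 0.9]]
labels = [0, 1, 0]
probs = [softmax(row) for row in logits]

print("accuracy:", accuracy(labels, probs))
print("log loss:", multiclass_log_loss(labels, probs))

# A uniform guess of 1/3 per class scores ln(3), a useful baseline to beat.
uniform = [[1 / 3] * 3 for _ in labels]
print("uniform baseline:", multiclass_log_loss(labels, uniform))
```

A model whose validation log loss stays near ln(3) ≈ 1.0986 is doing no better than guessing each class with equal probability.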

In [None]:
def preprocess_function(examples):
    texts = [
        f"Prompt: {p} \n Response A: {a} \n Response B: {b}"
        for p, a, b in zip(examples["prompt"], examples["response_a"], examples["response_b"])
    ]
    tokenized = tokenizer(
        texts,
        padding="max_length",
        truncation=True,
        max_length=512,
    )
    if "label" in examples:
        tokenized["labels"] = examples["label"]
    return tokenized

# Split train/validation sets
train_df, val_df = train_test_split(
    df_train,
    test_size=0.1,
    random_state=42,
    stratify=df_train["label"],
)

train_dataset = Dataset.from_pandas(train_df.reset_index(drop=True))
val_dataset = Dataset.from_pandas(val_df.reset_index(drop=True))
test_dataset = Dataset.from_pandas(df_test.reset_index(drop=True))

train_enc = train_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=train_dataset.column_names,
)
val_enc = val_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=val_dataset.column_names,
)
test_enc = test_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=test_dataset.column_names,
)

train_enc.set_format("torch")
val_enc.set_format("torch")
test_enc.set_format("torch")

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    probs = torch.softmax(torch.tensor(logits), dim=-1).numpy()
    preds = probs.argmax(axis=-1)
    acc = accuracy_score(labels, preds)
    try:
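        # Pass the full class list so log_loss works even if a class is absent from this batch of labels
        ll = log_loss(labels, probs, labels=[0, 1, 2])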
    except ValueError:
        ll = float("nan")
    return {"accuracy": acc, "log_loss": ll}

## 5. Model Building and Training

We build a 3-class model with `AutoModelForSequenceClassification`:

- `num_labels=3`
- `id2label` / `label2id` map class IDs to readable labels.

Key training parameters and what they do:
- `learning_rate`: step size; too large causes instability, too small slows convergence. Typical range: 1e-5 to 5e-5.
- `num_train_epochs`: number of full passes; higher can overfit. Monitor validation metrics as it increases.
- `per_device_train_batch_size`: batch size per GPU; limited by VRAM. Use gradient accumulation to simulate larger batches.
- `gradient_accumulation_steps`: accumulates gradients across steps; effective batch = batch_size × accumulation_steps.
- `weight_decay`: regularization to reduce overfitting; commonly 0.01.
- `fp16`: mixed precision for faster training and lower memory usage on GPUs.
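- `eval_strategy` / `save_strategy`: evaluation and checkpointing cadence. With `load_best_model_at_end=True` the two strategies must match, and with `"steps"` the value of `save_steps` must be a round multiple of `eval_steps`.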
- `logging_steps`: log interval for tracking loss.
- `save_total_limit`: limits checkpoint count to save disk space.
- `warmup_ratio` or `warmup_steps`: warms up the learning rate for stability.

Tuning tips:
- Start with fewer epochs and a smaller batch to validate the pipeline, then scale up.
- If validation loss rises, reduce epochs or increase regularization.
- When VRAM is limited, use gradient accumulation instead of a larger batch.
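
The relationship between batch size, gradient accumulation, and warmup can be worked out with simple arithmetic. The sketch below is an approximation of the step counts, not the exact bookkeeping done inside `Trainer`; `schedule_summary` is a hypothetical helper and `n_train=10_000` is a placeholder, so substitute the real training-set size.

```python
import math


def schedule_summary(n_train, per_device_batch, accumulation_steps,
                     epochs, warmup_ratio, n_devices=1):
    """Approximate optimizer-step counts implied by the batch settings."""
    effective_batch = per_device_batch * accumulation_steps * n_devices
    steps_per_epoch = math.ceil(n_train / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# n_train is a hypothetical size; replace it with the real number of training rows.
summary = schedule_summary(n_train=10_000, per_device_batch=8,
                           accumulation_steps=1, epochs=3, warmup_ratio=0.1)
print(summary)

# Halving the per-device batch and doubling accumulation keeps the
# effective batch (and therefore the step counts) unchanged.
alt = schedule_summary(n_train=10_000, per_device_batch=4,
                       accumulation_steps=2, epochs=3, warmup_ratio=0.1)
print(alt)
```

This is why gradient accumulation is a reasonable substitute for a larger batch when VRAM is tight: the optimizer sees the same effective batch, at the cost of more forward and backward passes per update.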

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=3,
    id2label=id2label,
    label2id=label2id,
).to(device)

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,

    # GPU parameters
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=1,
    fp16=torch.cuda.is_available(),  # mixed precision needs a CUDA GPU; fall back to fp32 on CPU

    # training epochs
    num_train_epochs=3,

    # optimizer parameters
    weight_decay=0.01,
    warmup_ratio=0.1,

    # evaluation&save parameters
    eval_strategy="steps", # or "epoch"
    eval_steps=100,
    save_strategy="steps",
    save_steps=200,
    save_total_limit=2,
    load_best_model_at_end=True,

    # logging&report parameters
    logging_steps=50,
    report_to="none",
    # gradient_checkpointing=True # enable gradient checkpointing
)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_enc,
    eval_dataset=val_enc,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    # callbacks=[transformers.EarlyStoppingCallback(early_stopping_patience=3)] # enable early stopping
)

trainer.train()

## 6. Validation Evaluation and Test Prediction

1. Evaluate on the validation set and report accuracy and log_loss.
2. Predict on the test set and generate `submission.csv` with:
   - `id`
   - `winner_model_a`, `winner_model_b`, `winner_tie` (predicted probabilities for each class).
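
Before uploading, it can help to sanity-check the file. The sketch below is a hypothetical `check_submission` helper built on the standard library only. It assumes the header must match the four columns listed above and that each row's probabilities should sum to 1 (which softmax output does, up to rounding); the tolerance is an arbitrary choice.

```python
import csv
import io
import math

COLUMNS = ["id", "winner_model_a", "winner_model_b", "winner_tie"]


def check_submission(csv_text, tol=1e-6):
    """Return a list of problems found in a submission CSV (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    for row in reader:
        probs = [float(row[c]) for c in COLUMNS[1:]]
        if any(p < 0 or p > 1 for p in probs):
            problems.append(f"id {row['id']}: probability outside [0, 1]")
        if not math.isclose(sum(probs), 1.0, abs_tol=tol):
            problems.append(f"id {row['id']}: probabilities sum to {sum(probs):.6f}")
    return problems


# Two tiny made-up files: one valid, one whose row sums to 1.2.
good = "id,winner_model_a,winner_model_b,winner_tie\n1,0.5,0.3,0.2\n"
bad = "id,winner_model_a,winner_model_b,winner_tie\n2,0.6,0.3,0.3\n"
print(check_submission(good))
print(check_submission(bad))
```

To check a saved file, read it first, for example `check_submission(open("submission.csv").read())`.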

In [None]:
history = pd.DataFrame(trainer.state.log_history)

train_metrics = history.loc[history["loss"].notna(), ["step", "loss"]]
eval_cols = [c for c in ["eval_loss", "eval_accuracy", "eval_log_loss", "eval_runtime", "eval_samples_per_second", "eval_steps_per_second"] if c in history.columns]
eval_metrics = history.loc[history["eval_loss"].notna(), ["step"] + eval_cols]

train_metrics = train_metrics.rename(columns={"loss": "Training Loss"})
eval_metrics = eval_metrics.rename(columns={
    "eval_loss": "Validation Loss",
    "eval_accuracy": "Accuracy",
    "eval_log_loss": "Log Loss",
    "eval_runtime": "Runtime",
    "eval_samples_per_second": "Samples Per Second",
    "eval_steps_per_second": "Steps Per Second",
})

metrics_table = eval_metrics.merge(train_metrics, on="step", how="left")
metrics_table = metrics_table.rename(columns={"step": "Step"})
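# Keep only the columns that were actually logged, in a fixed display order
ordered_cols = ["Step", "Training Loss", "Validation Loss", "Accuracy", "Log Loss", "Runtime", "Samples Per Second", "Steps Per Second"]
metrics_table = metrics_table[[c for c in ordered_cols if c in metrics_table.columns]]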

metrics_table

In [None]:
plt.figure(figsize=(10, 4))
plt.plot(metrics_table["Step"], metrics_table["Training Loss"], label="Training Loss")
plt.plot(metrics_table["Step"], metrics_table["Validation Loss"], label="Validation Loss")
plt.xlabel("Step")
plt.ylabel("Loss")
plt.legend()
plt.tight_layout()
plt.show()

plt.figure(figsize=(10, 4))
plt.plot(metrics_table["Step"], metrics_table["Accuracy"], label="Accuracy")
plt.plot(metrics_table["Step"], metrics_table["Log Loss"], label="Log Loss")
plt.xlabel("Step")
plt.ylabel("Metric")
plt.legend()
plt.tight_layout()
plt.show()


In [None]:
eval_results = trainer.evaluate()
print("Evaluation results:")
print(eval_results)

In [None]:
predictions = trainer.predict(test_enc)
probs = torch.softmax(torch.tensor(predictions.predictions), dim=-1).numpy()

submission = pd.DataFrame({
    "id": df_test["id"],
    "winner_model_a": probs[:, 0],
    "winner_model_b": probs[:, 1],
    "winner_tie": probs[:, 2],
})

submission_path = "submission.csv"
submission.to_csv(submission_path, index=False)
print(f"Submission file saved to {submission_path}")
submission.head()

## 7. Summary

This notebook demonstrates:

- How to load and visualize Kaggle LLM Classification Finetuning `train.csv` / `test.csv` data
- How to combine three text fields (prompt, response_a, response_b) into one model input
- How to fine-tune a 3-class DistilBERT model with cross-entropy loss
- How to evaluate on the validation set (accuracy, log_loss)
- How to predict the test set and generate a submission file in the required format

You can extend this by:

- Trying a larger model or incorporating `model_a`/`model_b` as features
- Improving the input construction (e.g., encode A/B separately then compare)
- Adding richer visualizations and error analysis