<center><img src="https://keras.io/img/logo-small.png" alt="Keras logo" width="100"><br/>
This starter notebook is provided by the Keras team.</center>

# Getting Started on LLM Classification Fine Tuning with [KerasNLP](https://github.com/keras-team/keras-nlp) and [Keras](https://github.com/keras-team/keras)

<div align="center">
    <img src="https://i.ibb.co/wJMF5HL/lmsys.png">
</div>

In this competition, our aim is to predict which LLM responses users will prefer in a head-to-head battle between chatbots powered by large language models (LLMs). In other words, the goal of the competition is to predict the preferences of the judges and determine the likelihood that a given prompt/response pair is selected as the winner. This notebook will guide you through the process of fine-tuning the **DebertaV3** model for this competition using the **Shared Weight** strategy with KerasNLP. This strategy is similar to how Multiple Choice Question (MCQ) models are trained. Additionally, we will use mixed precision for faster training and inference.

**Did you know**: This notebook is backend-agnostic, which means it supports TensorFlow, PyTorch, and JAX backends. However, the best performance can be achieved with `JAX`. KerasNLP and Keras enable the choice of the preferred backend. Explore further details on [Keras](https://keras.io/keras_3/).

**Note**: For a deeper understanding of KerasNLP, refer to the [KerasNLP guides](https://keras.io/keras_nlp/).


# 📚 | Import Libraries 

In [None]:
import os
os.environ["KERAS_BACKEND"] = "tensorflow"  # or "jax" or "torch"

import keras_nlp
import keras
import tensorflow as tf
#import tensorflow_addons as tfa

import random
import numpy as np 
import pandas as pd
from tqdm import tqdm
import json

import matplotlib.pyplot as plt
import matplotlib as mpl
import plotly.express as px

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

In [None]:
import wandb

# Put this line at the very top of the program and replace YOUR_API_KEY with your actual key
key="YOUR_API_KEY"

In [None]:
#!pip install --upgrade wandb  # make sure the latest version is used
#from wandb.keras import WandbCallback

#wandb.login(key=key)
#wandb.init(project="kerasnlp-training", name="your-run-name")  # custom project and run names

## Library Version

# ⚙️ | Configuration

In [None]:
class CFG:
    seed = 42  # Random seed

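    # Model and input
    preset = "deberta_v3_extra_small_en"
    sequence_length = 512

    # Total epochs across the two stages = head_only_epochs + fine_tune_epochs
    head_only_epochs = 3        # first 3 epochs train only the head
    fine_tune_epochs   = 5      # then 5 epochs fine-tune the backbone
    # (You can also set epochs = head_only_epochs + fine_tune_epochs = 8 and delete the epochs field)

    # Data and batching
    batch_size = 16

    # Optimizer & schedule
    learning_rate = 5e-6
    weight_decay  = 1e-4
    # A larger warmup ratio lets the first steps ramp up to lr_max more stably
    warmup_ratio = 0.2

    # Regularization
    # The head has an extra Dense+Dropout layer, so the dropout rate is raised to 0.3 to resist overfitting
    dropout_rate = 0.3

    # Fine-tuning strategy: how many of the last backbone layers to unfreeze
    # Previously 4 layers; unfreezing more (e.g. 12) is worth trying
    unfrozen_backbone_layers = 12

    # Label mapping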
    label2name = {0: 'winner_model_a', 1: 'winner_model_b', 2: 'winner_tie'}
    name2label = {v:k for k, v in label2name.items()}
    class_labels = list(label2name.keys())
    class_names  = list(label2name.values())

# ♻️ | Reproducibility 
Set the random seed so that each run produces comparable results.

In [None]:
tf.keras.utils.set_random_seed(CFG.seed)
np.random.seed(CFG.seed)
random.seed(CFG.seed)

# 🧮 | Mixed Precision

In this notebook, we will use mixed precision instead of float32 precision for training and inference to reduce GPU memory usage. This will ultimately allow us to use larger batch sizes, thus reducing our training and inference time.
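
To make the trade-off concrete, here is a minimal standard-library sketch (it is not part of the training pipeline). It shows that a float16 value takes half the storage of a float32 value, and that very small numbers underflow to zero in float16. The helper name `to_float16` and the value `1e-8` are illustrative assumptions.

```python
import struct

def to_float16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("e", struct.pack("e", x))[0]

bytes_fp16 = struct.calcsize("e")  # bytes per half-precision value
bytes_fp32 = struct.calcsize("f")  # bytes per single-precision value

# A tiny quantity, e.g. a small learning rate multiplied by a small gradient.
tiny_update = 1e-8
rounded = to_float16(tiny_update)

print(bytes_fp16, bytes_fp32)  # storage per value in bytes
print(rounded)                 # the tiny value does not survive float16
```

This is why the `mixed_float16` policy computes in float16 but keeps the model variables in float32, and why the final `softmax` layer in this notebook is explicitly given `dtype="float32"`.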

In [None]:
from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy("mixed_float16")

# 📁 | Dataset Path 

In [None]:
BASE_PATH = '/kaggle/input/llm-classification-finetuning'

# 📖 | Meta Data 

The competition dataset comprises user interactions from the ChatBot Arena. In each interaction, a judge presents one or more prompts to two different large language models and then indicates which model provided the more satisfactory response. The training data contains `55,000` rows, with an expected `25,000` rows in the test set.

## Files

### `train.csv`
- `id`: Unique identifier for each row.
- `model_[a/b]`: Model identity, present in train.csv but not in test.csv.
- `prompt`: Input prompt given to both models.
- `response_[a/b]`: Model_[a/b]'s response to the prompt.
- `winner_model_[a/b/tie]`: Binary columns indicating the judge's selection (ground truth target).

### `test.csv`
- `id`: Unique identifier for each row.
- `prompt`: Input prompt given to both models.
- `response_[a/b]`: Model_[a/b]'s response to the prompt.

> Note that each interaction may have multiple prompts and responses, but this notebook will use only **one prompt per interaction**. You can choose to use all prompts and responses. Additionally, prompts and responses in the dataframe are provided as string-formatted lists, so they need to be parsed into actual lists. The original starter code used `eval()` for this; the code below uses `json.loads` instead.


## Train Data
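
Because each prompt and response column stores a string-formatted list, it helps to see the parsing step in isolation. The sketch below is a simplified stand-in for the loading code, assuming the strings are valid JSON; the helper name `first_turn` and the sample strings are hypothetical.

```python
import json

def first_turn(raw: str) -> str:
    """Return the first element of a JSON-encoded list, or "" if unusable."""
    try:
        items = json.loads(raw)
    except (TypeError, json.JSONDecodeError):
        return ""
    # JSON null becomes None in Python, so only accept a real string.
    if isinstance(items, list) and items and isinstance(items[0], str):
        return items[0]
    return ""

print(first_turn('["What is Keras?", "And KerasNLP?"]'))
print(repr(first_turn('[null, "second reply"]')))
print(repr(first_turn("not a list")))
```

Unlike `eval()`, `json.loads` never executes the text it parses, and it understands `null` without any string replacement.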

In [None]:
''''# Load Train Data
df = pd.read_csv(f'{BASE_PATH}/train.csv') 

# Sample data
# df = df.sample(frac=0.10)

# Take the first prompt and its associated response
df["prompt"] = df.prompt.map(lambda x: eval(x)[0])
df["response_a"] = df.response_a.map(lambda x: eval(x.replace("null","''"))[0])
df["response_b"] = df.response_b.map(lambda x: eval(x.replace("null", "''"))[0])

# Label conversion
df["class_name"] = df[["winner_model_a", "winner_model_b" , "winner_tie"]].idxmax(axis=1)
df["class_label"] = df.class_name.map(CFG.name2label)

# Show Sample
df.head()
'''
import json
from pathlib import Path

# 1. Build the path
BASE_PATH = Path(BASE_PATH)  # converts BASE_PATH to a Path if it was a str
train_file = BASE_PATH / "train.csv"
if not train_file.exists():
    raise FileNotFoundError(f"Training file not found: {train_file}")

# 2. Load the file and shuffle the rows
df = pd.read_csv(train_file)
df = df.sample(frac=1, random_state=CFG.seed).reset_index(drop=True)

# 3. Safely parse the JSON string and take the first element
def safe_first_element(json_str: str) -> str:
    if not isinstance(json_str, str) or len(json_str) == 0:
        return ""
    try:
        arr = json.loads(json_str)
    except json.JSONDecodeError:
        return ""
    # JSON null parses to None; treat it (and any other non-string) as ""
    if isinstance(arr, list) and len(arr) > 0 and isinstance(arr[0], str):
        return arr[0]
    return ""

df["prompt"]     = df["prompt"].apply(safe_first_element)
df["response_a"] = df["response_a"].apply(safe_first_element)
df["response_b"] = df["response_b"].apply(safe_first_element)

# 4. Label conversion (same as the original code)
df["class_name"]  = df[["winner_model_a","winner_model_b","winner_tie"]].idxmax(axis=1)
df["class_label"] = df["class_name"].map(CFG.name2label)

# 5. Inspect the result
df.head()


In [None]:
''''# Check for missing values
missing_values = df.isnull().sum()
print("Missing value counts:")
print(missing_values)

# To list only the columns that have missing values
missing_columns = missing_values[missing_values > 0]
print("\nColumns with missing values:")
print(missing_columns)
'''
# 1. Count missing values
missing_values = df.isnull().sum()
missing_columns = missing_values[missing_values > 0]

print("Missing value counts:")
print(missing_values)

if not missing_columns.empty:
    print("\nColumns with missing values:")
    print(missing_columns)

# 2. Interactive display of the missing-value table (sortable / filterable)
#from ace_tools import display_dataframe_to_user

missing_df = (
    missing_values
    .rename("missing_count")
    .to_frame()
    .assign(missing_pct=lambda d: d["missing_count"] / len(df) * 100)
    .reset_index()
    .rename(columns={"index": "column"})
    .query("missing_count > 0")
)

#display_dataframe_to_user("Columns with missing values", missing_df)
print("\nColumns with missing values (overview):")
print(missing_df)

# 3. Handle missing values
#    (a) For the key columns, drop rows that contain NaN
df = df.dropna(subset=["prompt", "response_a", "response_b"]).reset_index(drop=True)

#    (b) Fill the remaining columns with empty strings
df = df.fillna("")

# 4. Finally, inspect the first few processed rows
print(df.head())


## Test Data

In [None]:
'''# Load Test Data
test_df = pd.read_csv(f'{BASE_PATH}/test.csv')

# Take the first prompt and response
test_df["prompt"] = test_df.prompt.map(lambda x: eval(x)[0])
test_df["response_a"] = test_df.response_a.map(lambda x: eval(x.replace("null","''"))[0])
test_df["response_b"] = test_df.response_b.map(lambda x: eval(x.replace("null", "''"))[0])

# Show Sample
test_df.head()
'''
import json
from pathlib import Path

# 1. Make sure BASE_PATH is a Path object
BASE_PATH = Path(BASE_PATH)
test_file = BASE_PATH / "test.csv"
if not test_file.exists():
    raise FileNotFoundError(f"Test file not found: {test_file}")

# 2. Load the test data
test_df = pd.read_csv(test_file)

# 3. Reuse safe_first_element, defined in the Train Data cell,
#    so train and test are parsed identically

# 4. Process the prompt / response columns
test_df["prompt"]     = test_df["prompt"].apply(safe_first_element)
test_df["response_a"] = test_df["response_a"].apply(safe_first_element)
test_df["response_b"] = test_df["response_b"].apply(safe_first_element)

# 5. Inspect the first few rows
print(test_df.head())


### Alec: Inspect the Data

In [None]:
# 1. Compute the winner
winner = df[['winner_model_a','winner_model_b','winner_tie']].idxmax(axis=1).map({
    'winner_model_a':'A_win',
    'winner_model_b':'B_win',
    'winner_tie':'Tie'
})
# 2. Compute the proportions and sort by label
counts = winner.value_counts(normalize=True).sort_index()

# 3. Plot
import matplotlib.pyplot as plt

plt.figure()
ax = counts.plot.bar(title='Winner Distribution')
ax.set_ylabel('Proportion')

# 4. Annotate each bar with its percentage
for patch in ax.patches:
    height = patch.get_height()
    ax.annotate(f"{height:.1%}",
                (patch.get_x() + patch.get_width() / 2, height),
                ha='center', va='bottom')

plt.show()

# 5. Print the values
print(counts)

In [None]:
pair_stats = (
    df
    .groupby(['model_a','model_b'])
    .agg(
        total=('id','size'),
        a_wins=('winner_model_a','sum'),
        b_wins=('winner_model_b','sum'),
        ties=('winner_tie','sum'),
    )
    .assign(
        a_win_rate=lambda d: (d['a_wins']/d['total']).round(3),
        b_win_rate=lambda d: (d['b_wins']/d['total']).round(3),
        tie_rate=lambda d: (d['ties']/d['total']).round(3),
    )
    .reset_index()
    .sort_values('total', ascending=False)
)

display(pair_stats.head(10))


## Contextualize Response with Prompt

In our approach, we will contextualize each response with the prompt instead of using a single prompt for all responses. This means that for each response, we will provide the model with the same set of prompts combined with their respective response (e.g., `(P + R_A)`, `(P + R_B)`, etc.). This approach is similar to the multiple-choice question task in NLP.

> Note that some prompts and responses may contain characters that cannot be encoded as `utf-8`, resulting in errors when creating the dataloader. In such cases, the code below drops the offending characters, and any non-string value is replaced with an empty string.
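
As a small illustration of the encoding problem, the sketch below builds a string containing a lone surrogate, which Python can hold in memory but cannot encode as UTF-8, and then cleans it with the same encode/decode round trip used in `make_pairs`. The helper name `strip_unencodable` and the sample text are hypothetical.

```python
def strip_unencodable(text: str) -> str:
    """Drop any characters that cannot be encoded as UTF-8."""
    return text.encode("utf-8", errors="ignore").decode("utf-8", errors="ignore")

broken = "Hello \ud83d world"  # lone surrogate: not encodable as UTF-8

try:
    broken.encode("utf-8")
    strict_encoding_failed = False
except UnicodeEncodeError:
    strict_encoding_failed = True

clean = strip_unencodable(broken)
print(strict_encoding_failed)
print(repr(clean))
```

Only the unencodable character is removed; ordinary non-ASCII text such as accented letters passes through unchanged.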


In [None]:
# Define a function to create options based on the prompt and choices
def make_pairs(row):
    def clean_text(text: object) -> str:
        """
        Return "" for non-string input; otherwise drop characters that
        cannot be encoded as UTF-8.
        """
        if not isinstance(text, str):
            return ""
        return text.encode("utf-8", errors="ignore") \
                   .decode("utf-8", errors="ignore")

    columns = ["prompt", "response_a", "response_b"]
    raw = {col: row.get(col, "") for col in columns}

    # Get the cleaned prompt / responses in one pass
    cleaned = {col: clean_text(value) for col, value in raw.items()}
    prompt     = cleaned["prompt"]
    response_a = cleaned["response_a"]
    response_b = cleaned["response_b"]

    # Flag rows where a field was not a string or lost characters during cleaning
    row["encode_fail"] = any(cleaned[col] != raw[col] for col in columns)

    # Build the options
    row["options"] = [
        f"Prompt: {prompt}\n\nResponse A: {response_a}",
        f"Prompt: {prompt}\n\nResponse B: {response_b}"
    ]
    return row

In [None]:
df = df.apply(make_pairs, axis=1)  # Apply the make_pairs function to each row in df
display(df.head(2))  # Display the first 2 rows of df

test_df = test_df.apply(make_pairs, axis=1)  # Apply the make_pairs function to each row in test_df
display(test_df.head(2))  # Display the first 2 rows of test_df

## Encoding Fail Statistics

Let's examine how many samples have encoding issues. From the code below, we can see that only $1\%$ of the samples failed to be encoded, while $99\%$ of the samples don't have any issues. A similar pattern can be expected for the test data as well. Thus, considering empty strings for this small portion of the data will not have much impact on our training and inference.

In [None]:
df.encode_fail.value_counts(normalize=False)

# 🎨 | Exploratory Data Analysis (EDA)

## LLM Distribution

In [None]:
import plotly.io as pio
pio.renderers.default = 'notebook_connected'

In [None]:
import plotly.express as px

# 1. Combine both model columns and count occurrences
llm_series = pd.concat([df.model_a, df.model_b], ignore_index=True)
llm_counts = (
    llm_series
    .value_counts()                      # Series: index is the LLM, value is its count
    .rename_axis("LLM")                  # name the index "LLM"
    .reset_index(name="Count")           # convert to a DataFrame with a "Count" column
)
# 2. Sort in descending order
llm_counts = llm_counts.sort_values(by="Count", ascending=False)
print(llm_counts.columns)

# 3. Plot
plt.figure(figsize=(12, 6))
bars = plt.bar(llm_counts["LLM"], llm_counts["Count"])
for bar in bars:
    h = bar.get_height()
    plt.text(
        bar.get_x() + bar.get_width() / 2,
        h,
        f"{int(h)}",
        ha="center",
        va="bottom",
        fontsize=9
    )

# 4. Rotate the x-axis labels to avoid overlap
plt.xticks(rotation=45, ha="right")

# 5. Add the title and axis labels
plt.title("Distribution of LLMs")
plt.xlabel("LLM")
plt.ylabel("Count")

# 6. Adjust the layout automatically
plt.tight_layout()

# 7. Show the figure
plt.show()


## Winning Distribution

In [None]:
# 1. Compute the counts and reshape them into a DataFrame
counts = (
    df["class_name"]
    .value_counts()                      # Series: index=class_name, value=frequency
    .rename_axis("Winner")               # rename the index to Winner
    .reset_index(name="Win Count")       # turn the values into a Win Count column
)
print("Columns in counts:", counts.columns.tolist())

# 2. Sort by Win Count in descending order (optional)
counts_sorted = counts.sort_values(by="Win Count", ascending=False).reset_index(drop=True)

# 3. Plot
plt.figure(figsize=(8, 5))
bars = plt.bar(counts_sorted["Winner"], counts_sorted["Win Count"])

# 4. Annotate the value on top of each bar
for bar in bars:
    h = bar.get_height()
    plt.text(
        bar.get_x() + bar.get_width() / 2,
        h,
        f"{int(h)}",
        ha="center",
        va="bottom",
        fontsize=9
    )

# 5. Rotate the x-axis labels to avoid overlap
plt.xticks(rotation=45, ha="right")

# 6. Add the title and axis labels
plt.title("Winner Distribution for Train Data")
plt.xlabel("Winner")
plt.ylabel("Win Count")

# 7. Adjust the layout automatically and show the figure
plt.tight_layout()
plt.show()








# 🔪 | Data Split

In the code below, we split the data into training and validation sets, stratified by the `class_label` column so that both sets keep the same class proportions.
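
To show what stratification means, here is a simplified pure-Python stand-in. It is not scikit-learn's implementation, only a sketch of the idea: split each class separately so the validation set keeps the class mix. The function name `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, valid_idx) with roughly equal class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_valid = round(len(indices) * test_size)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return train_idx, valid_idx

# Toy labels: 0 = model A wins, 1 = model B wins, 2 = tie.
labels = [0] * 50 + [1] * 30 + [2] * 20
train_idx, valid_idx = stratified_split(labels)
valid_counts = Counter(labels[i] for i in valid_idx)

print(len(train_idx), len(valid_idx))
print(sorted(valid_counts.items()))
```

Each class contributes 20% of its rows to the validation set, so the 50/30/20 mix of the toy data is preserved in both splits.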

In [None]:
from sklearn.model_selection import train_test_split

train_df, valid_df = train_test_split(
    df,
    test_size=0.2,
    stratify=df["class_label"],
    shuffle=True,
    random_state=CFG.seed,    # ensures the split is identical on every run
)


# 🍽️ | Preprocessing

**What it does:** The preprocessor takes input strings and transforms them into a dictionary (`token_ids`, `padding_mask`) containing preprocessed tensors. This process starts with tokenization, where input strings are converted into sequences of token IDs.

**Why it's important:** Raw text is high-dimensional and hard to model directly. Tokenization converts it into a compact sequence of subword tokens, for example splitting `"The quick brown fox"` into `["the", "qu", "##ick", "br", "##own", "fox"]` (an illustrative WordPiece-style split; DebertaV3 uses a SentencePiece tokenizer, so its actual pieces differ). Many models also rely on special tokens and additional tensors to understand input; these mark segment boundaries and identify padding, among other tasks. Padding all sequences to the same length lets them be batched efficiently.
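
The padding step can be sketched in a few lines of plain Python. This is a simplified illustration, not the KerasNLP preprocessor: the token IDs are made up, `pad_id=0` is an assumption, and real preprocessors also handle special start and end tokens when truncating.

```python
def pad_or_truncate(token_ids, sequence_length, pad_id=0):
    """Return (token_ids, padding_mask), both exactly sequence_length long."""
    kept = token_ids[:sequence_length]
    n_pad = sequence_length - len(kept)
    padded = kept + [pad_id] * n_pad
    # True marks a real token, False marks padding.
    mask = [True] * len(kept) + [False] * n_pad
    return padded, mask

ids, mask = pad_or_truncate([1, 57, 902, 2], sequence_length=6)
print(ids)
print(mask)

long_ids, long_mask = pad_or_truncate(list(range(10)), sequence_length=6)
print(long_ids)
```

The mask lets the model ignore the padded positions, which is the role `padding_mask` plays in the preprocessor output.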

Explore the following pages to access the available preprocessing and tokenizer layers in **KerasNLP**:
- [Preprocessing](https://keras.io/api/keras_nlp/preprocessing_layers/)
- [Tokenizers](https://keras.io/api/keras_nlp/tokenizers/)

In [None]:
preprocessor = keras_nlp.models.DebertaV3Preprocessor.from_preset(
    preset=CFG.preset, # Name of the model
    sequence_length=CFG.sequence_length, # Max sequence length, will be padded if shorter
)

Now, let's examine what the output shape of the preprocessing layer looks like. The output shape of the layer can be represented as $(num\_responses, sequence\_length)$.

In [None]:
outs = preprocessor(df.options.iloc[0])  # Process options for the first row

# Display the shape of each processed output
for k, v in outs.items():
    print(k, ":", v.shape)

We'll use `preprocess_fn_train` and `preprocess_fn_test` to transform each pair of text options via the `dataset.map(...)` method.

In [None]:
def _tokenize(options):
    """
    Tokenize one (P + R_A, P + R_B) pair.

    Inside `dataset.map`, `options` is a string tensor of shape (2,), not a
    Python list, so no list checks are done here. The preprocessor's
    `token_ids` / `padding_mask` outputs are renamed to `input_ids` /
    `attention_mask` to match the model's input names.
    """
    encoded = preprocessor(options)
    return {
        "input_ids": tf.cast(encoded["token_ids"], tf.int32),
        "attention_mask": tf.cast(encoded["padding_mask"], tf.int32),
    }


def preprocess_fn_train(options, class_label):
    """
    For training/validation: takes the options pair and its class_label,
    returns (model_inputs_dict, class_label).
    """
    return _tokenize(options), class_label


def preprocess_fn_test(options):
    """
    For testing/inference: takes only the options pair, returns model_inputs_dict.
    """
    return _tokenize(options)


# 🍚 | DataLoader

The code below sets up the data pipeline using `tf.data.Dataset`. `tf.data` lets us build the pipeline as a chain of simple steps: slicing, caching, shuffling, mapping, batching, and prefetching.

To learn more about `tf.data`, refer to this [documentation](https://www.tensorflow.org/guide/data).

In [None]:
def build_dataset(options, labels=None, batch_size=32,
                  cache=True, shuffle_size=1024):
    AUTO = tf.data.AUTOTUNE  # AUTOTUNE option
    # 1. Build the raw dataset and pick the matching map function
    if labels is not None:
        ds = tf.data.Dataset.from_tensor_slices((options, labels))
        map_fn = preprocess_fn_train
    else:
        ds = tf.data.Dataset.from_tensor_slices(options)
        map_fn = preprocess_fn_test

    # 2. Cache (optional); done before shuffling so that the cached data
    #    is reshuffled on every epoch instead of freezing one order
    if cache:
        ds = ds.cache()

    # 3. Shuffle (only when labels are given and shuffle_size > 0)
    if labels is not None and shuffle_size > 0:
        ds = ds.shuffle(shuffle_size, seed=CFG.seed, reshuffle_each_iteration=True)

    # 4. Map -> preprocessing
    ds = ds.map(map_fn, num_parallel_calls=AUTO)

    # 5. Batch & prefetch
    ds = ds.batch(batch_size, drop_remainder=False)
    ds = ds.prefetch(AUTO)

    return ds

## Build Train/Valid Dataloader

In [None]:
# Train
train_texts = train_df["options"].tolist()
train_labels = train_df["class_label"].tolist()
train_ds = build_dataset(
    train_texts,
    train_labels,
    batch_size=CFG.batch_size,
    cache=True,       # cache the training set to speed up repeated epochs
    shuffle_size=1024      # shuffle buffer size
)

# Valid
valid_texts = valid_df["options"].tolist()
valid_labels = valid_df["class_label"].tolist()
valid_ds = build_dataset(
    valid_texts,
    valid_labels,
    batch_size=CFG.batch_size,
    cache=False,      # the validation set does not need caching
    shuffle_size=0         # no shuffling
)


# ⚓ | LR Schedule

Implementing a learning rate scheduler is crucial for transfer learning. The learning rate starts at `lr_start`, ramps up to `lr_max`, and then tapers down to `lr_min`. The active code in this notebook uses a linear warmup followed by cosine decay; common decay techniques include:
- `step`: Lowering the learning rate in step-wise manner resembling stairs.
- `cos`: Utilizing a cosine curve to gradually reduce the learning rate.
- `exp`: Exponentially decreasing the learning rate.

**Importance:** A well-structured learning rate schedule is essential for efficient model training, ensuring optimal convergence and avoiding issues such as overshooting or stagnation.
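
The warmup-then-cosine shape can be sketched as a plain Python function of the training step. This is an illustration only, separate from the Keras schedule objects used in the notebook; the function name `warmup_cosine_lr` and the example values (`lr_min=5e-7`, 100 warmup steps, 1000 total steps) are assumptions.

```python
import math

def warmup_cosine_lr(step, lr_max, lr_min, warmup_steps, total_steps):
    """Linear warmup from 0 to lr_max, then cosine decay down to lr_min."""
    if step < warmup_steps:
        return lr_max * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return lr_min + (lr_max - lr_min) * cosine

lr_max, lr_min = 5e-6, 5e-7
for step in (0, 50, 100, 550, 1000):
    lr = warmup_cosine_lr(step, lr_max, lr_min, warmup_steps=100, total_steps=1000)
    print(step, f"{lr:.2e}")
```

Note that `lr_min` must be smaller than `lr_max`; otherwise the "decay" phase would raise the learning rate instead of lowering it.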

In [None]:
'''
import math

def get_lr_callback(batch_size=8, mode='cos', epochs=10, plot=False):
    lr_start, lr_max, lr_min = 1.0e-6, 0.6e-6 * batch_size, 1e-6
    lr_ramp_ep, lr_sus_ep, lr_decay = 2, 0, 0.8

    def lrfn(epoch):  # Learning rate update function
        if epoch < lr_ramp_ep: lr = (lr_max - lr_start) / lr_ramp_ep * epoch + lr_start
        elif epoch < lr_ramp_ep + lr_sus_ep: lr = lr_max
        elif mode == 'exp': lr = (lr_max - lr_min) * lr_decay**(epoch - lr_ramp_ep - lr_sus_ep) + lr_min
        elif mode == 'step': lr = lr_max * lr_decay**((epoch - lr_ramp_ep - lr_sus_ep) // 2)
        elif mode == 'cos':
            decay_total_epochs, decay_epoch_index = epochs - lr_ramp_ep - lr_sus_ep + 3, epoch - lr_ramp_ep - lr_sus_ep
            phase = math.pi * decay_epoch_index / decay_total_epochs
            lr = (lr_max - lr_min) * 0.5 * (1 + math.cos(phase)) + lr_min
        return lr

    if plot:  # Plot lr curve if plot is True
        plt.figure(figsize=(10, 5))
        plt.plot(np.arange(epochs), [lrfn(epoch) for epoch in np.arange(epochs)], marker='o')
        plt.xlabel('epoch'); plt.ylabel('lr')
        plt.title('LR Scheduler')
        plt.show()

    return keras.callbacks.LearningRateScheduler(lrfn, verbose=False)  # Create lr callback
'''
import tensorflow as tf

def get_cosine_warmup_schedule(steps_per_epoch: int, lr_min: float = 1e-7):
    total_steps   = (CFG.head_only_epochs + CFG.fine_tune_epochs) * steps_per_epoch
    warmup_steps  = int(total_steps * CFG.warmup_ratio)
    # Linear warmup from 0 to CFG.learning_rate, then cosine decay over the
    # remaining steps down to lr_min.
    # Assumption: lr_min defaults to 1e-7 (the same floor as min_lr in
    # ReduceLROnPlateau). CFG.weight_decay is not used as the floor because
    # it is larger than CFG.learning_rate.
    return tf.keras.optimizers.schedules.CosineDecay(
        initial_learning_rate=0.0,
        decay_steps=total_steps - warmup_steps,
        alpha=lr_min / CFG.learning_rate,  # alpha = lr_min / lr_max
        warmup_target=CFG.learning_rate,
        warmup_steps=warmup_steps,
    )




In [None]:
#lr_cb = get_lr_callback(CFG.batch_size, plot=True)
from tensorflow.keras.optimizers import AdamW


steps_per_epoch = len(train_df) // CFG.batch_size
lr_fn = get_cosine_warmup_schedule(steps_per_epoch)

# Use the built-in AdamW
optimizer = AdamW(
    learning_rate=lr_fn,
    weight_decay=CFG.weight_decay
)
'''
model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.05),
    metrics=[tf.keras.metrics.CategoricalAccuracy(name="accuracy")],
)
'''

# 💾 | Model Checkpointing

The following code will create a callback that will save the best checkpoint of the model during training, which we will use for inference in the submission.

In [None]:
# Save checkpoints to the project's checkpoints directory; make sure it exists first
ckpt_dir = Path("checkpoints")
ckpt_dir.mkdir(exist_ok=True)

ckpt_cb = keras.callbacks.ModelCheckpoint(
    filepath=str(ckpt_dir / "best_model.{epoch:02d}-{val_loss:.4f}.weights.h5"),
    monitor="val_loss",             # must match the loss name used at compile time
    save_best_only=True,
    save_weights_only=True,          # save weights only
    mode="min",                      # lower loss is better
    verbose=1                        # print a message when saving
)

# 📏 | Metric

The metric for this competition is **Log Loss**. For a problem with $C$ classes (here $C = 3$), this metric can be expressed mathematically as,

$$
\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic} \log(p_{ic})
$$

where $N$ is the number of samples, $y_{ic}$ is 1 if sample $i$ belongs to class $c$ and 0 otherwise, and $p_{ic}$ is the predicted probability that sample $i$ belongs to class $c$. For $C = 2$ this reduces to the familiar binary form $-\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right)$.

Note that this metric is the same quantity as the categorical cross-entropy widely used in classification tasks. Thus, we don't need to implement it from scratch: Keras already provides it, and we monitor the cross-entropy validation loss as a proxy for the competition score.
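
As a worked example, the multi-class log loss can be computed by hand in a few lines of Python. The function name `multiclass_log_loss`, the clipping constant `eps`, and the sample predictions are illustrative assumptions; labels are integer class indices as in `class_label`.

```python
import math

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip to avoid log(0) for overconfident wrong predictions.
        p = min(max(probs[label], eps), 1.0 - eps)
        total += -math.log(p)
    return total / len(y_true)

y_true = [0, 1, 2]
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3
confident = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]

print(round(multiclass_log_loss(y_true, uniform), 4))
print(round(multiclass_log_loss(y_true, confident), 4))
```

A model that always predicts one third for every class scores $\ln 3 \approx 1.0986$, which is a useful baseline: any validation loss above that is worse than guessing uniformly.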


In [None]:
#log_loss = keras.metrics.CategoricalCrossentropy(name="log_loss")

# 🤖 | Modeling

The `KerasNLP` library provides various NLP model architectures such as `Bert`, `Roberta`, `DebertaV3`, and more. While this notebook focuses on `DebertaV3`, you can explore others in the [KerasNLP documentation](https://keras.io/api/keras_nlp/models/). For a deeper understanding, refer to the [getting started guide](https://keras.io/guides/keras_nlp/getting_started/).

Our approach uses `keras_nlp.models.DebertaV3Backbone` to process each prompt and response pair, generating output embeddings. We then concatenate these embeddings and pass them through a pooling layer and a classifier head with a `softmax` activation to obtain the final class probabilities.

When dealing with multiple responses, we use a weight-sharing strategy. This means we provide the model with one response at a time along with the prompt `(P + R_A)`, `(P + R_B)`, etc., using the same model weights for all responses. After obtaining embeddings for all responses, we concatenate them and apply average pooling. Next, we use a `Linear/Dense` layer along with the `Softmax` function as the classifier for the final result. Providing all responses at once would increase the text length and make the input harder for the model to handle. Note that, in the classifier, we use 3 classes for the `winner_model_a`, `winner_model_b`, and `winner_tie` cases.

The diagram below illustrates this approach:

<div align="center">
    <img src="https://i.postimg.cc/g0gcvy3f/Kaggle-drawio.png">
</div>

From a coding perspective, note that we use the same model for all responses with shared weights, contrary to the separate models implied in the diagram.
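
The weight-sharing idea can be illustrated without any deep learning library. In the toy sketch below, `toy_backbone` is a hypothetical stand-in for the real backbone (it just maps each token ID to a 2-dimensional vector), and the token IDs are made up. What matters is that the same function, with the same weight, encodes both pairs before concatenation and average pooling.

```python
def toy_backbone(token_ids, weight=0.5):
    """Stand-in encoder: one 2-dimensional vector per token."""
    return [[weight * t, weight * t * t] for t in token_ids]

def mean_pool(sequence):
    """Average the vectors over the sequence axis."""
    n = len(sequence)
    hidden = len(sequence[0])
    return [sum(vec[j] for vec in sequence) / n for j in range(hidden)]

pair_a = [1, 2, 3]  # made-up token ids for (P + R_A)
pair_b = [4, 5, 6]  # made-up token ids for (P + R_B)

embed_a = toy_backbone(pair_a)  # same function, same weight
embed_b = toy_backbone(pair_b)

joined = embed_a + embed_b      # concatenate along the sequence axis
pooled = mean_pool(joined)      # one fixed-size vector per sample

print(len(joined), len(pooled))
print(pooled)
```

The pooled vector has a fixed size regardless of sequence length, so a single dense classifier can map it to the three class probabilities.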

In [None]:
# Define input layers
'''
inputs = {
    "token_ids": keras.Input(shape=(2, None), dtype=tf.int32, name="token_ids"),
    "padding_mask": keras.Input(shape=(2, None), dtype=tf.int32, name="padding_mask"),
}
# Create a DebertaV3Classifier backbone
backbone = keras_nlp.models.DebertaV3Backbone.from_preset(
    CFG.preset,
)

# Compute embeddings for first response: (P + R_A) using backbone
response_a = {k: v[:, 0, :] for k, v in inputs.items()}
embed_a = backbone(response_a)

# Compute embeddings for second response: (P + R_B), using the same backbone
response_b = {k: v[:, 1, :] for k, v in inputs.items()}
embed_b = backbone(response_b)

# Compute final output
embeds = keras.layers.Concatenate(axis=-1)([embed_a, embed_b])
embeds = keras.layers.GlobalAveragePooling1D()(embeds)
outputs = keras.layers.Dense(3, activation="softmax", name="classifier")(embeds)
model = keras.Model(inputs, outputs)

# Compile the model with optimizer, loss, and metrics
model.compile(
    optimizer=keras.optimizers.Adam(5e-6),
    loss=keras.losses.CategoricalCrossentropy(label_smoothing=0.02),
    metrics=[
        log_loss,
        keras.metrics.CategoricalAccuracy(name="accuracy"),
    ],
)

from keras.callbacks import EarlyStopping, ModelCheckpoint
callbacks = [
    EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True),
    ModelCheckpoint("best_model.keras", save_best_only=True)
]
'''
import math
from tensorflow.keras import mixed_precision, regularizers
from tensorflow.keras.callbacks import ReduceLROnPlateau  # used in the Stage 2 callbacks
from tensorflow.keras.optimizers.schedules import LearningRateSchedule

# 1) Mixed precision
mixed_precision.set_global_policy("mixed_float16")

# 2) Prepare checkpoint dir
ckpt_dir = Path("checkpoints")
ckpt_dir.mkdir(exist_ok=True)

# 3) Model architecture
inputs = {
    "input_ids": keras.Input((2, CFG.sequence_length), tf.int32, name="input_ids"),
    "attention_mask": keras.Input((2, CFG.sequence_length), tf.int32, name="attention_mask"),
}

backbone = keras_nlp.models.DebertaV3Backbone.from_preset(CFG.preset)
backbone.trainable = False

def slice_pair_and_rename(d, idx):
    return {
        "token_ids":    d["input_ids"][:, idx, :],
        "padding_mask": d["attention_mask"][:, idx, :]
    }

a = backbone(slice_pair_and_rename(inputs, 0))
b = backbone(slice_pair_and_rename(inputs, 1))

x = keras.layers.Concatenate(axis=1)([a, b])
x = keras.layers.GlobalAveragePooling1D()(x)
# add extra Dense + Dropout per updated CFG.dropout_rate
x = keras.layers.Dense(
    256,
    activation="relu",
    kernel_regularizer=regularizers.l2(CFG.weight_decay)
)(x)
x = keras.layers.Dropout(CFG.dropout_rate)(x)
outputs = keras.layers.Dense(
    len(CFG.class_names),
    activation="softmax",
    dtype="float32",
    kernel_regularizer=regularizers.l2(CFG.weight_decay),
    name="classifier"
)(x)

model = keras.Model(inputs, outputs)

# 4) Cosine-with-warmup schedule
class CosineWarmupSchedule(LearningRateSchedule):
    def __init__(self, lr_max, lr_min, warmup_steps, total_steps):
        super().__init__()
        self.lr_max = lr_max; self.lr_min = lr_min
        self.warmup_steps = warmup_steps; self.total_steps = total_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        ws = tf.cast(self.warmup_steps, tf.float32)
        ts = tf.cast(self.total_steps, tf.float32)
        warm = self.lr_max * (step / ws)
        dec_step = step - ws
        dec_steps = ts - ws
        cosine = 0.5 * (1 + tf.cos(math.pi * dec_step / dec_steps))
        decay_lr = self.lr_min + (self.lr_max - self.lr_min) * cosine
        return tf.where(step < ws, warm, decay_lr)

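# compute total_steps & warmup_steps using updated CFG
steps_per_epoch = len(train_df) // CFG.batch_size
total_steps   = (CFG.head_only_epochs + CFG.fine_tune_epochs) * steps_per_epoch
warmup_steps  = int(CFG.head_only_epochs * steps_per_epoch * CFG.warmup_ratio)

# Minimum LR for the cosine decay. Assumption: 1e-7, matching min_lr in
# ReduceLROnPlateau. CFG.weight_decay (1e-4) is larger than CFG.learning_rate
# (5e-6), so using it as lr_min would make the schedule rise instead of decay.
lr_min = 1e-7
lr_schedule = CosineWarmupSchedule(
    CFG.learning_rate, lr_min, warmup_steps, total_steps
)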

# 5) class weights
counts = train_df["class_label"].value_counts().to_dict()
total  = len(train_df)
class_weight = {k: total/v for k, v in counts.items()}

# 6) Stage 1 callbacks
callbacks_stage1 = [
    EarlyStopping(monitor="val_loss", patience=CFG.head_only_epochs,
                  restore_best_weights=True, verbose=1),
    ModelCheckpoint(filepath=str(ckpt_dir/"best_head.weights.h5"),
                    monitor="val_loss", save_best_only=True,
                    save_weights_only=True, verbose=1)
]

# 7) Stage 1: train head only
optimizer1 = AdamW(learning_rate=lr_schedule, weight_decay=CFG.weight_decay)
model.compile(
    optimizer=optimizer1,
    loss=keras.losses.SparseCategoricalCrossentropy(),
    metrics=[keras.metrics.SparseCategoricalAccuracy(name="accuracy")]
)
print(f" Stage1: training head for {CFG.head_only_epochs} epochs")
history_head = model.fit(
    train_ds,
    validation_data=valid_ds,
    epochs=CFG.head_only_epochs,
    class_weight=class_weight,
    callbacks=callbacks_stage1
)

# 8) Unfreeze more layers & Stage 2 callbacks
backbone.trainable = True
for layer in backbone.layers[:-CFG.unfrozen_backbone_layers]:
    layer.trainable = False

callbacks_stage2 = [
    EarlyStopping(monitor="val_loss", patience=CFG.fine_tune_epochs,
                  restore_best_weights=True, verbose=1),
    ModelCheckpoint(filepath=str(ckpt_dir/
        f"best_finetuned.epoch{{epoch:02d}}_val{{val_loss:.4f}}.weights.h5"),
                    monitor="val_loss", mode="min",
                    save_best_only=True, save_weights_only=True, verbose=1),
    ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                      patience=2, min_lr=1e-7, verbose=1)
]

# 9) Stage 2: fine-tune backbone
optimizer2 = AdamW(learning_rate=CFG.learning_rate * 0.1,
                   weight_decay=CFG.weight_decay)
model.compile(
    optimizer=optimizer2,
    loss=keras.losses.SparseCategoricalCrossentropy(),
    metrics=[keras.metrics.SparseCategoricalAccuracy(name="accuracy")]
)
print(f" Stage2: fine-tuning last {CFG.unfrozen_backbone_layers} layers "
      f"for {CFG.fine_tune_epochs} epochs")
history_ft = model.fit(
    train_ds,
    validation_data=valid_ds,
    initial_epoch=CFG.head_only_epochs,
    epochs=CFG.head_only_epochs + CFG.fine_tune_epochs,
    class_weight=class_weight,
    callbacks=callbacks_stage2
)

### Model Summary

In [None]:
model.summary()

### Model Plot

In the model graph below, it may look as though there are **four** inputs, but there are really only **two**, as discussed before: one for each response. Each input consists of `token_ids` and `padding_mask`, so the plot shows four input tensors for two logical inputs.
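
To make the input layout concrete, here is a minimal NumPy sketch. It assumes the two responses are stacked along axis 1; the batch size, sequence length, and dummy values are illustrative assumptions rather than this notebook's configuration. Only the `token_ids` and `padding_mask` names come from the text above.

```python
import numpy as np

# Illustrative sizes (assumptions, not the notebook's configuration).
batch_size, num_responses, seq_len = 4, 2, 8

# Two logical inputs (one per response), each carrying two tensors.
inputs = {
    "token_ids": np.zeros((batch_size, num_responses, seq_len), dtype="int32"),
    "padding_mask": np.ones((batch_size, num_responses, seq_len), dtype="bool"),
}

# Slice out the tensors that belong to each response.
per_response = [
    {name: tensor[:, i, :] for name, tensor in inputs.items()}
    for i in range(num_responses)
]

# Two responses x two tensors = four arrays in total.
num_arrays = sum(len(resp) for resp in per_response)
for i, resp in enumerate(per_response):
    print(i, {name: t.shape for name, t in resp.items()})
```

Counting the arrays in `per_response` gives four, even though the model only sees two logical inputs.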

In [None]:
# Currently throws an error (probably a library or environment issue), so the call stays commented out for now

# keras.utils.plot_model(model, show_shapes=True, show_layer_names=True)

# 🚂 | Training

In [None]:
'''
import wandb
#from wandb.keras import WandbCallback

# Initialize the wandb project (before fit())
wandb.init(project="kerasnlp-training", name="deberta-run")

# Start training the model and log the run to wandb
history = model.fit(
    train_ds,
    epochs=CFG.epochs,
    validation_data=valid_ds,
    callbacks=[lr_cb, ckpt_cb, WandbCallback()]
)
'''
'''
from keras.callbacks import EarlyStopping, ModelCheckpoint
callbacks = [
    EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True),
    ModelCheckpoint("best_model.keras", save_best_only=True)
]
'''


In [None]:
# Start training the model
'''
# Start training
history = model.fit(
    train_ds,
    validation_data=valid_ds,
    epochs=CFG.epochs,
    callbacks=callbacks,   # <- reuse the unified callbacks list
)
'''
'''
# Stage 1: train the head only
history_head = model.fit(
    train_ds,
    validation_data=valid_ds,
    epochs=CFG.epochs,
    callbacks=callbacks,   # <- reuse the unified callbacks list
)
# Stage 2: unfreeze and fine-tune
history_ft   = model.fit(
    train_ds,
    validation_data=valid_ds,
    epochs=CFG.epochs,
    callbacks=callbacks,   # <- reuse the unified callbacks list
)
'''

In [None]:
import matplotlib.pyplot as plt

# 1. Merge the metrics from the two history objects
loss = history_head.history['loss'] + history_ft.history['loss']
val_loss = history_head.history.get('val_loss', []) + history_ft.history.get('val_loss', [])

acc = history_head.history.get('sparse_categorical_accuracy', history_head.history.get('accuracy', [])) \
      + history_ft.history.get('sparse_categorical_accuracy', history_ft.history.get('accuracy', []))
val_acc = history_head.history.get('val_sparse_categorical_accuracy', history_head.history.get('val_accuracy', [])) \
          + history_ft.history.get('val_sparse_categorical_accuracy', history_ft.history.get('val_accuracy', []))

# 2. x-axis: total number of epochs
epochs = list(range(1, len(loss) + 1))

# 3. Plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Left: loss
axes[0].plot(epochs, loss, label='Train Loss')
if val_loss:
    axes[0].plot(epochs, val_loss, label='Validation Loss')
axes[0].axvline(x=CFG.head_only_epochs + 0.5, color='gray', linestyle='--',
                label='Unfreeze Backbone')  # mark the point where the backbone is unfrozen
axes[0].set_title('Loss over Epochs')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].legend()

# Right: accuracy
axes[1].plot(epochs, acc, label='Train Accuracy')
if val_acc:
    axes[1].plot(epochs, val_acc, label='Validation Accuracy')
axes[1].axvline(x=CFG.head_only_epochs + 0.5, color='gray', linestyle='--',
                label='Unfreeze Backbone')
axes[1].set_title('Accuracy over Epochs')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].legend()

plt.tight_layout()
plt.show()

## Load Best Model

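After training, let's load the best weights. Because `ModelCheckpoint` was configured with `save_best_only=True`, the checkpoint saved at the highest epoch within a single run is also the one with the lowest validation loss in Stage 2.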
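
As a hedged alternative to selecting by epoch, the validation loss can be read directly from the checkpoint filename. The sketch below assumes the filename pattern from the Stage 2 `ModelCheckpoint` template (`best_finetuned.epoch{epoch:02d}_val{val_loss:.4f}.weights.h5`); the helper name `pick_lowest_val_loss` and the example filenames are hypothetical.

```python
import re
from pathlib import Path

# Pattern assumed from the Stage 2 checkpoint filename template.
VAL_LOSS_RE = re.compile(r"_val(\d+\.\d+)\.weights\.h5$")

def pick_lowest_val_loss(paths):
    """Return the path whose filename encodes the smallest val_loss, or None."""
    scored = []
    for p in map(Path, paths):
        m = VAL_LOSS_RE.search(p.name)
        if m:  # skip files without a val_loss tag, e.g. best_head.weights.h5
            scored.append((float(m.group(1)), p))
    return min(scored, key=lambda item: item[0])[1] if scored else None

# Hypothetical filenames, for illustration only.
example = [
    "checkpoints/best_head.weights.h5",
    "checkpoints/best_finetuned.epoch03_val1.0412.weights.h5",
    "checkpoints/best_finetuned.epoch05_val1.0187.weights.h5",
]
best = pick_lowest_val_loss(example)
print(best.name)
```

Selecting by the encoded loss stays correct even if the checkpoint directory contains files left over from an earlier run with more epochs.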

In [None]:
'''
ckpt_dir = Path("checkpoints")
# Find the most recently saved checkpoint
latest_ckpt = tf.train.latest_checkpoint(str(ckpt_dir))
print("Loading weights from:", latest_ckpt)
model.load_weights(latest_ckpt)
'''
import re
from pathlib import Path

# 1. Specify the checkpoint directory
ckpt_dir = Path("checkpoints")
if not ckpt_dir.exists():
    raise FileNotFoundError(f"Checkpoint directory not found: {ckpt_dir}")

# 2. List all .weights.h5 files (both head-only and fine-tuned)
weight_files = list(ckpt_dir.glob("*.weights.h5"))
if not weight_files:
    raise FileNotFoundError(f"No .weights.h5 files found in {ckpt_dir}")

# 3. Parse the epoch number from the filename; head-only files count as epoch 0
def extract_epoch(fp: Path) -> int:
    m = re.search(r"epoch(\d+)", fp.name)
    return int(m.group(1)) if m else 0

# 4. Pick the file with the highest epoch
latest_file = max(weight_files, key=extract_epoch)

# 5. Load the weights
model.load_weights(str(latest_file))
print(f"Loaded weights from: {latest_file}")

# 🧪 | Prediction

In [None]:
# Build test dataset
test_texts = test_df["options"].tolist()

test_ds = build_dataset(
    options=test_texts,
    labels=None,                     # explicitly indicate there are no labels
    batch_size=CFG.batch_size,       # or min(len(test_df), CFG.batch_size)
    cache=False,                     # no caching
    shuffle_size=0                   # no shuffling
)


In [None]:
# Make predictions using the trained model on test data
test_preds = model.predict(test_ds, verbose=1)

# 📬 | Submission

The following code prepares the submission file.
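
Before writing the file, it can help to sanity-check the predictions. The sketch below is a minimal check, assuming the model outputs softmax probabilities with one column per class; the helper name `check_probabilities`, the tolerance, and the dummy array are illustrative assumptions.

```python
import numpy as np

def check_probabilities(preds, num_classes, atol=1e-3):
    """Raise ValueError unless preds is an (n, num_classes) array of row-normalised probabilities."""
    preds = np.asarray(preds, dtype="float64")
    if preds.ndim != 2 or preds.shape[1] != num_classes:
        raise ValueError(f"expected shape (n, {num_classes}), got {preds.shape}")
    if not np.isfinite(preds).all():
        raise ValueError("predictions contain NaN or inf")
    if (preds < 0).any() or (preds > 1).any():
        raise ValueError("probabilities must lie in [0, 1]")
    if not np.allclose(preds.sum(axis=1), 1.0, atol=atol):
        raise ValueError("each row should sum to 1")
    return preds

# Dummy predictions for illustration; in the notebook this would be `test_preds`.
dummy = np.array([[0.2, 0.5, 0.3], [0.6, 0.1, 0.3]])
checked = check_probabilities(dummy, num_classes=3)
print(checked.shape)
```

The tolerance is deliberately loose because mixed-precision outputs may not sum to exactly 1.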

In [None]:
'''
sub_df = test_df[["id"]].copy()
sub_df[CFG.class_names] = test_preds.tolist()
sub_df.to_csv("submission.csv", index=False)
sub_df.head()
'''
# 1. Turn the predictions into a DataFrame
preds_df = pd.DataFrame(test_preds, columns=CFG.class_names)

# 2. Attach the id column
sub_df = pd.concat([test_df["id"].reset_index(drop=True), preds_df], axis=1)

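# 3. (Optional) Add the final predicted class, for inspection only
sub_df["pred_class"] = preds_df.idxmax(axis=1)

# 4. Write only the id and probability columns, so the helper column
#    does not end up in the submission file, then print a confirmation
submit_cols = ["id"] + list(CFG.class_names)
sub_df[submit_cols].to_csv("submission.csv", index=False)
print("Saved submission.csv with columns:", submit_cols)
sub_df.head()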

# 🔭 | Future Directions

In this notebook, we've achieved a good score with a small model and modest token length. But there's plenty of room to improve. Here's how:

1. Try bigger models like `Deberta-Base` or `Deberta-Small`, or even LLMs like `Gemma`.
2. Increase the maximum token length so that less text is truncated.
3. Use five-fold cross-validation and ensembling to make the model more robust and improve the score.
4. Add augmentation like shuffling response orders for more robust performance.
5. Train for more epochs.
6. Tune the learning rate scheduler.
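
As an illustration of the response-order augmentation in item 4, here is a minimal sketch in plain Python. It assumes the label encoding 0 = response A wins, 1 = response B wins, 2 = tie; check this against the label mapping used earlier in the notebook. The helper names `swap_responses` and `augment` are hypothetical.

```python
import random

# Assumed label encoding: 0 = response A wins, 1 = response B wins, 2 = tie.
SWAPPED_LABEL = {0: 1, 1: 0, 2: 2}

def swap_responses(options, label):
    """Swap the two responses and remap the label accordingly."""
    first, second = options
    return [second, first], SWAPPED_LABEL[label]

def augment(options, label, rng, p=0.5):
    """With probability p, return the swapped example; otherwise return it unchanged."""
    if rng.random() < p:
        return swap_responses(options, label)
    return list(options), label

rng = random.Random(42)
opts, lab = swap_responses(["prompt + response A", "prompt + response B"], 0)
print(opts, lab)
```

Because a tie stays a tie and the two win labels trade places, the swapped example remains consistent with the original judgment. Applying the swap only to training data keeps validation scores comparable.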

# 📌 | Reference

* [LLM Science Exam: KerasCore + KerasNLP [TPU]](https://www.kaggle.com/code/awsaf49/llm-science-exam-kerascore-kerasnlp-tpu)
* [AES 2.0: KerasNLP Starter](https://www.kaggle.com/code/awsaf49/aes-2-0-kerasnlp-starter)