# Gemma Classification Arena Notebook

## Section 1: Import Libraries and Basic Configuration

### Import Libraries and Suppress Warnings

- **Purpose**: Import necessary libraries and suppress warnings to keep the notebook clean.
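- **Notes**:
  - `warnings.filterwarnings("ignore")` suppresses all Python warnings for the rest of the session, including deprecation notices that may be worth reading.
  - `logging.getLogger("transformers.modeling_utils").setLevel(logging.ERROR)` limits that one logger to errors; other `transformers` loggers keep their default verbosity.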


In [1]:
import time
import warnings
warnings.filterwarnings("ignore")

import logging

# Suppress warnings from the transformers library
logging.getLogger("transformers.modeling_utils").setLevel(logging.ERROR)
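
A narrower alternative to silencing everything is to ignore a single warning category. The sketch below uses only the standard library; `noisy` is a hypothetical helper invented for illustration, not part of the notebook.

```python
import logging
import warnings


def noisy():
    """Hypothetical helper that emits two different warning categories."""
    warnings.warn("old API", DeprecationWarning)
    warnings.warn("possible data issue", UserWarning)
    return "done"


# Record warnings instead of printing them, so the filtering is visible.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # start from "show everything"
    warnings.filterwarnings("ignore", category=DeprecationWarning)  # silence one category
    noisy()

caught_categories = [w.category.__name__ for w in caught]
print(caught_categories)  # only the UserWarning is recorded

# setLevel on a named logger affects that logger (and its children) only.
modeling_logger = logging.getLogger("transformers.modeling_utils")
modeling_logger.setLevel(logging.ERROR)
print(modeling_logger.isEnabledFor(logging.WARNING))  # False
print(modeling_logger.isEnabledFor(logging.ERROR))    # True
```

Filtering by category keeps data-quality warnings visible while still hiding the noise that motivated the blanket filter.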

### Import Additional Libraries

- **Purpose**: Import libraries for model training, evaluation, and data handling.
- **Notes**:
  - `transformers` provides tools for working with pre-trained models.
  - `datasets` is used for handling datasets.
  - `torch` is the PyTorch library for deep learning.
  - `sklearn` provides metrics for model evaluation.
  - `peft` is used for parameter-efficient fine-tuning.


In [2]:
%%time
# Only the names used in this notebook are imported (BitsAndBytesConfig is for the optional quantization below)
from transformers import (AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments,
                          AutoConfig, BitsAndBytesConfig, DataCollatorWithPadding)
from datasets import Dataset
import torch
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from peft import PeftModel, get_peft_model, LoraConfig, TaskType

CPU times: user 15 s, sys: 2.3 s, total: 17.3 s
Wall time: 24 s


### Ensure GPU Utilization

- **Purpose**: Ensure the model uses GPU if available and clear GPU memory.
- **Notes**:
  - `torch.cuda.is_available()` checks if a GPU is available.
  - `torch.cuda.empty_cache()` releases cached GPU memory that PyTorch holds but is not currently using.


In [3]:
# Ensure GPU utilization
device = "cuda" if torch.cuda.is_available() else "cpu"
torch.cuda.empty_cache()

## Section 2: Load and Preprocess Dataset

### Load Dataset

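- **Purpose**: Load the competition data, then build the model inputs and split them for training and evaluation.
- **Notes**:
  - `pd.read_parquet` reads the train and test files into pandas DataFrames.
  - `train_test_split` splits the dataset into training and evaluation sets (80/20 here).
  - `Dataset.from_pandas` converts a pandas DataFrame to a Hugging Face Dataset.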

In [4]:

train_data = pd.read_parquet("/kaggle/input/wsdm-cup-multilingual-chatbot-arena/train.parquet")
test_data = pd.read_parquet("/kaggle/input/wsdm-cup-multilingual-chatbot-arena/test.parquet")

In [5]:
train_data.iloc[0]

id            00007cff95d7f7974642a785aca248b0f26e60d3312fac...
prompt                                       vieš po Slovensky?
response_a     Áno, hovorím po slovensky. Ako vám môžem pomôcť?
response_b    Áno, veď som tu! Môžem ti pomôcť s otázkami al...
winner                                                  model_a
model_a                                              o1-preview
model_b                                      reka-core-20240904
language                                                 Slovak
Name: 0, dtype: object

In [6]:
%%time
# Preprocess function: build the model input text and the integer label
# (model_a -> 0, any other winner value -> 1)
def preprocess_function(row):
    return {
        "input_text": f"Prompt: {row['prompt']} | Response A: {row['response_a']} | Response B: {row['response_b']}",
        "labels": 0 if row['winner'] == 'model_a' else 1
    }

processed_data = train_data.apply(preprocess_function, axis=1)
df = pd.DataFrame(processed_data.tolist())

# Split into train and evaluation datasets
train_data, eval_data = train_test_split(df, test_size=0.2, random_state=42)
train_dataset = Dataset.from_pandas(train_data)
eval_dataset = Dataset.from_pandas(eval_data)

CPU times: user 1.93 s, sys: 2.1 s, total: 4.03 s
Wall time: 3.97 s
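
Because the label expression maps every value other than `model_a` to 1, an unexpected `winner` value would be silently labeled as `model_b`. The sketch below shows a stricter variant on plain dictionaries; `LABEL_MAP`, `build_example`, and the toy record are hypothetical names and data used only for illustration.

```python
LABEL_MAP = {"model_a": 0, "model_b": 1}


def build_example(row):
    """Format one record like preprocess_function, but fail loudly on unknown winners."""
    winner = row["winner"]
    if winner not in LABEL_MAP:
        raise ValueError(f"unexpected winner value: {winner!r}")
    text = (
        f"Prompt: {row['prompt']} | "
        f"Response A: {row['response_a']} | "
        f"Response B: {row['response_b']}"
    )
    return {"input_text": text, "labels": LABEL_MAP[winner]}


# Toy record (made up for this example, not taken from the dataset).
toy_row = {
    "prompt": "What is 2 + 2?",
    "response_a": "5",
    "response_b": "4",
    "winner": "model_b",
}
example = build_example(toy_row)
print(example["input_text"])
print(example["labels"])
```

Whether a strict check is worthwhile depends on the data; if the `winner` column only ever contains the two expected values, both versions produce the same labels.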


## Section 3: Load and Configure Model

### Load Model and Tokenizer

- **Purpose**: Load the pre-trained Gemma model and tokenizer, and configure them for sequence classification.
- **Notes**:
  - `AutoConfig.from_pretrained` loads the model configuration.
  - `AutoTokenizer.from_pretrained` loads the tokenizer.
  - `AutoModelForSequenceClassification.from_pretrained` loads the model for sequence classification.
  - `model.tie_weights()` ensures shared weights are properly linked.
  - `model.resize_token_embeddings(len(tokenizer))` resizes the token embeddings to match the tokenizer.


In [7]:
%%time

model_path = "/kaggle/input/gemma/transformers/2b-it/3"

# Optional: Configure quantization (uncomment if needed and supported)
# quantization_config = BitsAndBytesConfig(load_in_4bit=True)

# Load base model configuration
config = AutoConfig.from_pretrained(model_path)
config.hidden_activation = "gelu"
config.use_cache = False
config.num_labels = 2
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(
    model_path,
    config=config,
    # quantization_config=quantization_config,
    device_map="auto",
    ignore_mismatched_sizes=True
)

# If any parameters are still on the "meta" device (shape only, no data), materialize them
if any(param.device.type == "meta" for param in model.parameters()):
    print("Meta tensors found. Initializing...")
    model.tie_weights()  # Re-link shared weights
    # to_empty() allocates uninitialized storage on the target device;
    # it does not restore pretrained values for the affected tensors
    model = model.to_empty(device=device)
    model = model.to(device)  # Ensure every parameter is on the target device

# Print model to get hidden layer names, to use it later in LoRA configuration
print(model)

# Adjust tokenizer and model
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.resize_token_embeddings(len(tokenizer))

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

GemmaForSequenceClassification(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear(in_features=16384, out_features=2048, bias=False)
          (act_fn): GELUActivation()
        )
        (input_layernorm): GemmaRMSNorm((2048,), eps=1e-06)
        (post_attention_layernorm): GemmaRMSNorm((20

Embedding(256000, 2048, padding_idx=0)

### Apply LoRA (Low-Rank Adaptation) Configuration
- **Purpose**: Apply LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning.
- **Notes**:
  - `LoraConfig` configures LoRA with specific parameters:
    - `task_type=TaskType.SEQ_CLS` sets the task type to sequence classification.
    - `r=8` sets the rank of the low-rank matrices.
    - `lora_alpha=16` sets the scaling numerator; with `r=8` the adapter update is scaled by `lora_alpha / r = 2`.
    - `lora_dropout=0.1` sets the dropout rate.
    - `target_modules=["q_proj", "v_proj"]` specifies the target modules for LoRA.

In [8]:
%%time

# Apply LoRA configuration
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    task_type=TaskType.SEQ_CLS,
    target_modules=["q_proj", "v_proj"]
)
model = get_peft_model(model, lora_config)

CPU times: user 53.4 ms, sys: 0 ns, total: 53.4 ms
Wall time: 54.4 ms
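
To get a feel for how small the adapter is, the arithmetic below counts the LoRA matrices this configuration adds, using the layer shapes from the model printout above (18 decoder layers, `q_proj` 2048→2048, `v_proj` 2048→256). It is a back-of-the-envelope sketch: it counts only the low-rank matrices and ignores any other parameters that may stay trainable, such as a classification head.

```python
r = 8
lora_alpha = 16
num_layers = 18  # from the printed ModuleList: (0-17): 18 x GemmaDecoderLayer

# (in_features, out_features) of each targeted projection, read from the printout
target_shapes = {
    "q_proj": (2048, 2048),
    "v_proj": (2048, 256),
}


def lora_param_count(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank) per target module."""
    return rank * in_features + out_features * rank


per_layer = sum(lora_param_count(i, o, r) for i, o in target_shapes.values())
total_adapter_params = per_layer * num_layers
scaling = lora_alpha / r  # factor applied to the low-rank update

print(per_layer)             # 51200
print(total_adapter_params)  # 921600
print(scaling)               # 2.0
```

Under these assumptions the adapter adds under one million weights, a small fraction of the base model.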


## Section 4: Prepare Data for Training

### Subset Data for Debugging
- **Purpose**: Subset the dataset for faster debugging and testing.
- **Notes**:
  - `train_dataset.shuffle(seed=42).select(range(subset_size))` shuffles and selects a subset of the training data.
  - `eval_dataset.shuffle(seed=42).select(range(subset_size))` does the same for the evaluation data.


In [9]:
%%time

# Subset the data for faster debugging
subset_size = 900  # Number of examples to use for training
train_dataset = train_dataset.shuffle(seed=42).select(range(subset_size))
subset_size = int(subset_size*2/8)  # 225 evaluation examples, keeping the 8:2 train/eval ratio
eval_dataset = eval_dataset.shuffle(seed=42).select(range(subset_size))

CPU times: user 16.9 ms, sys: 873 µs, total: 17.8 ms
Wall time: 18.1 ms


### Tokenize Dataset

- **Purpose**: Tokenize the dataset for model input.
- **Notes**:
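  - `tokenizer` tokenizes the input text; `padding=True` pads to the longest sequence in each mapped batch and `truncation=True` cuts inputs that are too long.
  - `max_length=512` caps each input at 512 tokens. With right-side truncation (the usual default), the end of a long example, typically part of Response B, is what gets dropped.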

In [10]:
%%time
train_dataset = train_dataset.map(lambda x: tokenizer(x['input_text'], padding=True, truncation=True, max_length=512), batched=True)
eval_dataset = eval_dataset.map(lambda x: tokenizer(x['input_text'], padding=True, truncation=True, max_length=512), batched=True)

Map:   0%|          | 0/900 [00:00<?, ? examples/s]

Map:   0%|          | 0/225 [00:00<?, ? examples/s]

CPU times: user 5.88 s, sys: 135 ms, total: 6.01 s
Wall time: 2.34 s
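
The effect of `truncation` and `padding` can be illustrated on plain lists of token ids. This is a toy sketch, not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, it pads on the right with id 0, and it adds no special tokens, whereas a real tokenizer may pad on the left and insert BOS/EOS tokens.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Cut each sequence to max_length, then right-pad to the longest remaining one."""
    truncated = [ids[:max_length] for ids in batch]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids of different lengths
toy_batch = [[5, 6, 7], [8, 9, 10, 11, 12, 13], [14]]
encoded = pad_and_truncate(toy_batch, max_length=4)

print(encoded["input_ids"])       # [[5, 6, 7, 0], [8, 9, 10, 11], [14, 0, 0, 0]]
print(encoded["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 1, 1], [1, 0, 0, 0]]
```

In the second sequence the ids 12 and 13 are lost. In this notebook the analogous loss would fall on the tail of the concatenated text.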


### Set Dataset Format

- **Purpose**: Set the dataset format to PyTorch tensors.
- **Notes**:
  - `set_format` converts the dataset to PyTorch tensors for training.

In [11]:
%%time
train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
eval_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

CPU times: user 984 µs, sys: 0 ns, total: 984 µs
Wall time: 879 µs


## Section 5: Define Metrics and Training Arguments

### Define Metrics

- **Purpose**: Define metrics for model evaluation.
- **Notes**:
  - The `compute_metrics` function calculates evaluation metrics for the model. With `average="binary"`, label 1 (`model_b`) is treated as the positive class:
    - `accuracy`: fraction of predictions that match the ground truth.
    - `precision`: correctly predicted positives / all predicted positives.
    - `recall`: correctly predicted positives / all actual positives.
    - `f1`: harmonic mean of precision and recall.

In [12]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = torch.argmax(torch.tensor(logits), dim=1).numpy()
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average="binary")
    acc = accuracy_score(labels, predictions)
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}
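
The four metrics can be checked by hand from confusion-matrix counts. The counts below are not logged by the notebook; they are an inference, the only whole-number counts over 225 evaluation examples that reproduce the epoch 1 metrics reported in the training output (positive class = label 1, `model_b`).

```python
# Inferred counts (assumption, see above): 40 + 38 + 77 + 70 = 225 evaluation examples
tp, fp, fn, tn = 40, 38, 77, 70

precision = tp / (tp + fp)                          # correct positives / predicted positives
recall = tp / (tp + fn)                             # correct positives / actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)          # all correct / all examples

print(round(accuracy, 6))   # 0.488889
print(round(precision, 6))  # 0.512821
print(round(recall, 6))     # 0.34188
print(round(f1, 6))         # 0.410256
```

Read this way, the epoch 1 model predicts `model_b` for only 78 of 225 examples and is right on roughly half of them, which is close to chance.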

### Define Training Arguments

- **Purpose**: Configure training arguments.
- **Notes**:
  - `TrainingArguments` configures various training parameters:
    - `output_dir="./results"` sets the output directory.
    - `eval_strategy="epoch"` evaluates the model at the end of each epoch.
    - `learning_rate=2e-4` sets the learning rate.
    - `per_device_train_batch_size=2` sets the batch size for training.
    - `num_train_epochs=3` sets the number of training epochs.
    - `weight_decay=0.01` sets the weight decay for regularization.
    - `fp16=torch.cuda.is_available()` enables mixed precision training if a GPU is available.

In [13]:
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-4,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="epoch",
    logging_dir="./logs",
    logging_steps=100,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    fp16=torch.cuda.is_available(),
    # fp16_opt_level="O2",
    report_to="none",
    warmup_steps=500,
    save_steps=500,  # No effect here: checkpoints follow save_strategy="epoch"
    gradient_checkpointing=True
)
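
The number of optimizer steps implied by these arguments can be worked out directly. The sketch assumes a single device and no gradient accumulation (the `TrainingArguments` default), together with the 900-example training subset selected earlier.

```python
import math

train_examples = 900          # size of the training subset
per_device_batch_size = 2     # per_device_train_batch_size
num_devices = 1               # assumption: one GPU
gradient_accumulation = 1     # assumption: default value
num_epochs = 3
warmup_steps = 500

effective_batch = per_device_batch_size * num_devices * gradient_accumulation
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs
warmup_fraction = warmup_steps / total_steps

print(steps_per_epoch)            # 450
print(total_steps)                # 1350
print(round(warmup_fraction, 2))  # 0.37
```

With `warmup_steps=500`, roughly the first 37% of training runs below the target learning rate, which is worth keeping in mind when the subset size changes.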

## Section 6: Initialize Trainer and Train

### Initialize Trainer

- **Purpose**: Initialize the Trainer for model training.
- **Notes**:
  - `DataCollatorWithPadding` pads the examples in each batch to the length of the longest one in that batch.
  - `Trainer` handles the training loop, evaluation, and saving of the model.

In [14]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

### Clear GPU Cache

- **Purpose**: Clear GPU memory before training.
- **Notes**: Ensures that there is enough memory available for training.

In [15]:
torch.cuda.empty_cache()

### Train the Model


- **Purpose**: Train the model using the configured Trainer.
- **Notes**: The model is trained for the specified number of epochs.

In [16]:
%%time
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.9423 | 1.050014 | 0.488889 | 0.512821 | 0.34188 | 0.410256 |
| 2 | 0.8588 | 0.989795 | 0.484444 | 0.504202 | 0.512821 | 0.508475 |
| 3 | 0.6088 | 0.983975 | 0.466667 | 0.486486 | 0.461538 | 0.473684 |


CPU times: user 17min 19s, sys: 1min 39s, total: 18min 59s
Wall time: 18min 58s


TrainOutput(global_step=1350, training_loss=0.8322746955023872, metrics={'train_runtime': 1138.6269, 'train_samples_per_second': 2.371, 'train_steps_per_second': 1.186, 'total_flos': 1.64462541668352e+16, 'train_loss': 0.8322746955023872, 'epoch': 3.0})

In [17]:
torch.cuda.empty_cache()

## Section 7: Save Model

### Save Model and Tokenizer

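- **Purpose**: Save the trained model and tokenizer for future use.
- **Notes**: Both save calls are commented out in the cell below, so nothing is written to disk as shown. Uncomment them to save the model and tokenizer to `./trained_gemma_predictor_model`.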

In [18]:
# model.save_pretrained("./trained_gemma_predictor_model")
# tokenizer.save_pretrained("./trained_gemma_predictor_model")

## Section 8: Prediction Function

### Define Prediction Function


- **Purpose**: Define a function to predict the better response given a prompt and two responses, based on highest probability.
- **Notes**:
  - `tokenizer` tokenizes the input text.
  - `model(**inputs)` generates predictions.
  - `torch.nn.functional.softmax` converts logits to probabilities.

In [19]:
def predict(prompt, response_a, response_b):
    input_text = f"Prompt: {prompt} | Response A: {response_a} | Response B: {response_b}"
    inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
    model.eval()  # Disable dropout (including LoRA dropout) for inference
    with torch.no_grad():  # Gradients are not needed for prediction
        outputs = model(**inputs)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    return "model_a" if probabilities[0][0] > probabilities[0][1] else "model_b"
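
The decision rule in `predict` can be traced with plain numbers. The sketch below reimplements softmax with the standard library; `softmax`, `pick_winner`, and the logit values are hypothetical and only illustrate the rule. Since softmax preserves the order of its inputs, comparing probabilities picks the same winner as comparing the raw logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_winner(logits):
    """Same rule as predict: index 0 is model_a, index 1 is model_b."""
    probs = softmax(logits)
    return "model_a" if probs[0] > probs[1] else "model_b"


toy_logits = [0.3, 1.1]  # made-up logits for (model_a, model_b)
toy_probs = softmax(toy_logits)

print([round(p, 3) for p in toy_probs])
print(pick_winner(toy_logits))   # model_b
print(pick_winner([2.0, -1.0]))  # model_a
```

On an exact tie the rule falls through to `model_b`, since the comparison is strict.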

## Section 9: Evaluate Responses

### Evaluate Responses
- **Purpose**: Evaluate responses on the test dataset.
- **Notes**:
  - `evaluate_responses` iterates over the test data and predicts the better response for each row.

In [20]:
def evaluate_responses(test_data):
    results = []
    for _, row in test_data.iterrows():
        results.append({
            'id': row["id"],
            'winner': predict(row["prompt"], row["response_a"], row["response_b"])
        })
    return pd.DataFrame(results)

## Section 10: Generate Submission

### Generate and Save Submission
- **Purpose**: Generate predictions for the test dataset and save them to a CSV file.
- **Notes**:
  - `results_df.to_csv("submission.csv", index=False)` saves the results to a CSV file.
  - `print(results_df.head())` prints the first few rows of the results.

In [21]:
results_df = evaluate_responses(test_data)
results_df.to_csv("submission.csv", index=False)
print(results_df.head())

        id   winner
0   327228  model_a
1  1139415  model_a
2  1235630  model_a
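
Before uploading, it can help to sanity-check the file's shape. The sketch below validates CSV text with the standard library; `validate_submission` is a hypothetical helper, and the expected columns (`id`, `winner`) and allowed values are assumptions based on what this notebook writes, not on a confirmed competition specification.

```python
import csv
import io

ALLOWED_WINNERS = {"model_a", "model_b"}


def validate_submission(csv_text):
    """Check header and winner values; return the number of data rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["id", "winner"]:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = list(reader)
    for line_number, row in enumerate(rows, start=2):  # line 1 is the header
        if row["winner"] not in ALLOWED_WINNERS:
            raise ValueError(f"line {line_number}: bad winner {row['winner']!r}")
    return len(rows)


# The three rows printed above, written as CSV text
sample_csv = "id,winner\n327228,model_a\n1139415,model_a\n1235630,model_a\n"
row_count = validate_submission(sample_csv)
print(row_count)  # 3
```

To check the real file, the same function could be called on the contents of `submission.csv` after reading it as text.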
