## Commonsense MCQA

This notebook walks through building a benchmark for steering a model toward better performance on the [CommonsenseQA](https://huggingface.co/datasets/tau/commonsense_qa) problem set. The benchmark compares three steering pipelines: the unsteered baseline model, few-shot steering, and steering via a LoRA adapter.

For convenience, change the current directory to the notebook's directory if necessary:

In [1]:
import os
os.chdir("./notebooks/benchmarks/commonsense_mcqa/")

## Building the use case

The use case of interest has already been constructed via the [use case](../../docs/add_new_use_case.md) tutorial and is available at `aisteer360/evaluation/use_cases/commonsense_mcqa/use_case.py`. It is initialized as follows:

In [2]:
from aisteer360.evaluation.use_cases.commonsense_mcqa.use_case import CommonsenseMCQA
from aisteer360.evaluation.metrics.custom.commonsense_mcqa.mcqa_accuracy import MCQAAccuracy
from aisteer360.evaluation.metrics.custom.commonsense_mcqa.mcqa_positional_bias import MCQAPositionalBias

commonsense_mcqa = CommonsenseMCQA(
    evaluation_data="./data/evaluation_qa.jsonl",
    evaluation_metrics=[
        MCQAAccuracy(),
        MCQAPositionalBias(),
    ],
    num_shuffling_runs=20,
    num_samples=500  # optional
)

Two custom metrics have been created for the use case: `MCQAAccuracy`, which measures accuracy statistics for each question (across trials), and `MCQAPositionalBias`, which measures positional bias (via deviation from the uniform distribution across runs). To support these statistics, the use case accepts a keyword argument `num_shuffling_runs` that sets how many times each question is presented to the (steered) model under a randomized ordering of the choices. The `num_samples` parameter sets how many entries from `evaluation_data` are used during benchmarking.
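
To make the role of `num_shuffling_runs` concrete, the following is a minimal sketch of the idea, not the implementation used by `MCQAPositionalBias`. It assumes that positional bias is scored as the mean absolute deviation between the observed frequency of each selected position and the uniform frequency `1 / num_choices`. The helper names `shuffle_choices` and `positional_bias`, and the example choices, are hypothetical.

```python
import random
from collections import Counter


def shuffle_choices(choices, rng):
    """Return a shuffled copy of the choices (the input list is left untouched)."""
    shuffled = list(choices)
    rng.shuffle(shuffled)
    return shuffled


def positional_bias(selected_positions, num_choices):
    """Mean absolute deviation of selected-position frequencies from uniform.

    This formula is an assumption made for illustration only.
    """
    counts = Counter(selected_positions)
    total = len(selected_positions)
    uniform = 1.0 / num_choices
    deviations = [abs(counts.get(i, 0) / total - uniform) for i in range(num_choices)]
    return sum(deviations) / num_choices


rng = random.Random(0)
choices = ["bank", "library", "department store", "mall", "new york"]

# One shuffled ordering per run, mirroring num_shuffling_runs=20.
runs = [shuffle_choices(choices, rng) for _ in range(20)]

# Two extreme (made-up) behaviors: always picking slot 0 vs. spreading evenly.
always_first = [0] * 20
spread_evenly = [i % 5 for i in range(20)]
print(positional_bias(always_first, 5))
print(positional_bias(spread_evenly, 5))
```

Under this assumed definition, a model that always picks the first slot among five choices scores about 0.32, while a model whose selections are spread evenly across slots scores 0.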

## Defining the controls

The benchmark compares two controls, `FewShot` and `DPO` (with LoRA), both built from the same steering data.

In [3]:
import json

steering_data_path = "data/steer_qa.jsonl"

with open(steering_data_path, "r") as f:
    steering_data = [json.loads(line) for line in f]

steering_data[0]

{'id': '01beaf20-82aa-40b0-8b08-ee08b94e6666',
 'question': 'The spirit ascended to the after life, so what was it leaving?',
 'answer_chosen': 'human being',
 'answer_rejected': 'cemetary'}

The steering data consists of triples `(question, answer_chosen, answer_rejected)` extracted from the CommonsenseQA dataset, where `answer_chosen` is the ground-truth answer and `answer_rejected` is a randomly selected incorrect answer. Both controls (`FewShot` and `DPO` with LoRA) are based on the same steering data.
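
As a rough illustration of how such a triple could be derived, the sketch below assumes a record layout with `question`, `choices` (parallel `label` and `text` lists), and `answerKey` fields, as in the Hugging Face copy of CommonsenseQA. The function `make_steering_triple` and the sample record are hypothetical and are not the script used to produce `steer_qa.jsonl`.

```python
import random


def make_steering_triple(record, rng):
    """Turn one multiple-choice record into a (question, chosen, rejected) triple."""
    labels = record["choices"]["label"]
    texts = record["choices"]["text"]
    # The ground-truth answer is the choice whose label matches answerKey.
    answer_chosen = texts[labels.index(record["answerKey"])]
    # The rejected answer is drawn uniformly from the remaining choices.
    incorrect = [text for text in texts if text != answer_chosen]
    return {
        "question": record["question"],
        "answer_chosen": answer_chosen,
        "answer_rejected": rng.choice(incorrect),
    }


# Hypothetical record written in the assumed dataset layout.
record = {
    "question": "Where would you keep milk so that it stays cold?",
    "choices": {
        "label": ["A", "B", "C", "D", "E"],
        "text": ["refrigerator", "bookshelf", "garage", "oven", "mailbox"],
    },
    "answerKey": "A",
}

triple = make_steering_triple(record, random.Random(0))
print(triple)
```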

### Defining the few shot control

The `FewShot` control requires specification of example pools. As shown below, each positive example is given by the pair (`question`, `answer_chosen`), whereas each negative example is given by the pair (`question`, `answer_rejected`).

In [4]:
positive_pool = []
negative_pool = []
for row in steering_data:
    positive_pool.append({
        "question": row["question"],
        "answer": row["answer_chosen"]
    })
    negative_pool.append({
        "question": row["question"],
        "answer": row["answer_rejected"]
    })

These pools are then passed to the `FewShot` class at instantiation, along with the name of the example selector (which determines how examples are drawn from the pools; defaults to `random`) and the number of positive and negative examples the selector should draw.

In [5]:
from aisteer360.algorithms.input_control.few_shot.control import FewShot

few_shot = FewShot(
    selector_name="random",
    positive_example_pool=positive_pool,
    negative_example_pool=negative_pool,
    k_positive=25,
    k_negative=25
)
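
The internals of `FewShot` are not shown in this notebook. As a rough mental model only, a random selector can be thought of as drawing `k` examples from each pool without replacement and rendering them into the prompt. The helpers `draw_examples` and `format_examples`, the prompt template, and the second toy question below are all hypothetical.

```python
import random


def draw_examples(pool, k, rng):
    """Draw up to k examples from the pool without replacement."""
    return rng.sample(pool, min(k, len(pool)))


def format_examples(positive, negative):
    """Render drawn examples as a plain-text block (hypothetical template)."""
    blocks = []
    for ex in positive:
        blocks.append(f"Q: {ex['question']}\nGood answer: {ex['answer']}")
    for ex in negative:
        blocks.append(f"Q: {ex['question']}\nBad answer: {ex['answer']}")
    return "\n\n".join(blocks)


# Toy pools: the first entry mirrors steering_data[0], the second is made up.
toy_positive_pool = [
    {"question": "The spirit ascended to the after life, so what was it leaving?",
     "answer": "human being"},
    {"question": "Where would you keep milk so that it stays cold?",
     "answer": "refrigerator"},
]
toy_negative_pool = [
    {"question": "The spirit ascended to the after life, so what was it leaving?",
     "answer": "cemetary"},
    {"question": "Where would you keep milk so that it stays cold?",
     "answer": "oven"},
]

rng = random.Random(0)
drawn_positive = draw_examples(toy_positive_pool, 1, rng)
drawn_negative = draw_examples(toy_negative_pool, 1, rng)
few_shot_block = format_examples(drawn_positive, drawn_negative)
print(few_shot_block)
```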

### Defining the DPO (with LoRA) control

The `DPO` control fine-tunes a LoRA adapter with direct preference optimization. Each steering triple is mapped to a preference record with `prompt`, `chosen`, and `rejected` fields, and the records are wrapped in a Hugging Face `Dataset` before being passed to the control.



In [6]:
from datasets import Dataset
from peft import PeftType
from aisteer360.algorithms.structural_control.wrappers.trl.dpotrainer.control import DPO


train_examples = []
for row in steering_data:
    train_examples.append({
        "prompt": row["question"],
        "chosen": row["answer_chosen"],
        "rejected": row["answer_rejected"]
    })
train_ds = Dataset.from_list(train_examples)

# instantiate dpo control
dpo_lora = DPO(
    train_dataset=train_ds,
    use_peft=True,
    peft_type=PeftType.LORA,
    **{
        "per_device_train_batch_size": 4,
        "num_train_epochs": 2,
        "learning_rate": 2e-5,
        "output_dir": "trl_models/Qwen2.5-0.5B-DPO-Lora-Steer",
        "logging_steps": 100,
        "save_strategy": "no",
    },
)
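
For intuition about what this control optimizes, here is a minimal sketch of the standard DPO objective for a single preference pair: the negative log-sigmoid of the scaled difference between the policy's and the reference model's log-probability margins. It is written from the published formula rather than taken from the trainer's code. The value `beta=0.1` and the log-probabilities are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, plus the implicit rewards."""
    # Implicit rewards: scaled log-ratio between policy and reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    loss = math.log1p(math.exp(-margin))
    return loss, chosen_reward, rejected_reward


# Before any update the policy equals the reference, so the margin is zero.
initial_loss, _, _ = dpo_loss(-42.0, -45.0, -42.0, -45.0)
print(initial_loss)  # equals ln(2), about 0.693

# Hypothetical later state: the policy has raised the chosen answer's log-prob.
later_loss, chosen_reward, rejected_reward = dpo_loss(-39.0, -45.5, -42.0, -45.0)
print(later_loss, chosen_reward - rejected_reward)
```

With a zero margin the loss is `ln 2`, so a loss falling below roughly 0.69 indicates that the policy has started to prefer the chosen answers relative to the reference model.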

## Instantiating (and running) the benchmark

Given the controls, the benchmark can now be run on any control pipeline, i.e., a sequence of controls. The following benchmark compares the unsteered baseline behavior (no control) with few-shot steering and DPO (with LoRA).

In [None]:
import transformers
from aisteer360.evaluation.benchmark import Benchmark
transformers.logging.set_verbosity_error()

benchmark = Benchmark(
    use_case=commonsense_mcqa,
    base_model_name_or_path="Qwen/Qwen2.5-1.5B-Instruct",
    steering_pipelines={
        "baseline": [],  # no steering
        "few_shot": [few_shot],
        "dpo_lora": [dpo_lora],
    },
    gen_kwargs={
        "max_new_tokens": 300,
        "do_sample": True,
        "temperature": 0.7,
    },
    device_map="auto"
)

# run every pipeline and collect the evaluation profiles
profiles = benchmark.run()

Running pipeline: baseline...


done.
Running pipeline: few_shot...
done.
Running pipeline: dpo_lora...


Extracting prompt in train dataset: 100%|██████████| 4871/4871 [00:00<00:00, 17093.58 examples/s]
Applying chat template to train dataset: 100%|██████████| 4871/4871 [00:00<00:00, 19933.53 examples/s]
Tokenizing train dataset: 100%|██████████| 4871/4871 [00:01<00:00, 3318.00 examples/s]


{'loss': 0.6844, 'grad_norm': 1.3753544092178345, 'learning_rate': 1.9187192118226602e-05, 'rewards/chosen': 0.05939213186502457, 'rewards/rejected': 0.04117845371365547, 'rewards/accuracies': 0.6549999713897705, 'rewards/margins': 0.018213678151369095, 'logps/chosen': -42.31070327758789, 'logps/rejected': -45.085453033447266, 'logits/chosen': 0.8083691596984863, 'logits/rejected': 0.904447078704834, 'epoch': 0.08210180623973727}
{'loss': 0.649, 'grad_norm': 1.7281113862991333, 'learning_rate': 1.836617405582923e-05, 'rewards/chosen': 0.3236657977104187, 'rewards/rejected': 0.21876558661460876, 'rewards/accuracies': 0.6924999952316284, 'rewards/margins': 0.10490025579929352, 'logps/chosen': -39.682899475097656, 'logps/rejected': -43.24220657348633, 'logits/chosen': 0.7636573910713196, 'logits/rejected': 0.8838378190994263, 'epoch': 0.16420361247947454}
{'loss': 0.5743, 'grad_norm': 2.3698391914367676, 'learning_rate': 1.7545155993431858e-05, 'rewards/chosen': 0.599941074848175, 'reward
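
The `epoch` and `learning_rate` values in the log above can be related to the training arguments with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and a linear decay of the learning rate to zero without warmup. The observation that the logged rates match the schedule evaluated one step before the logging step is numerical only and is not a statement about the trainer's internals.

```python
import math

num_examples = 4871   # from the progress bars in the log
batch_size = 4        # per_device_train_batch_size
num_epochs = 2        # num_train_epochs
base_lr = 2e-5        # learning_rate
logging_steps = 100   # logging_steps

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs


def epoch_at(step):
    """Fractional epoch reached after the given number of optimizer steps."""
    return step / steps_per_epoch


def linear_lr(step):
    """Learning rate under an assumed linear decay from base_lr to zero."""
    return base_lr * (1 - step / total_steps)


for step in (logging_steps, 2 * logging_steps, 3 * logging_steps):
    print(step, epoch_at(step), linear_lr(step - 1))
```

Under these assumptions there are 1218 steps per epoch and 2436 steps in total, so the first log entry at step 100 falls at about 8% of the first epoch.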

In [9]:
benchmark.export(profiles, save_dir="./profiles/")

## Inspecting the profiles

Each control pipeline in the benchmark yields an evaluation profile. Each evaluation profile contains metric values as computed by the metrics passed in to the use case, in this case `MCQAAccuracy` and `MCQAPositionalBias`.

In [10]:
import json
print(json.dumps(profiles['baseline']['evaluations'], indent=2))

{
  "MCQAAccuracy": {
    "trial_mean": 0.5058,
    "trial_std": 0.4999913590612424,
    "question_mean": 0.446,
    "question_std": 0.4975732692947166
  },
  "MCQAPositionalBias": {
    "mean": 0.10532592592592592,
    "std": 0.031184296794208543
  }
}
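
The standard deviations in this output can be reproduced from the means if each score is treated as a binary outcome: 500 samples times 20 shuffling runs gives 10,000 trials, and there are 500 question-level scores. The sketch below checks this with the sample standard deviation (`ddof=1`). That the question-level score is binary (for example, a majority vote across runs) is an inference from the numbers, not something stated by the metric. The helper `bernoulli_sample_std` is hypothetical.

```python
import math


def bernoulli_sample_std(mean, n):
    """Sample standard deviation (ddof=1) of n binary outcomes with the given mean."""
    return math.sqrt(mean * (1.0 - mean) * n / (n - 1))


num_questions = 500   # num_samples
num_runs = 20         # num_shuffling_runs

# Baseline means copied from the output above.
trial_std = bernoulli_sample_std(0.5058, num_questions * num_runs)
question_std = bernoulli_sample_std(0.446, num_questions)

print(trial_std)     # compare with the reported trial_std
print(question_std)  # compare with the reported question_std
```

Because a binary score with a mean near 0.5 always has a standard deviation near 0.5, these `*_std` values say little about run-to-run variability; the means are the quantities to compare across pipelines.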


In [11]:
print(json.dumps(profiles['few_shot']['evaluations'], indent=2))

{
  "MCQAAccuracy": {
    "trial_mean": 0.7007,
    "trial_std": 0.4579743268441779,
    "question_mean": 0.716,
    "question_std": 0.45138841700470494
  },
  "MCQAPositionalBias": {
    "mean": 0.014360000000000001,
    "std": 0.03518742260266924
  }
}


In [12]:
print(json.dumps(profiles['dpo_lora']['evaluations'], indent=2))

{
  "MCQAAccuracy": {
    "trial_mean": 0.5352,
    "trial_std": 0.4987843608051961,
    "question_mean": 0.482,
    "question_std": 0.5001763216161011
  },
  "MCQAPositionalBias": {
    "mean": 0.07606851211072664,
    "std": 0.006019957412877974
  }
}


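`FewShot` (with 25 positive and 25 negative examples) yields the largest improvement over the baseline: trial-level accuracy rises from 0.506 to 0.701, and mean positional bias falls from 0.105 to 0.014. The `DPO` (with LoRA) control yields only a marginal improvement (trial-level accuracy of 0.535, mean positional bias of 0.076), likely because of the small steering dataset (roughly 5k examples).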
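
The comparison can also be made explicit with a few lines of arithmetic. The sketch below copies the reported `trial_mean` and positional-bias `mean` values into a plain dictionary (rather than reading them from `profiles`, whose full structure is not shown here) and computes each pipeline's change relative to the baseline. The helper `summarize` and the key names in `reported` are hypothetical.

```python
# Values copied from the evaluation outputs printed above.
reported = {
    "baseline": {"trial_mean": 0.5058, "positional_bias": 0.10532592592592592},
    "few_shot": {"trial_mean": 0.7007, "positional_bias": 0.014360000000000001},
    "dpo_lora": {"trial_mean": 0.5352, "positional_bias": 0.07606851211072664},
}


def summarize(results, reference="baseline"):
    """Accuracy gain and positional-bias change of each pipeline vs. the reference."""
    ref = results[reference]
    summary = {}
    for name, values in results.items():
        if name == reference:
            continue
        summary[name] = {
            "accuracy_gain": values["trial_mean"] - ref["trial_mean"],
            "bias_change": values["positional_bias"] - ref["positional_bias"],
        }
    return summary


summary = summarize(reported)
for name, deltas in summary.items():
    print(f"{name}: accuracy {deltas['accuracy_gain']:+.4f}, "
          f"positional bias {deltas['bias_change']:+.4f}")
```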