# Intro: In-Context Learning Experiment

reference repo: https://github.com/uds-lsv/llmft/tree/main

- Experiment: In-Context Learning
- Model: opt-125M/350M
- Datasets: RTE for in-domain and HANS for out-of-domain
- vary on num of shots: 2, 32, 128
- prompt format: GPT type
```plain
{text1} question: {text2} Yes or No?
answer: [Yes/No]
```

what does in-context learning (ICL) here mean:

- Instead of updateing the pre-trained model's weight, ICL solve tasks by conditioning on a sequence of demonstrations (i.e. input _x_ and its ground truth _y_ combined by specific pattern).

- ICL thus feeds the model a sequence of such demonstrations, followed by the test input (modified by applying the pattern transformation). The language model is then expected to predict the label of this final data point.

bash code for experiment: `bash $PROJECT_DIR/scripts/in_context/mnli/run_gpt3.sh rte 2 facebook/opt-30b 1 60000`

- Task name: "rte"
- Number of shots: 2
- Model: facebook/opt-30b
- GPU: 1
- Port: 60000

Other Hyperparameters (all same as the original paper):

- fixed context size: 2048 tokens

Experiment Process:

- in-domain: we measure indomain generalization by measuring accuracy on the validation set of each dataset. So in this experiment, the demonstrations are from RTE's training dataset. And the test dataset is RTE's test one.

- out-of-domain: we focus on generalization to challenge datasets, designed to test whether models adopt a particular heuristic, or make predictions based on spurious correlations during inference. So in this experiment, the demonstrations are also from RTE's training dataset. And And the test dataset is HANS validation dataset.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
%cd /content/drive/MyDrive/cs7643-group-project/notebooks

/content/drive/MyDrive/cs7643-group-project/notebooks


In [3]:
ls

few_shot_context_distillation_mnli.ipynb        [0m[01;34moffload_folder[0m/
few_shot_context_distillation_rte.ipynb         [01;34mresults[0m/
few_shot_context_distillation_rts.ipynb         vanilla_cola_baseline.ipynb
few_shot_ICL_rte_baseline_results_opt-125M.csv  [01;34mwandb[0m/
ICL_rte.ipynb


In [4]:
!pip install -q transformers accelerate bitsandbytes datasets

In [5]:
!export CUDA_LAUNCH_BLOCKING=1

# Dependency and Config

In [6]:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

import torch
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, EvalPrediction
from datasets import load_dataset, ClassLabel
import logging
from torch.utils.data import DataLoader
from tqdm import tqdm
import numpy as np

import time
import pandas as pd

In [7]:
torch.cuda.empty_cache()

# for reproducibility
np.random.seed(42)

torch.manual_seed(42)

if torch.cuda.is_available():
  torch.cuda.manual_seed_all(42)

In [8]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

# Prep

In [9]:
# originally in task_utils.py
task_to_keys = {
    # labels are: 0 (entailment), 1 (contradiction)
    "rte": ("sentence1", "sentence2"),
    "mnli": ("premise", "hypothesis"),
    "mnli-original": ("premise", "hypothesis"),
    "mnli-mismatched": ("premise", "hypothesis"),
    "hans": ("premise", "hypothesis"),

    # labels are: 0 (not_duplicate), 1 (duplicate)
    "qqp": ("question1", "question2"),
    "paws-qqp": ("sentence1", "sentence2"),

    # labels are: 0 (not acceptable), 1 (acceptable)
    "cola": ("sentence", None),
    "cola-ood": ("sentence", None),
}

## In-Context Leearnig Data Preprocess

In [10]:
def _select_subset_by_ids(dataset, indices):
  # originally in eval_utils.py
    subset = dataset.select(indices)
    return subset

In [11]:
def get_balanced_subsets(dataset):
  # originally in eval_utils.py
    subset_per_label = {}
    for label_idx, _ in enumerate(dataset.features["label"].names):
        subset_per_label[label_idx] = dataset.filter(
            lambda s: s["label"] == label_idx)
    return subset_per_label

In [12]:
def _select_random_subset(dataset, num_shots, balanced: bool, seed: int):
  # originally in eval_utils.py
    # fix seed
    np.random.seed(seed)

    if num_shots < 1:
        return [], []

    if balanced:
        assert num_shots % 2 == 0, "a balanced context requires at least one demonstartion per label"
        # select the same number of samples from every label
        indices = []  # we collect all indices here
        subset_per_label = get_balanced_subsets(dataset)

        for _, samples in subset_per_label.items():
            subset_indices = samples["idx"]
            # select num_shots // 2 samples
            subset_indices = np.random.choice(
                subset_indices, size=num_shots // 2, replace=False)
            indices += list(subset_indices)
        assert len(indices) == num_shots
    else:
        # just select a random subset of samples
        indices = np.random.choice(
            range(len(dataset)), size=num_shots, replace=False)

    # return _select_subset_by_ids(dataset, indices), indices
    return _select_subset_by_idx(dataset, indices), indices

In [13]:
def _select_subset_by_idx(dataset, indices):
  # originally in eval_utils.py
    dataset = dataset.filter(
        lambda s: s["idx"] in indices)
    return dataset

In [14]:
def create_few_shot_context(
    dataset_name,
    dataset,
    num_shots,
    pattern,
    label_to_tokens,
    separate_shots_by=" ",
    description="",
    target_prefix="",
    from_indices=None,
    balanced=False,
    shuffle=False,
    seed=123
):
    assert pattern is not None
    assert label_to_tokens is not None

    # select samples from which the context will be constructed
    if from_indices is not None:
        demonstrations, indices = _select_subset_by_ids(dataset, from_indices)
    else:
        demonstrations, indices = _select_random_subset(
            dataset, num_shots, balanced, seed)

    if shuffle:
        if len(demonstrations) > 0:
            demonstrations = demonstrations.shuffle(seed)

    # create context
    context = "" if description == "" else f"{description}{separate_shots_by}"

    for sample in demonstrations:
        formated_sample = pattern.format(
            text1=sample[task_to_keys[dataset_name][0]],
            text2=sample[task_to_keys[dataset_name][1]
                         ] if task_to_keys[dataset_name][1] is not None else None
        )
        verbalized_label = label_to_tokens[sample["label"]]
        if verbalized_label.startswith("Ä "):
            # we need to remove the leading whitespace from the target token in the context
            verbalized_label = verbalized_label[1:]
        elif verbalized_label.startswith("â"):
            # we need to remove the leading whitespace from the target token in the context
            verbalized_label = verbalized_label[1:]

        context += f"{formated_sample}{target_prefix}{verbalized_label}{separate_shots_by}"

    return context, indices

In [15]:
def add_context_to_dataset(dataset_name, dataset, pattern, context):
    def _add_context(samples):
        result = {}
        modified_inputs = []
        key1, key2 = task_to_keys[dataset_name]

        for idx in range(len(samples[key1])):
            modified_input = f"{context}{pattern.format(text1=samples[key1][idx], text2=samples[key2][idx])}"
            modified_inputs.append(modified_input)

        result["modified_input"] = modified_inputs

        return result

    dataset = dataset.map(_add_context, batched=True, batch_size=8)

    return dataset


## Result Processing

In [16]:
def simple_accuracy(preds, labels):
    """
    Calculate simple accuracy metric.

    Args:
        preds (np.ndarray): Predictions array
        labels (np.ndarray): Ground truth labels array

    Returns:
        float: Accuracy score
    """
    return float((preds == labels).mean())

In [17]:
def compute_metrics(predictions, labels):
    """
    Compute evaluation metrics for model predictions.

    Args:
        predictions (List[int]): List of model predictions
        labels (List[int]): List of true labels

    Returns:
        dict: Dictionary containing the computed metrics
    """
    # Convert predictions and labels to numpy arrays if they aren't already
    predictions = np.array(predictions)
    labels = np.array(labels)

    # For RTE and HANS tasks, we use simple accuracy as the metric
    # This aligns with the GLUE metric implementation for these tasks
    accuracy = simple_accuracy(predictions, labels)

    return {
        "accuracy": accuracy
    }


# Run Experiment

In [18]:
# Configuration
seed = 42
model_name = "facebook/opt-125m"
task_name = "rte"

# set prompt pattern
pattern = "{text1} question: {text2} Yes or No?"
target_tokens = ["Ä Yes", "Ä No"] # Using Ä as special token
target_prefix = " answer: "
separate_shots_by = "\n\n"

num_shots = [2, 32, 128]  # Number of few-shot examples

# Load datasets
train_dataset = load_dataset("glue", task_name, split="train")
eval_datasets = {"rte": load_dataset("glue", "rte", split="validation"),
                 "hans": load_dataset("hans", split="validation")}

# Initialize model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.to(device)

# initialize results
results = {}


# In-Context Learning Setup:
for n in num_shots:
    print(f"---- start with num of shots: {n} ---")
    # Create prompt with examples of num_shots
    context, context_indices = create_few_shot_context(
        dataset_name=task_name,
        dataset=train_dataset,
        num_shots=n,
        pattern=pattern,
        label_to_tokens={0: "No", 1: "Yes"},
        separate_shots_by=separate_shots_by,
        target_prefix=target_prefix,
        balanced=True,
        shuffle=True
    )

    # Add context to evaluation dataset
    for eval_task_name, eval_dataset in eval_datasets.items():
      print("eval task name: ", eval_task_name)
      eval_dataset_with_context = add_context_to_dataset(
          dataset_name=eval_task_name,
          dataset=eval_dataset,
          pattern=pattern,
          context=context
      )

      # Evaluation loop
      model.eval()
      predictions = []
      labels = []

      with torch.no_grad():
          for example in eval_dataset_with_context:
              # Tokenize input
              inputs = tokenizer(
                  example["modified_input"] + target_prefix,
                  return_tensors="pt",
                  truncation=True,
                  max_length=1024
              ).to(device)

              # Get model predictions
              outputs = model.generate(
                  **inputs,
                  max_new_tokens=5,
                  num_return_sequences=1,
                  pad_token_id=tokenizer.eos_token_id
              )

              # Decode prediction
              pred_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
              pred = 1 if "Yes" in pred_text else 0

              predictions.append(pred)
              labels.append(example["label"])

      # Compute metrics
      metrics = compute_metrics(predictions, labels)

      # Initialize results structure if needed
      if eval_task_name not in results:
          results[eval_task_name] = {}
      if n not in results[eval_task_name]:
          results[eval_task_name][n] = []

      results[eval_task_name][n].append(metrics)

      print(f"Num shots: {n}")  # Changed to n
      print(f"Accuracy: {metrics['accuracy']}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Using the latest cached version of the dataset since glue couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'rte' at /root/.cache/huggingface/datasets/glue/rte/0.0.0/bcdcba79d07bc864c1c254ccfcedcce55bcc9a8c (last modified on Wed Dec 11 21:31:48 2024).


---- start with num of shots: 2 ---
eval task name:  rte
Num shots: 2
Accuracy: 0.4729241877256318
eval task name:  hans
Num shots: 2
Accuracy: 0.5
---- start with num of shots: 32 ---
eval task name:  rte
Num shots: 32
Accuracy: 0.4729241877256318
eval task name:  hans


Map:   0%|          | 0/30000 [00:00<?, ? examples/s]

Num shots: 32
Accuracy: 0.5
---- start with num of shots: 128 ---


Filter:   0%|          | 0/2490 [00:00<?, ? examples/s]

eval task name:  rte


Map:   0%|          | 0/277 [00:00<?, ? examples/s]

Num shots: 128
Accuracy: 0.4729241877256318
eval task name:  hans


Map:   0%|          | 0/30000 [00:00<?, ? examples/s]

KeyboardInterrupt: 

In [None]:
results

In [None]:
pd.DataFrame(results).to_csv(f"./few_shot_ICL_rte_baseline_results_{model_name}.csv")