# Intro: In-Context Learning Experiment

reference repo: https://github.com/uds-lsv/llmft/tree/main

- Experiment: In-Context Learning
- Model: opt-125M/350M
- Datasets: RTE for in-domain and HANS for out-of-domain
- vary on num of shots: 2, 32, 128
- prompt format: GPT type
```plain
{text1} question: {text2} Yes or No?
answer: [Yes/No]
```

what does in-context learning (ICL) here mean:

- Instead of updateing the pre-trained model's weight, ICL solve tasks by conditioning on a sequence of demonstrations (i.e. input _x_ and its ground truth _y_ combined by specific pattern).

- ICL thus feeds the model a sequence of such demonstrations, followed by the test input (modified by applying the pattern transformation). The language model is then expected to predict the label of this final data point.

bash code for experiment: `bash $PROJECT_DIR/scripts/in_context/mnli/run_gpt3.sh rte 2 facebook/opt-30b 1 60000`

- Task name: "rte"
- Number of shots: 2
- Model: facebook/opt-30b
- GPU: 1
- Port: 60000

Other Hyperparameters (all same as the original paper):

- fixed context size: 2048 tokens

Experiment Process:

- in-domain: we measure indomain generalization by measuring accuracy on the validation set of each dataset. So in this experiment, the demonstrations are from RTE's training dataset. And the test dataset is RTE's test one.

- out-of-domain: we focus on generalization to challenge datasets, designed to test whether models adopt a particular heuristic, or make predictions based on spurious correlations during inference. So in this experiment, the demonstrations are also from RTE's training dataset. And And the test dataset is HANS validation dataset.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
%cd /content/drive/MyDrive/cs7643-group-project/notebooks

/content/drive/MyDrive/cs7643-group-project/notebooks


In [4]:
ls

few_shot_context_distillation_mnli.ipynb        [0m[01;34moffload_folder[0m/
few_shot_context_distillation_rte.ipynb         [01;34mresults[0m/
few_shot_context_distillation_rts.ipynb         vanilla_cola_baseline.ipynb
few_shot_ICL_rte_baseline_results_opt-125M.csv  [01;34mwandb[0m/
ICL_rte.ipynb


In [5]:
!pip install -q transformers accelerate bitsandbytes datasets

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m69.1/69.1 MB[0m [31m31.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m39.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m12.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m179.3/179.3 kB[0m [31m17.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m12.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.1/194.1 kB[0m [31m18.3 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompat

# Dependency and Config

In [11]:
import torch
from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, EvalPrediction
from datasets import load_dataset, ClassLabel
import logging
from torch.utils.data import DataLoader
from tqdm import tqdm
import numpy as np

import time
import pandas as pd

In [7]:
torch.cuda.empty_cache()

# for reproducibility
np.random.seed(42)

torch.manual_seed(42)

if torch.cuda.is_available():
  torch.cuda.manual_seed_all(42)

In [8]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

# Prep

In [None]:
# originally in task_utils.py
task_to_keys = {
    # labels are: 0 (entailment), 1 (contradiction)
    "rte": ("sentence1", "sentence2"),
    "mnli": ("premise", "hypothesis"),
    "mnli-original": ("premise", "hypothesis"),
    "mnli-mismatched": ("premise", "hypothesis"),
    "hans": ("premise", "hypothesis"),

    # labels are: 0 (not_duplicate), 1 (duplicate)
    "qqp": ("question1", "question2"),
    "paws-qqp": ("sentence1", "sentence2"),

    # labels are: 0 (not acceptable), 1 (acceptable)
    "cola": ("sentence", None),
    "cola-ood": ("sentence", None),
}

## Data Loading

In [None]:
def load_glue_datasets(task_name):
    """
    Get the datasets: specify a GLUE benchmark task (the dataset will be downloaded automatically from the datasets Hub).

    In distributed training, the load_dataset function guarantee that only one local process can concurrently download the dataset.

    originally in task_utils.py
    """
    if task_name is not None:
        if task_name == "mnli":
            # convert to binary format (remove neutral class)
            raw_datasets = load_dataset(
                "glue", task_name)

            raw_datasets = raw_datasets.filter(
                lambda example: example["label"] != 1)

            # change labels of contradiction examples from 2 to 1
            def change_label(example):
                example["label"] = 1 if example["label"] == 2 else example["label"]
                return example
            raw_datasets = raw_datasets.map(change_label)

            # change features to reflect the new labels
            features = raw_datasets["train"].features.copy()
            features["label"] = ClassLabel(
                num_classes=2, names=['entailment', 'contradiction'], id=None)
            raw_datasets = raw_datasets.cast(
                features)  # overwrite old features

        elif task_name == "mnli-original":
            # convert to binary format (merge neutral and contradiction class)
            raw_datasets = load_dataset(
                path="glue", name="mnli")

            # change labels of contradiction examples from 2 to 1
            def change_label(example):
                example["label"] = 1 if example["label"] == 2 else example["label"]
                return example
            raw_datasets = raw_datasets.map(change_label)

            # change features to reflect the new labels
            features = raw_datasets["train"].features.copy()
            features["label"] = ClassLabel(
                num_classes=2, names=['entailment', 'contradiction'], id=None)
            raw_datasets = raw_datasets.cast(
                features)  # overwrite old features

        else:
            # Downloading and loading a dataset from the hub.
            raw_datasets = load_dataset(
                "glue",
                task_name
            )

            if task_name == "qqp":
                # we subsample qqp already here because its really big
                # make sure we fix the seed here
                for split in raw_datasets.keys():
                    raw_datasets[split] = raw_datasets[split].select(np.random.choice(
                        np.arange(len(raw_datasets[split])), size=1000, replace=False
                    ))

    # Determine number of labels
    is_regression = task_name == "stsb"
    if not is_regression:
        label_list = raw_datasets["train"].features["label"].names
        num_labels = len(label_list)
    else:
        num_labels = 1

    return raw_datasets, label_list, num_labels, is_regression

In [None]:
def load_hans_dataset(heuristic=None, subcase=None, label=None):
    """
    heuristic = {lexical_overlap, subsequence, constituent}
    subcase = see HANS_SUBCASES
    label = {0 (entailment), 1 (contradiction)}

    originally in task_utils.py
    """
    subset = "hans"
    dataset = load_dataset(
        "hans", split="validation")

    # hans comes without indices, so we add them
    indices = list(range(len(dataset)))
    dataset = dataset.add_column(name="idx", column=indices)

    if heuristic is not None:  # filter dataset based on heuristic
        dataset = dataset.filter(
            lambda example: example["heuristic"] == heuristic)
        subset = f"{subset}-{heuristic}"

    if subcase is not None:  # filter dataset based on subcase
        dataset = dataset.filter(
            lambda example: example["subcase"] == subcase)
        subset = f"{subset}-{subcase}"

    if label is not None:  # filter dataset based on label
        dataset = dataset.filter(
            lambda example: example["label"] == label)
        subset = f"{subset}-{'entailment' if label == 0 else 'contradiction'}"

    return dataset, subset

## In-Context Leearnig Data Preprocess

In [None]:
def _select_subset_by_ids(dataset, indices):
  # originally in eval_utils.py
    subset = dataset.select(indices)
    return subset

In [None]:
def get_balanced_subsets(dataset):
  # originally in eval_utils.py
    subset_per_label = {}
    for label_idx, _ in enumerate(dataset.features["label"].names):
        subset_per_label[label_idx] = dataset.filter(
            lambda s: s["label"] == label_idx)
    return subset_per_label

In [None]:
def _select_random_subset(dataset, num_shots, balanced: bool, seed: int):
  # originally in eval_utils.py
    # fix seed
    np.random.seed(seed)

    if num_shots < 1:
        return [], []

    if balanced:
        assert num_shots % 2 == 0, "a balanced context requires at least one demonstartion per label"
        # select the same number of samples from every label
        indices = []  # we collect all indices here
        subset_per_label = get_balanced_subsets(dataset)

        for _, samples in subset_per_label.items():
            subset_indices = samples["idx"]
            # select num_shots // 2 samples
            subset_indices = np.random.choice(
                subset_indices, size=num_shots // 2, replace=False)
            indices += list(subset_indices)
        assert len(indices) == num_shots
    else:
        # just select a random subset of samples
        indices = np.random.choice(
            range(len(dataset)), size=num_shots, replace=False)

    # return _select_subset_by_ids(dataset, indices), indices
    return _select_subset_by_idx(dataset, indices), indices

In [None]:
def _select_subset_by_idx(dataset, indices):
  # originally in eval_utils.py
    dataset = dataset.filter(
        lambda s: s["idx"] in indices)
    return dataset

In [None]:
def create_few_shot_context(
    dataset_name,
    dataset,
    num_shots,
    pattern,
    label_to_tokens,
    separate_shots_by=" ",
    description="",
    target_prefix="",
    from_indices=None,
    balanced=False,
    shuffle=False,
    seed=123
):
  # originally in eval_utils.py
    assert pattern is not None
    assert label_to_tokens is not None

    # select samples from which the context will be constructed
    if from_indices is not None:
        demonstrations, indices = _select_subset_by_ids(dataset, from_indices)
    else:
        demonstrations, indices = _select_random_subset(
            dataset, num_shots, balanced, seed)

    if shuffle:
        if len(demonstrations) > 0:
            demonstrations = demonstrations.shuffle(seed)

    # create context
    context = "" if description == "" else f"{description}{separate_shots_by}"

    for sample in demonstrations:
        formated_sample = pattern.format(
            text1=sample[task_to_keys[dataset_name][0]],
            text2=sample[task_to_keys[dataset_name][1]
                         ] if task_to_keys[dataset_name][1] is not None else None
        )
        verbalized_label = label_to_tokens[sample["label"]]
        if verbalized_label.startswith("Ġ"):
            # we need to remove the leading whitespace from the target token in the context
            verbalized_label = verbalized_label[1:]

        elif verbalized_label.startswith("▁"):
            # we need to remove the leading whitespace from the target token in the context
            verbalized_label = verbalized_label[1:]

        context += f"{formated_sample}{target_prefix}{verbalized_label}{separate_shots_by}"

    return context, indices

In [None]:
def add_context_to_dataset(dataset_name, dataset, pattern, context):
    def _add_context(samples):
        result = {}
        modified_inputs = []
        key1, key2 = task_to_keys[dataset_name]

        for idx in range(len(samples[key1])):
            modified_input = f"{context}{pattern.format(text1=samples[key1][idx], text2=samples[key2][idx])}"
            modified_inputs.append(modified_input)

        result["modified_input"] = modified_inputs

        return result

    dataset = dataset.map(_add_context, batched=True, batch_size=42)

    return dataset


In [None]:
def preprocess_function(examples, tokenizer, pattern, target_tokens, dataset_name, max_length, target_prefix):
    """
    Formats inputs in GPT-3 style using a specific pattern
    Tokenizes the formatted inputs
    Adds labels for evaluation

    originally in eval.py
    """

    # Get the appropriate keys based on the task
    if dataset_name == "rte":
      text1 = examples["sentence1"]
      text2 = examples["sentence2"]
    elif dataset_name == "hans":
      text1 = examples["premise"]
      text2 = examples["hypothesis"]
    else:
      raise ValueError(f"Unsupported dataset: {dataset_name}")

    # Set GPT pattern
    id_to_target_token = {idx: t for idx, t in enumerate(target_tokens)}

    # Format examples
    pattern_examples  = [
        pattern.format(
            text1=text1[i],
            text2=text2[i]
        )
        for i in range(len(text1))
    ]
    args = (pattern_examples,)
    result = tokenizer(*args, padding="max_length",
                    max_length=max_length, truncation=True)

    # Get tokens
    result["input_tokens"] = [tokenizer.convert_ids_to_tokens(
        ids) for ids in result["input_ids"]]

    # Decode input
    result["input_text"] = [tokenizer.decode(
        ids) for ids in result["input_ids"]]

    # Replace labels by target tokens indices when using lm_head
    target_tokens_ids = tokenizer.convert_tokens_to_ids(target_tokens)
    result["label"] = [target_tokens_ids[l] for l in examples["label"]]
    result["label_text"] = [id_to_target_token[l] if l != -1 else "unlabeled"
                            for l in examples["label"]]

    return result

## Model Loading

In [None]:
def _load_model(model_name_config: str):
    config = AutoConfig.from_pretrained(
        f"facebook/{model_name_config}",
        num_labels=2,
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1
    )

    tokenizer = AutoTokenizer.from_pretrained(
        f"facebook/{model_name_config}"
    )

    model = AutoModelForSequenceClassification.from_pretrained(
        f"facebook/{model_name_config}",
        config=config,
    )

    return config, tokenizer, model

## Result Processing

In [None]:
def evaluate_model(model, dataset, tokenizer, task_name, num_shots, pattern, max_length, target_prefix):
    model.eval()
    all_predictions = []
    all_labels = []

    # Create DataLoader
    dataloader = torch.utils.data.DataLoader(
        dataset,
        num_shots=num_shots,
        shuffle=False
    )

    with torch.no_grad():
        for num_shots in dataloader:
            inputs = preprocess_function(num_shots, tokenizer, task_name, pattern, max_length, target_prefix)
            outputs = model(**inputs)
            predictions = outputs.logits.argmax(dim=-1)

            all_predictions.extend(predictions.cpu().numpy())
            all_labels.extend(inputs["labels"].cpu().numpy())

    # Calculate accuracy
    accuracy = sum(p == l for p, l in zip(all_predictions, all_labels)) / len(all_labels)
    return accuracy

In [None]:
# In both eval.py and ft.py, but with slightly different implementations
def compute_metrics(p: EvalPrediction, task_name):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.squeeze(preds) if is_regression else np.argmax(preds, axis=1)

    if task_name is not None:
        result = metric.compute(predictions=preds, references=p.label_ids)
        return result
    elif is_regression:
        return {"mse": ((preds - p.label_ids) ** 2).mean().item()}
    else:
        return {"accuracy": (preds == p.label_ids).astype(np.float32).mean().item()}

In [None]:
def _add_args_to_results(args, results):
    # Save results in a dataframe
    results["task_description"] = args.task_description if args.task_description is not None else " "
    results["pattern"] = args.pattern
    results["target_tokens"] = args.target_tokens
    results["num_shots"] = args.num_shots
    results["separate_shots_by"] = args.separate_shots_by
    results["balanced"] = args.balanced
    results["shuffle"] = args.shuffle
    results["target_prefix"] = args.target_prefix
    results["group"] = args.group

    return results

In [None]:
def _create_df(results):
    data = {k: [v] for k, v in results.items()}
    df = pd.DataFrame.from_dict(data)
    return df


# Run Experiment

In [None]:
# Configuration
seed = 42
model_name_config = "opt-125M"
task_name = "rte"
eval_task_name = "hans"

# set prompt pattern
pattern = "{text1} question: {text2} Yes or No?"
target_tokens = "Ä Yes,Ä No"
target_prefix = " answer: "

num_shots = [2, 32, 128]  # Number of few-shot examples

# Load datasets
raw_datasets, label_list, num_labels, is_regression = load_glue_datasets("rte")

# Loads HANS dataset as additional evaluation data

additional_evaluation_datasets = {}
for heuristic in ["lexical_overlap"]:
    # for heuristic in ["lexical_overlap", "subsequence", "constituent"]:
    # Load HANS subsets as additional validation data
    for label in [0, 1]:
        hans_subset, subset_name = load_hans_dataset(
            heuristic=heuristic, subcase=None, label=label)
        additional_evaluation_datasets[subset_name] = hans_subset

# Model Loading and Configuration:
config, tokenizer, model = _load_model(model_name_config)

# In-Context Learning Setup:
for n in num_shots:
  # Create prompt with examples of num_shots
  context, context_indices = create_few_shot_context(
      dataset_name="rte",
      dataset=raw_datasets["train"],
      num_shots=n,
      pattern=pattern,
      target_prefix=target_prefix,
      target_tokens=target_tokens,
      balanced=True,
      shuffle=True
  )

  # Get evaluation datasets
  eval_dataset = raw_datasets["validation"]
  eval_dataset = eval_dataset.map(preprocess_function, batched=True)
  for name, dataset in additional_evaluation_datasets.items():
      dataset = dataset.map(preprocess_function, batched=True)

  # Evaluation Process:
  total_steps = (len(raw_datasets) // 32) * 40
  training_args = TrainingArguments(
          output_dir="./results",
          overwrite_output_dir=True,
          num_train_epochs=40,
          per_device_train_batch_size=32,
          learning_rate=1e-5,
          weight_decay=0.0,
          save_steps=10_000,
          save_total_limit=2,
          warmup_steps=int(0.1 * total_steps),
          )

  # Initialize trainer
  trainer = Trainer(
            model = model,
            args=training_args,
            train_dataset = None,
            eval_dataset=None,
            compute_metrics = compute_metrics,
        )

  trainer.eval()

  # Run evaluation for each dataset
  eval_task_names = ["rte"] + list(additional_evaluation_datasets.keys())
  eval_datasets = [eval_dataset] + list(additional_evaluation_datasets.values())

  all_results = {}
  for task_name, dataset in zip(eval_task_names, eval_datasets):
      outputs = trainer.predict(
          dataset,
          metric_key_prefix=task_name
      )
      metrics = outputs.metrics
      all_results.update(metrics)

  # Results Processing and Saving:
    # Add experiment details to results
  all_results = _add_args_to_results(in_context_args, all_results)
  all_results["indices"] = contex_indices
  all_results["context"] = context
  all_results["data_seed"] = seed

  # Save to CSV
  df = _create_df(all_results)
  output_file = os.path.join(training_args.output_dir, f"{file_name}.csv")
  df.to_csv(output_file)

In [None]:
    dataset_name,
    dataset,
    num_shots,
    pattern,
    label_to_tokens,
    separate_shots_by=" ",
    description="",
    target_prefix="",
    from_indices=None,
    balanced=False,
    shuffle=False,
    seed=123

In [None]:
results

{2: {'rte_accuracy': 0.4729241877256318, 'hans_accuracy': 0.5},
 32: {'rte_accuracy': 0.4729241877256318, 'hans_accuracy': 0.5},
 128: {'rte_accuracy': 0.4729241877256318, 'hans_accuracy': 0.5}}

In [None]:
pd.DataFrame(results).to_csv(f"./few_shot_ICL_rte_baseline_results_{model_name_config}.csv")