# Benchmarking Pythia $160\text{M}$ pre-trained on The Pile vs. Pythia $160\text{M}$ trained on MiniPile

Objectives:
- [x] Prepare (Download) two models - **Pythia $160M$ Untrained** and **Pythia $160M$ fully Pile-trained**
- [x] Load MiniPile Dataset from disk
- [ ] Train **Pythia $160M$ Untrained** on MiniPile (according to the MiniPile paper) *and save the model*
- [ ] Evaluate the performance of **Pythia $160M$ Pile-trained** and the MiniPile-trained **Pythia $160M$** on MMLU and ARC benchmarks (benchmarks applicable to decoder-only models)

In [None]:
! pip install transformers datasets torch accelerate evaluate

In [None]:
import os
import json
import torch
import evaluate
import numpy as np
import transformers
from pathlib import Path
from datasets import load_dataset
from huggingface_hub import snapshot_download
from transformers import AutoModelForSequenceClassification, pipeline, EvalPrediction
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

In [2]:
base_dir = "/mnt/data"

---

## Download Pythia $160\text{M}$ Untrained and Pythia $160\text{M}$ Pile-Trained

In [3]:
def download_model(down_dir: str, target_folder: str, cache_folder: str, repo_id: str, branch: str = "main") -> None:
    down_dir = Path(down_dir)
    target_dir = down_dir / target_folder
    cache_dir = down_dir / cache_folder

    os.makedirs(target_dir, exist_ok=True)
    os.makedirs(cache_dir, exist_ok=True)

    print(f"Downloading {repo_id}/{branch}...")

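    max_attempts = 5  # Bounded retries, so a persistent error (e.g. a wrong repo id) cannot loop forever
    for attempt in range(1, max_attempts + 1):
        try:
            snapshot_download(
                repo_id,
                repo_type="model",
                revision=branch,
                cache_dir=str(cache_dir),
                local_dir=str(target_dir)
            )
            break
        except Exception as e:
            print(f"Download attempt {attempt}/{max_attempts} failed: {e}")
            if attempt == max_attempts:
                raise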

In [4]:
download_model(down_dir=base_dir, target_folder="pythia160m_dedup_untrained", 
               cache_folder="pythia160m_dedup_untrained_Cache",
               repo_id="EleutherAI/pythia-160m-deduped", branch="step0")

Downloading EleutherAI/pythia-160m-deduped/step0...



In [5]:
# https://huggingface.co/EleutherAI/pythia-160m/blob/main/README.md states:
# "[...] final step 143000 corresponds exactly to the model checkpoint on the main branch of each model."
download_model(down_dir=base_dir, target_folder="pythia160m_dedup_pile", 
               cache_folder="pythia160m_dedup_pile_Cache",
               repo_id="EleutherAI/pythia-160m-deduped", branch="main")

Downloading EleutherAI/pythia-160m-deduped/main...



---

## Load MiniPile Dataset from Disk

We expect the MiniPile dataset to already have been downloaded to disk at an earlier point.<br>
The logic for this can be found in the `01_get_piles` notebook.

In [None]:
# Loading minipile from the local directory 
# https://stackoverflow.com/questions/77020278/how-to-load-a-huggingface-dataset-from-local-path
# https://github.com/MK2112/mobileYOLOv3/blob/main/mobileyolov3-cocotext.ipynb
# Split is named exactly like with the original dataset https://huggingface.co/datasets/JeanKaddour/minipile

base_path = Path(base_dir)

print('Loading MiniPile train + val datasets...')
minipile_train = load_dataset("parquet",
                              data_files={
                                  "train": str(base_path / "MiniPile" / "data" / "train-*.parquet"),
                                  "validation": str(base_path / "MiniPile" / "data" / "validation-*.parquet"),
                                  "test": str(base_path / "MiniPile" / "data" / "test-*.parquet")
                              },
                              cache_dir=str(base_path / "MiniPile_Cache"),
                              split="train")

minipile_val = load_dataset("parquet",
                            data_files={
                                "train": str(base_path / "MiniPile" / "data" / "train-*.parquet"),
                                "validation": str(base_path / "MiniPile" / "data" / "validation-*.parquet"),
                                "test": str(base_path / "MiniPile" / "data" / "test-*.parquet")
                            },
                            cache_dir=str(base_path / "MiniPile_Cache"),
                            split="validation")

---

## Hyperparameters for Pythia $160\text{M}$ Untrained on MiniPile

**Training Parameters for $160M$ for The Pile Deduplicated (*not* MiniPile)**<br>
See [Pythia Paper](https://arxiv.org/abs/2304.01373) (p. 22) and [Pythia GitHub](https://github.com/EleutherAI/pythia/blob/main/models/160M/pythia-160m-deduped.yml).
- Each model gets exposed to $299,892,736,000 \approx 300B$ tokens through training ($\approx 1.5$ epochs on The Pile)
- Batch size of $1024$ samples
- Sequence length of $2048$
- Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.95$, $\epsilon = 1 \times 10^{-8}$
- Learning rates vary by model size:
    - $70M$ model:  $10.0 \times 10^{-4}$
    - $160M$ model: $6.0 \times 10^{-4}$
    - $410M$ model: $3.0 \times 10^{-4}$
    - $1.0B$ model: $3.0 \times 10^{-4}$
    - $1.4B$ model: $2.0 \times 10^{-4}$
    - $2.8B$ model: $1.6 \times 10^{-4}$
    - $6.9B$ model: $1.2 \times 10^{-4}$
    - $12B$ model:  $1.2 \times 10^{-4}$
- train-iters $143000$
- lr-decay-iters $143000$
- lr-decay-style $\text{cosine}$
- lr-warmup $0.01$
- weight-decay $0.01$
- gradient-clipping $1.0$
- lr-min $0.1 \times \text{optimizer.params.lr}$ (which isn't in the paper)
- synchronize-each-layer $\text{True}$ (i.e. gradients across all GPUs after each layer synced)
- LR Scheduling: Decays to a minimum of $0.1\times$ the maximum learning rate for all models
- (Tokenizer is loaded as the same as for GPT-NeoX-20B)
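
As a quick sanity check, the token count quoted above follows directly from the batch size, the sequence length and the number of training iterations. A minimal sketch (the variable names are my own, chosen only for illustration):

```python
# Tokens seen during Pythia training = batch size x sequence length x train-iters
batch_size = 1024       # samples per optimizer step
seq_len = 2048          # tokens per sample
train_iters = 143_000   # optimizer steps

tokens_per_step = batch_size * seq_len
total_tokens = tokens_per_step * train_iters

print(f"Tokens per step: {tokens_per_step:,}")
print(f"Total tokens:    {total_tokens:,}")
```

The product reproduces the $299,892,736,000$ tokens listed in the first bullet.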

**Training Parameters for $160M$ for MiniPile**<br>
See [MiniPile paper](https://arxiv.org/abs/2304.08442)
- $1M/500/10k$ training/validation/test examples
    - Vocab size: $32309614$
    - Median document length: $294$
    - Longest document length: $929633$

**BERT Training Parameters for MiniPile**
- Adam, $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 1 \times 10^{-12}$
- weight-decay $0.001$
- One cycle policy with peak learning rate of $1 \times 10^{-3}$
- gradient-clipping $0.5$
- Progressive batch size from $128$ to $4096$ with a linear increase over the course of training up to $300k$ steps, no warmup
- $800k$ total training steps
- weight averaging of the $k = 5$ latest checkpoints and $1k$ steps distance between them

**T5 Training Parameters for MiniPile**
- AdamW, matrix-wise LR scaling by its root mean square (RMS), no weight decay
- base learning rate $0.02$
- cosine schedule with final of $1 \times 10^{-5}$
- gradient-clipping $1.0$
- batch size $288$
- $10k$ warmup steps, $65536$ total training steps
- weight averaging of the $k = 5$ latest checkpoints and $1k$ steps distance between them (akin to BERT)

These training parameters are a good start, but they can serve as rough guidance at most: they were applied to an encoder-only model (BERT) and an encoder-decoder model (T5), not to a decoder-only model like Pythia. If possible, one should therefore look for decoder-only models trained solely on MiniPile as a more accurate guide for our own approach with Pythia.

Luckily there exists a [GPT NeoX 122M MiniPile](https://huggingface.co/euclaise/gpt-neox-122m-minipile-digits) model that can be reverse-engineered for our purposes.

In [None]:
# Load the training arguments from the MiniPile-trained decoder-only reference model GPT-NeoX-122M:
# https://huggingface.co/euclaise/gpt-neox-122m-minipile-digits

# Newer versions fail for missing attributes, 4.30.0 is documented to have been used
if str(transformers.__version__) == "4.30.0":
    training_args = torch.load(base_path / 'training_args_gptNEO122m.bin', weights_only=False)
    output_file = 'train_args_gptNEO122m_minipile.txt'
    try: 
        with open(output_file, 'w') as f:
            f.write("TrainingArguments attributes:\n")
            for attr in dir(training_args):
                if hasattr(training_args, attr) and not attr.startswith('_'):
                    value = getattr(training_args, attr)
                    f.write(f"- {attr}: {value}\n")
    except NameError as _:
        pass # Fully ignore NameError, appears every time
else:
    print('Skipped for version mismatch.')

The GPT-NeoX model card is a bit misleading: it states that this model was trained exclusively on MiniPile, but the tiny learning rate of $5 \times 10^{-6}$ with no weight decay suggests fine-tuning rather than training from scratch.

I did this mostly to get a feeling for how much the encoder-based model params deviate from the decoder-based model params.<br>
I interpret the results as not too far off, e.g. we use the exact same learning rate and optimizer.

This implies that the training params on The Pile for Pythia $160M$ are a good starting point and we can scale these to accommodate the MiniPile dataset size and expect appropriate training effects.

Core parameters are however not directly transferable: `train-iters` and therefore also `lr-decay-iters`.<br>
For The Pile Deduplicated this was $143000$, but we have to scale it to the MiniPile dataset size. The number of tokens processed by the model is crucial for the training process: if it is not adjusted properly, the model could overfit and would no longer accurately reflect how much dataset knowledge it can retain.

In other words, overshooting distorts dataset knowledge, while undershooting leads to underfitting and insufficient representation of the dataset.

In [None]:
# I use the byte sizes as proxy for the number of tokens, as both datasets will get tokenized with the same tokenizer
minipile_train_bytes = 5906108510 # see https://huggingface.co/datasets/JeanKaddour/minipile/blob/main/README.md
pile_train_bytes = 824546807506   # see https://huggingface.co/datasets/EleutherAI/the_pile_deduplicated/blob/main/dataset_infos.json
pile_effective_epochs = 1.5       # this many epochs are actually trained in the original model (calculation isn't affected, training params below are)

scale_factor = (pile_train_bytes * pile_effective_epochs) / (minipile_train_bytes * pile_effective_epochs)
print(f"Byte-based scale factor: {scale_factor:10.6f}x")
print(f"MiniPile (scaled) Train-Iters/LR-Decay-Iters: {143000 / scale_factor:.3f} ~ {round(143000 / scale_factor)}")

At this point the $1024$ training iterations may seem awkwardly small.<br>
But, to reiterate, we strictly scaled the iterations down according to the difference in dataset size.

While this may seem horrible in most other cases, as we thoroughly neuter exposure to data, this scale-correct limiting and overall lower exposure is exactly what we need here to operate relative to the original Pythia training. After all, the goal is to compare knowledge retention and generalization capabilities achievable on `The Pile Deduplicated` vs. the 'distilled' `MiniPile` under size-appropriate, similar conditions. Therefore, scaling the `train-iters` and therefore also `lr-decay-iters` using byte sizes as a proxy is actually appropriate here.
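
To get a feeling for what $1024$ iterations mean in practice, the step count can be translated back into samples and tokens. This is only a rough sketch under two assumptions of mine: MiniPile has about $1M$ training examples, and each example becomes exactly one sequence of $2048$ tokens (truncated or padded), so the token count is an upper bound that includes padding.

```python
# Rough exposure estimate for the scaled-down MiniPile run
batch_size = 1024                      # samples per optimizer step (Pythia)
seq_len = 2048                         # tokens per sample (Pythia)
minipile_train_iters = 1024            # scaled train-iters from above
minipile_train_examples = 1_000_000    # assumption: the "1M" from the MiniPile paper

samples_seen = batch_size * minipile_train_iters
max_tokens_seen = samples_seen * seq_len   # upper bound, padding counts as tokens here
passes_over_train_set = samples_seen / minipile_train_examples

print(f"Samples seen:       {samples_seen:,}")
print(f"Max. tokens seen:   {max_tokens_seen:,}")
print(f"Passes over train:  {passes_over_train_set:.3f}")
```

Under these assumptions the run sees each training example roughly once, which is worth keeping in mind when comparing against the $\approx 1.5$ epochs quoted for The Pile.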

We can now lay out the complete parameters:<br>
With the three approach descriptions retrieved, we can take a more educated guess at the training params for Pythia $160M$ on MiniPile:

- Adam optimizer (GPT NeoX and T5-Base MiniPile suggest the 'generally more stable' AdamW, but Pythia uses Adam so we keep it most similar)
    - $\beta_1 = 0.9$, $\beta_2 = 0.95$, (Pythia)
    - $\epsilon = 1 \times 10^{-8}$ (GPT NeoX and Pythia)
    - learning rate $6 \times 10^{-4}$ (Pythia)
    - lr-schedule $\text{cosine annealing}$ (Pythia)
    - lr-warmup $0.01$ of total steps (Pythia)
    - lr-min $0.1 \times \text{lr}$ (Pythia)
    - weight-decay $1 \times 10^{-2}$ (Pythia)
- gradient-clipping $1.0$ (Pythia)
- batch size $1024$ (Pythia, probably grad accum needed, expect multi-GPU)
- sequence length $2048$ (Pythia)
- **train-iters: $1024$ (MiniPile-specific)**
- **lr-decay-iters: same as train-iters (MiniPile-specific)**
- (won't do mixed precision for sake of most similar training conditions to Pile-trained Pythia)
- (won't do weight averaging)
- **Same GPT-NeoX-20B tokenizer as for Pythia-Pile**
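
The learning-rate entries in this list can be made concrete with a small sketch of the intended schedule: linear warmup over the first $1\%$ of steps, then cosine decay down to $0.1 \times$ the peak rate. The function name and its signature are hypothetical and only illustrate the shape. This is not Pythia's actual scheduler code. Note that a plain cosine schedule decays towards zero, so the floor has to be handled explicitly.

```python
import math

def lr_at_step(step, max_lr=6e-4, total_steps=1024, warmup_frac=0.01, min_lr_ratio=0.1):
    """Hypothetical helper: warmup + cosine decay with a lower bound on the learning rate."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        # Linear warmup from max_lr / warmup_steps up to max_lr
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase that has elapsed, clipped to [0, 1]
    progress = min((step - warmup_steps) / max(1, total_steps - warmup_steps), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = min_lr_ratio * max_lr
    return min_lr + (max_lr - min_lr) * cosine

for s in (0, 9, 10, 512, 1024):
    print(f"step {s:4d}: lr = {lr_at_step(s):.2e}")
```

With the defaults, warmup covers $10$ steps, the peak of $6 \times 10^{-4}$ is reached right after warmup, and the final rate is $6 \times 10^{-5}$.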

We can start training Pythia $160\text{M}$ on MiniPile.

---

## Train Pythia $160\text{M}$ Untrained on MiniPile

In [None]:
# Load the untrained Pythia 160M tokenizer and model
# https://stackoverflow.com/questions/64001128/load-a-pre-trained-model-from-disk-with-huggingface-transformers
tokenizer = AutoTokenizer.from_pretrained(base_path / "pythia160m_dedup_untrained", use_fast=True, local_files_only=True)
empty_model = AutoModelForCausalLM.from_pretrained(base_path / "pythia160m_dedup_untrained", local_files_only=True)

In [None]:
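# The tokenizer may ship without a padding token, which padding="max_length" requires
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize(example): # seq_len = max_length = 2048 (always)
    enc = tokenizer(example["text"], truncation=True, padding="max_length", max_length=2048)
    # Causal LM training needs labels; padded positions get -100 so the loss ignores them
    enc["labels"] = [[tok if keep else -100 for tok, keep in zip(ids, mask)]
                     for ids, mask in zip(enc["input_ids"], enc["attention_mask"])]
    return enc

columns = ["input_ids", "attention_mask", "labels"] # new fields from tokenizing
minipile_train_tokenized = minipile_train.map(tokenize, batched=True)
minipile_train_tokenized.set_format(type="torch", columns=columns)

# Not really needed, but we have it, might as well make it serve as a reference for investigation of the model's performance
minipile_val_tokenized = minipile_val.map(tokenize, batched=True)
minipile_val_tokenized.set_format(type="torch", columns=columns)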

In [None]:
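# Only valid under torchrun, which sets RANK/WORLD_SIZE; a plain notebook kernel lacks them
if (torch.cuda.device_count() > 1 and "RANK" in os.environ
        and not torch.distributed.is_initialized()):
    torch.distributed.init_process_group("nccl")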

In [None]:
output_dir = str(base_path / "pythia160m_minipile_trained")
log_dir = str(base_path / "160m_minipile_logs")
os.makedirs(output_dir, exist_ok=True)
os.makedirs(log_dir, exist_ok=True)

# https://huggingface.co/docs/transformers/v4.46.0/en/main_classes/trainer#transformers.TrainingArguments
training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    num_train_epochs=1.5,            # Ignored while max_steps > 0; kept only to document the ~1.5 epochs used on The Pile
    per_device_train_batch_size=8,   # 8 samples x 128 accumulation steps = 1024 per device
    per_device_eval_batch_size=8,    # Same as training batch size
    gradient_accumulation_steps=128, # Effective batch size of 1024 on a single GPU; reduce proportionally on multiple GPUs
    learning_rate=6e-4,              # Default Pythia 160M
    weight_decay=0.01,               # Default Pythia 160M
    max_steps=1024,                  # Adjusted for MiniPile
    lr_scheduler_type="cosine",      # As per Pythia 160M paper
    warmup_steps=int(0.01 * 1024),   # 1% of total steps for warmup
    logging_dir=log_dir,
    logging_steps=10,
    evaluation_strategy="steps",
    eval_steps=100,     # Frequency for evaluation during training
    save_steps=1024,    # Save at the end of training
    save_total_limit=1, # Only keep the most recent checkpoint
    fp16=False,         # Not using mixed precision for comparable conditions
    report_to="none",   # Noting this for later iterations, maybe set this as "wandb", "tensorboard" or smth
    ddp_find_unused_parameters=False, # see https://discuss.pytorch.org/t/how-to-change-ddp-parameter-find-unused-parameters-true-to-false-during-training/130763
)

# This only moves the model to the default device; Trainer handles multi-GPU placement itself
device = "cuda" if torch.cuda.is_available() else "cpu"
empty_model = empty_model.to(device)
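
The effective batch size is the product of the per-device batch size, the gradient accumulation steps and the number of GPUs. A minimal sketch of how the accumulation steps could be derived for a given GPU count follows. The helper `grad_accum_steps` is hypothetical and not part of `transformers`.

```python
def grad_accum_steps(global_batch: int, per_device_batch: int, num_gpus: int) -> int:
    """Hypothetical helper: accumulation steps needed to reach a target global batch size."""
    samples_per_micro_step = per_device_batch * max(1, num_gpus)
    if global_batch % samples_per_micro_step != 0:
        raise ValueError(
            f"Global batch {global_batch} is not divisible by "
            f"{per_device_batch} samples x {max(1, num_gpus)} devices"
        )
    return global_batch // samples_per_micro_step

print(grad_accum_steps(1024, 8, 1))  # single GPU, as configured above
print(grad_accum_steps(1024, 8, 4))  # four GPUs under DDP
```

With a per-device batch size of $8$, one GPU needs $128$ accumulation steps and four GPUs need $32$ to keep the global batch size at $1024$.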

In [None]:
# Train Pythia 160M Untrained on MiniPile
# https://huggingface.co/docs/transformers/v4.46.0/en/main_classes/trainer
trainer = Trainer(model=empty_model,
                  args=training_args,
                  train_dataset=minipile_train_tokenized,
                  eval_dataset=minipile_val_tokenized)

trainer.train()  # TODO: Export this to a script, then run as `torchrun --nproc_per_node=>>NUM GPUs<< <script_something_160m>.py`

# Why is this a two-step process?!
trainer.save_model(str(base_path / "pythia160m_minipile_trained")) # This saves the model weights
tokenizer.save_pretrained(str(base_path / "pythia160m_minipile_trained")) # This saves the tokenizer (don't know if needed, better safe than sorry)

---

## Evaluate Pythia $160\text{M}$ MiniPile vs. Pythia $160\text{M}$ Pile-Trained on MMLU and ARC Benchmarks

In [13]:
def download_dataset(down_dir: str, target_folder: str, cache_folder: str, repo_id: str) -> None:
    down_dir = Path(down_dir)
    target_dir = down_dir / target_folder
    cache_dir = down_dir / cache_folder

    os.makedirs(target_dir, exist_ok=True)
    os.makedirs(cache_dir, exist_ok=True)

    print(f"Downloading {repo_id}...")

    # I tried fiddling with the os.environs, cause I wanted to use the load_dataset function
    # but we actually don't need that, snapshot_download suffices fully
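    max_attempts = 5  # Bounded retries; failures are reported instead of silently swallowed
    for attempt in range(1, max_attempts + 1):
        try:
            snapshot_download(repo_id, repo_type="dataset", cache_dir=str(cache_dir), local_dir=str(target_dir))
            break
        except Exception as e:
            print(f"Download attempt {attempt}/{max_attempts} failed: {e}")
            if attempt == max_attempts:
                raise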

In [14]:
download_dataset(down_dir=base_dir, target_folder="MMLU", cache_folder="MMLU_Cache",
                 repo_id="cais/mmlu")

Downloading cais/mmlu...



In [16]:
download_dataset(down_dir=base_dir, target_folder="ARC", cache_folder="ARC_Cache",
                 repo_id="allenai/ai2_arc")

Downloading allenai/ai2_arc...



In [None]:
# Must match the directory used by trainer.save_model() above
pythia_minipile = AutoModelForCausalLM.from_pretrained(base_path / "pythia160m_minipile_trained", local_files_only=True)
pythia_pile = AutoModelForCausalLM.from_pretrained(base_path / "pythia160m_dedup_pile", local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained(base_path / "pythia160m_dedup_untrained", use_fast=True, local_files_only=True)

pythia_minipile.to(device)
pythia_pile.to(device)

# References https://github.com/hendrycks/test/blob/master/evaluate.py (TODO: References maybe too strongly. Check this thoroughly.)
# {'answer': 1, 'choices': ['0', '4', '2', '6'], 'question': 'Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.', 'subject': 'abstract_algebra'}
def choice_log_probs_mmlu(model, tokenizer, question, options):
    # Score each option by the summed log probability of its tokens, conditioned on the question
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids.to(device)
    n_prompt = prompt_ids.shape[1]
    log_probs = []
    for choice in options:
        choice_ids = tokenizer(f" {choice}", return_tensors="pt").input_ids.to(device)
        input_ids = torch.cat([prompt_ids, choice_ids], dim=1)
        with torch.no_grad():
            logits = model(input_ids=input_ids).logits
        # Logits at position t predict token t+1, so shift by one to align with the choice tokens
        token_log_probs = torch.log_softmax(logits[0, n_prompt - 1:-1, :], dim=-1)
        choice_log_prob = token_log_probs.gather(1, choice_ids[0].unsqueeze(1)).sum().item()
        log_probs.append(choice_log_prob)
    best_choice_idx = int(np.argmax(log_probs)) # Choose answer with highest log prob
    return best_choice_idx

def bench_mmlu(model, tokenizer, dataset):
    correct = 0
    total = 0
    for example in dataset:
        question = example["question"]
        options = example["choices"]
        correct_answer = example["answer"]
        predicted_choice_idx = choice_log_probs_mmlu(model, tokenizer, question, options)
        # MMLU stores the answer as an integer index into `choices`, so compare indices directly
        if predicted_choice_idx == int(correct_answer):
            correct += 1
        total += 1
    accuracy = correct / total if total > 0 else 0
    return accuracy
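
Answer options usually span several tokens, so a common way to score them is to sum the per-token log probabilities of each option, optionally divided by the number of tokens. The toy sketch below uses made-up per-token probabilities instead of a real model and only illustrates how that normalization can change the selected option. The function names are hypothetical.

```python
import math

def score_choice(token_log_probs, normalize=False):
    """Sum of per-token log probs, optionally averaged over the token count."""
    total = sum(token_log_probs)
    return total / len(token_log_probs) if normalize else total

def pick_choice(all_token_log_probs, normalize=False):
    """Index of the option with the highest (optionally normalized) score."""
    scores = [score_choice(lp, normalize) for lp in all_token_log_probs]
    return max(range(len(scores)), key=scores.__getitem__)

# Made-up numbers: option 0 is one token with p=0.2, option 1 is three tokens with p=0.5 each
toy_options = [
    [math.log(0.2)],
    [math.log(0.5), math.log(0.5), math.log(0.5)],
]

print(pick_choice(toy_options, normalize=False))  # raw sum favors the short option
print(pick_choice(toy_options, normalize=True))   # per-token average favors the long option
```

The raw sum penalizes longer options because every extra token adds a negative term, which is why reporting both variants can be informative.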

In [None]:
# Structuring of ARC and MMLU seem fairly associable; try to use a similar log prob approach
# {'id': 'Mercury_7175875', 'question': 'An astronomer observes that a planet rotates faster after a meteorite impact. Which is the most likely effect of this increase in rotation?', 'choices': {'text': ['Planetary density will decrease.', 'Planetary years will become longer.', 'Planetary days will become shorter.', 'Planetary gravity will become stronger.'], 'label': ['A', 'B', 'C', 'D']}, 'answerKey': 'C'}
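def choice_log_probs_arc(model, tokenizer, question, options):
    # Score each option by the summed log probability of its tokens, conditioned on the question
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids.to(device)
    n_prompt = prompt_ids.shape[1]
    log_probs = []
    for choice in options:
        choice_ids = tokenizer(f" {choice}", return_tensors="pt").input_ids.to(device)
        input_ids = torch.cat([prompt_ids, choice_ids], dim=1)
        with torch.no_grad():
            logits = model(input_ids=input_ids).logits
        # Logits at position t predict token t+1, so shift by one to align with the choice tokens
        token_log_probs = torch.log_softmax(logits[0, n_prompt - 1:-1, :], dim=-1)
        choice_log_prob = token_log_probs.gather(1, choice_ids[0].unsqueeze(1)).sum().item()
        log_probs.append(choice_log_prob)
    best_choice_idx = int(np.argmax(log_probs)) # Choose answer with highest log prob
    return best_choice_idx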

def bench_arc_challenge(model, tokenizer, dataset):
    correct = 0
    total = 0
    for example in dataset:
        question = example["question"]
        options = example["choices"]["text"]
        correct_answer = example["answerKey"]
        
        predicted_choice_idx = choice_log_probs_arc(model, tokenizer, question, options)
        # Read the label from the record itself: the option count varies and some items use '1'-'4' instead of 'A'-'D'
        predicted_answer = example["choices"]["label"][predicted_choice_idx]
        
        if predicted_answer == correct_answer:
            correct += 1
        total += 1
    
    accuracy = correct / total if total > 0 else 0
    return accuracy
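
Mapping the predicted index to a label via the record's own `label` list avoids assuming that labels are always the letters 'A' to 'D'. A minimal sketch follows. The first record is abbreviated from the sample shown above, the second one is a made-up record with numeric labels, and the helper name is hypothetical.

```python
def predicted_label(choices: dict, predicted_idx: int) -> str:
    """Hypothetical helper: look up the label that belongs to the predicted option index."""
    return choices["label"][predicted_idx]

letter_record = {
    "choices": {"text": ["Planetary density will decrease.",
                         "Planetary years will become longer.",
                         "Planetary days will become shorter.",
                         "Planetary gravity will become stronger."],
                "label": ["A", "B", "C", "D"]},
    "answerKey": "C",
}

# Made-up record with numeric labels, only to illustrate the lookup
numeric_record = {
    "choices": {"text": ["first", "second", "third", "fourth"],
                "label": ["1", "2", "3", "4"]},
    "answerKey": "3",
}

for record in (letter_record, numeric_record):
    label = predicted_label(record["choices"], 2)
    print(label, label == record["answerKey"])
```

With `chr(65 + idx)` the second record would yield 'C' and never match its answer key of '3'.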

In [None]:
print('Loading MMLU test dataset...')
mmlu_all = load_dataset("parquet",
                        data_files={
                            "train": str(base_path / "MMLU" / "all" / "auxiliary_train-*.parquet"),
                            "dev": str(base_path / "MMLU" / "all" / "dev-*.parquet"),
                            "validation": str(base_path / "MMLU" / "all" / "validation-*.parquet"),
                            "test": str(base_path / "MMLU" / "all" / "test-*.parquet"),
                        },
                        cache_dir=str(base_path / "MMLU_Cache"),
                        split="test")

print('Loading ARC test dataset...')
arc_challenge = load_dataset("parquet",
                            data_files={
                                "train": str(base_path / "ARC" / "ARC-Challenge" / "train-*.parquet"),
                                "validation": str(base_path / "ARC" / "ARC-Challenge" / "validation-*.parquet"),
                                "test": str(base_path / "ARC" / "ARC-Challenge" / "test-*.parquet"),
                            },
                            cache_dir=str(base_path / "ARC_Cache"),
                            split="test")

# Evaluate accuracy on MMLU and ARC datasets
mmlu_accuracy_minipile = bench_mmlu(pythia_minipile, tokenizer, mmlu_all)
mmlu_accuracy_pile = bench_mmlu(pythia_pile, tokenizer, mmlu_all)
arc_accuracy_minipile = bench_arc_challenge(pythia_minipile, tokenizer, arc_challenge)
arc_accuracy_pile = bench_arc_challenge(pythia_pile, tokenizer, arc_challenge)

# Save results to JSON
with open(str(base_path / "benchmark_results.json"), "w") as f:  # base_dir is a plain str, base_path is the Path
    json.dump({
        "MMLU": {
            "minipile": mmlu_accuracy_minipile,
            "pile": mmlu_accuracy_pile
        },
        "ARC": {
            "minipile": arc_accuracy_minipile,
            "pile": arc_accuracy_pile
        }
    }, f)