# Benchmarking Pythia $160\text{M}$ pre-trained on The Pile vs. Pythia $160\text{M}$ trained on MiniPile

Objectives:
- [x] Prepare (Download) two models - **Pythia $160M$ Untrained** and **Pythia $160M$ fully Pile-trained**
- [x] Load MiniPile Dataset from disk
- [x] Train **Pythia $160M$ Untrained** on MiniPile (according to the MiniPile paper) *and save the model* (`pythia160m_minipile_trained`)
- [x] Evaluate the performance of **Pythia $160M$ Pile-trained** on MMLU, ARC, WinoGrande, HellaSwag, Lambada benchmarks
- [x] Evaluate the performance of **Pythia $160M$ MiniPile-trained** (`pythia160m_minipile_trained`) on MMLU, ARC, WinoGrande, HellaSwag, Lambada benchmarks

In [2]:
#! pip install transformers datasets torch accelerate evaluate wandb
! pip install lm-eval

DEPRECATION: Loading egg at /mnt/storage/miniconda3/envs/minipile/lib/python3.12/site-packages/huggingface_hub-0.26.2-py3.8.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
Collecting lm-eval
  Downloading lm_eval-0.4.5-py3-none-any.whl.metadata (44 kB)
Collecting jsonlines (from lm-eval)
  Downloading jsonlines-4.0.0-py3-none-any.whl.metadata (1.6 kB)
Collecting peft>=0.2.0 (from lm-eval)
  Downloading peft-0.13.2-py3-none-any.whl.metadata (13 kB)
Collecting pybind11>=2.6.2 (from lm-eval)
  Downloading pybind11-2.13.6-py3-none-any.whl.metadata (9.5 kB)
Collecting pytablewriter (from lm-eval)
  Downloading pytablewriter-1.2.0-py3-none-any.whl.metadata (37 kB)
Collecting rouge-score>=0.0.4 (from lm-eval)
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting sacrebleu>=1

In [3]:
import os
import json
import torch
import evaluate
import numpy as np
import transformers
from tqdm import tqdm
from pathlib import Path
from torch.optim import Adam
from datasets import load_dataset
from lm_eval import tasks, evaluator, utils
from huggingface_hub import snapshot_download
from transformers import AutoModelForSequenceClassification, pipeline, EvalPrediction
from transformers import DataCollatorForLanguageModeling
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments, get_scheduler

In [4]:
base_dir = "/mnt/data"
base_path = Path(base_dir)

---

## Download Pythia $160\text{M}$ Untrained and Pythia $160\text{M}$ Pile-Trained

In [5]:
def download_model(down_dir: str, target_folder: str, cache_folder: str, repo_id: str,
                   branch: str = "main", max_retries: int = 5) -> None:
    down_dir = Path(down_dir)
    target_dir = down_dir / target_folder
    cache_dir = down_dir / cache_folder

    os.makedirs(target_dir, exist_ok=True)
    os.makedirs(cache_dir, exist_ok=True)

    print(f"Downloading {repo_id}/{branch}...")

    # Retry on transient failures, but give up after max_retries attempts
    # instead of looping forever on a persistent error (e.g. a wrong repo_id)
    for attempt in range(1, max_retries + 1):
        try:
            snapshot_download(
                repo_id,
                repo_type="model",
                revision=branch,
                cache_dir=str(cache_dir),
                local_dir=str(target_dir)
            )
            break
        except Exception as e:
            print(f"Download attempt {attempt}/{max_retries} failed: {e}")
            if attempt == max_retries:
                raise

In [6]:
download_model(down_dir=base_dir, target_folder="pythia160m_dedup_untrained", 
               cache_folder="pythia160m_dedup_untrained_Cache",
               repo_id="EleutherAI/pythia-160m-deduped", branch="step0")

Downloading EleutherAI/pythia-160m-deduped/step0...


Fetching 6 files:   0%|          | 0/6 [00:00<?, ?it/s]

In [7]:
# https://huggingface.co/EleutherAI/pythia-160m/blob/main/README.md states:
# "[...] final step 143000 corresponds exactly to the model checkpoint on the main branch of each model."
download_model(down_dir=base_dir, target_folder="pythia160m_dedup_pile", 
               cache_folder="pythia160m_dedup_pile_Cache",
               repo_id="EleutherAI/pythia-160m-deduped", branch="main")

Downloading EleutherAI/pythia-160m-deduped/main...


Fetching 8 files:   0%|          | 0/8 [00:00<?, ?it/s]

---

## Load MiniPile Dataset from Disk

We expect the MiniPile dataset to already have been downloaded to disk at an earlier point.<br>
The logic for this can be found in the `01_get_piles` notebook.

In [6]:
# Loading minipile train + val splits from the local directory 
# https://stackoverflow.com/questions/77020278/how-to-load-a-huggingface-dataset-from-local-path
# https://github.com/MK2112/mobileYOLOv3/blob/main/mobileyolov3-cocotext.ipynb
# Split is named exactly like with the original dataset https://huggingface.co/datasets/JeanKaddour/minipile
minipile_train = load_dataset("parquet",
                              data_files={
                                  "train": str(base_path / "MiniPile" / "data" / "train-*.parquet"),
                                  "validation": str(base_path / "MiniPile" / "data" / "validation-*.parquet"),
                                  "test": str(base_path / "MiniPile" / "data" / "test-*.parquet")
                              },
                              cache_dir=str(base_path / "MiniPile_Cache"),
                              split="train")

minipile_val = load_dataset("parquet",
                            data_files={
                                "train": str(base_path / "MiniPile" / "data" / "train-*.parquet"),
                                "validation": str(base_path / "MiniPile" / "data" / "validation-*.parquet"),
                                "test": str(base_path / "MiniPile" / "data" / "test-*.parquet")
                            },
                            cache_dir=str(base_path / "MiniPile_Cache"),
                            split="validation")

---

## Hyperparameters for Pythia $160\text{M}$ Untrained on MiniPile

**Training Parameters for $160M$ for The Pile Deduplicated (*not* MiniPile)**<br>
See [Pythia Paper](https://arxiv.org/abs/2304.01373) (p. 22) and [Pythia GitHub](https://github.com/EleutherAI/pythia/blob/main/models/160M/pythia-160m-deduped.yml):

![](./img/pythia_train_params.png)

- Each model gets exposed to $299{,}892{,}736{,}000 \approx 300B$ tokens through training ($\approx 1.5$ epochs on The Pile)
- Batch size of $1024$ samples
- Sequence length of $2048$
- Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.95$, $\epsilon = 1 \times 10^{-8}$
- Learning rates vary by model size:
    - $70M$ model:  $10.0 \times 10^{-4}$
    - $160M$ model: $6.0 \times 10^{-4}$
    - $410M$ model: $3.0 \times 10^{-4}$
    - $1.0B$ model: $3.0 \times 10^{-4}$
    - $1.4B$ model: $2.0 \times 10^{-4}$
    - $2.8B$ model: $1.6 \times 10^{-4}$
    - $6.9B$ model: $1.2 \times 10^{-4}$
    - $12B$ model:  $1.2 \times 10^{-4}$
- train-iters $143000$
- lr-decay-iters $143000$
- lr-decay-style $\text{cosine}$
- lr-warmup $0.01$
- weight-decay $0.01$
- gradient-clipping $1.0$
- lr-min $0.1 \times \text{optimizer.params.lr}$ (which isn't in the paper)
- synchronize-each-layer $\text{True}$ (i.e. gradients are synchronized across all GPUs after each layer)
- LR Scheduling: Decays to a minimum of $0.1\times$ the maximum learning rate for all models
- (The tokenizer is the same as the one used for GPT-NeoX-20B)
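
The token count quoted at the top of this list follows directly from the batch size, sequence length and iteration count. A quick check (variable names are illustrative, not taken from the Pythia config):

```python
# Tokens seen during Pythia pre-training = samples per step * tokens per sample * steps
batch_size = 1024        # samples per optimizer step
seq_len = 2048           # tokens per sample
train_iters = 143_000    # optimizer steps

tokens_per_step = batch_size * seq_len
total_tokens = tokens_per_step * train_iters

print(f"Tokens per step: {tokens_per_step:,}")
print(f"Total tokens:    {total_tokens:,}")
```

So every optimizer step consumes roughly $2.1M$ tokens, which is the unit the iteration scaling further down works in.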

**Training Parameters for $160M$ for MiniPile**<br>
See [MiniPile paper](https://arxiv.org/abs/2304.08442)
- $1M/500/10k$ training/validation/test examples
    - Vocab size: $32309614$
    - Median document length: $294$
    - Longest document length: $929633$

**BERT Training Parameters for MiniPile**
- Adam, $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 1 \times 10^{-12}$
- weight-decay $0.001$
- One cycle policy with peak learning rate of $1 \times 10^{-3}$
- gradient-clipping $0.5$
- Progressive batch size from $128$ to $4096$ with a linear increase over the course of training up to $300k$ steps, no warmup
- $800k$ total training steps
- weight averaging of the $k = 5$ latest checkpoints and $1k$ steps distance between them

**T5 Training Parameters for MiniPile**
- AdamW, matrix-wise LR scaling by its root mean square (RMS), no weight decay
- base learning rate $0.02$
- cosine schedule with final of $1 \times 10^{-5}$
- gradient-clipping $1.0$
- batch size $288$
- $10k$ warmup steps, $65536$ total training steps
- weight averaging of the $k = 5$ latest checkpoints and $1k$ steps distance between them (akin to BERT)
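
Both recipes average the weights of the $k = 5$ latest checkpoints. A minimal sketch of that idea, assuming each checkpoint is a mapping from parameter name to a flat list of floats (plain lists stand in for tensors here; `average_checkpoints` is a hypothetical helper, not something from the MiniPile code):

```python
from typing import Dict, List

def average_checkpoints(checkpoints: List[Dict[str, List[float]]]) -> Dict[str, List[float]]:
    """Element-wise mean over checkpoints with identical parameter names and shapes."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    k = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # Walk through the same parameter position in all k checkpoints at once
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / k for values in columns]
    return averaged

# Toy example: k = 5 "checkpoints" of a model with a single 2-element parameter
toy_checkpoints = [{"w": [float(i), 2.0 * i]} for i in range(1, 6)]
avg = average_checkpoints(toy_checkpoints)
print(avg)
```

With real models the same loop would run over the tensors of each saved `state_dict`.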

These training parameters are a good start, but they can serve as rough guidance at most, because they were applied to encoder-only (BERT) and encoder-decoder (T5) models, not to decoder-only models like Pythia. Thus, if possible, one should look for decoder-only models trained solely on MiniPile as a more accurate guide for our own approach with Pythia.

Luckily there exists a [GPT NeoX 122M MiniPile](https://huggingface.co/euclaise/gpt-neox-122m-minipile-digits) model that can be reverse-engineered for our purposes.

In [7]:
# Load the training arguments from the MiniPile-trained decoder reference model GPT-NeoX-122M:
# https://huggingface.co/euclaise/gpt-neox-122m-minipile-digits

# Newer versions fail for missing attributes, 4.30.0 is documented to have been used
if str(transformers.__version__) == "4.30.0":
    training_args = torch.load(base_path / 'training_args_gptNEO122m.bin', weights_only=False)
    output_file = 'train_args_gptNEO122m_minipile.txt'
    try: 
        with open(output_file, 'w') as f:
            f.write("TrainingArguments attributes:\n")
            for attr in dir(training_args):
                if hasattr(training_args, attr) and not attr.startswith('_'):
                    value = getattr(training_args, attr)
                    f.write(f"- {attr}: {value}\n")
    except NameError as _:
        pass # Fully ignore NameError, appears every time
else:
    print('Skipped for version mismatch.')

Skipped for version mismatch.


The GPT-NeoX model card is a bit misleading: it states that the model was trained exclusively on MiniPile, yet the tiny learning rate of $5 \times 10^{-6}$ with no weight decay points to fine-tuning rather than training from scratch.

I did this mostly to get a feeling for how much the encoder-based model params deviate from the decoder-based model params.<br>
I interpret the results as not too far off, e.g. we use the exact same learning rate and optimizer.<br>

This implies that the training params on The Pile for Pythia $160M$ are a good starting point and we can scale these to accommodate the MiniPile dataset size and expect appropriate training effects.

Core parameters are however not directly transferable: `train-iters` and therefore also `lr-decay-iters`.<br>
For The Pile Deduplicated this was $143000$, but we have to scale it to the MiniPile dataset size, because the number of tokens the model processes is crucial for training. Left unadjusted, the model would pass over MiniPile many times, overfit, and no longer accurately reflect what can be retained from the dataset.

In other words, overshooting distorts dataset knowledge, while undershooting leads to underfitting and insufficient representation of the dataset.

In [8]:
# I use the byte sizes as proxy for the number of tokens, as both datasets will get tokenized with the same tokenizer
minipile_train_bytes = 5906108510 # see https://huggingface.co/datasets/JeanKaddour/minipile/blob/main/README.md
pile_train_bytes = 824546807506   # see https://huggingface.co/datasets/EleutherAI/the_pile_deduplicated/blob/main/dataset_infos.json
pile_effective_epochs = 1.5       # this many epochs are actually trained in the original model (calculation isn't affected, training params below are)

scale_factor = (pile_train_bytes * pile_effective_epochs) / (minipile_train_bytes * pile_effective_epochs)
print(f"Byte-based scale factor: {scale_factor:10.6f}x")
print(f"MiniPile (scaled) Train-Iters/LR-Decay-Iters: {143000 / scale_factor:.3f} ~ {round(143000 / scale_factor)}")

Byte-based scale factor: 139.609153x
MiniPile (scaled) Train-Iters/LR-Decay-Iters: 1024.288 ~ 1024


At this point the $1024$ for training iterations may seem awkwardly small.<br>
But, to reiterate, we strictly scaled down the iterations according to the difference in dataset size.

While this may seem horrible in most other cases, as we thoroughly neuter exposure to data, this scale-correct limiting and overall lower exposure is exactly what we need here to operate relative to the original Pythia training. After all, the goal is to compare knowledge retention and generalization capabilities achievable on `The Pile Deduplicated` vs. the 'distilled' `MiniPile` under size-appropriate, similar conditions. Therefore, scaling the `train-iters` and therefore also `lr-decay-iters` using byte sizes as a proxy is actually appropriate here.
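
The scaled step count can also be cross-checked in tokens rather than bytes. The sketch below assumes every sample fills all $2048$ positions, so the MiniPile figure is an upper bound on what the model actually sees:

```python
batch_size = 1024
seq_len = 2048
pile_iters = 143_000     # Pythia on The Pile Deduplicated
minipile_iters = 1024    # scaled value from the cell above

pile_tokens = batch_size * seq_len * pile_iters
minipile_tokens = batch_size * seq_len * minipile_iters
ratio = pile_tokens / minipile_tokens

print(f"Pile token budget:     {pile_tokens:,}")
print(f"MiniPile token budget: {minipile_tokens:,}")
print(f"Ratio: {ratio:.3f}x")
```

Because batch size and sequence length stay fixed, the token ratio equals the ratio of iterations, and it stays close to the byte-based scale factor by construction.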

We can now lay out the complete parameters:<br>
With the three approach descriptions retrieved, we can take a more educated guess at the training params for Pythia $160M$ on MiniPile:

- Adam optimizer (GPT NeoX and T5-Base MiniPile suggest the 'generally more stable' AdamW, but Pythia uses Adam so we keep it most similar)
    - $\beta_1 = 0.9$, $\beta_2 = 0.95$, (Pythia)
    - $\epsilon = 1 \times 10^{-8}$ (GPT NeoX and Pythia)
    - learning rate $6 \times 10^{-4}$ (Pythia)
    - lr-schedule $\text{cosine annealing}$ (Pythia)
    - lr-warmup $0.01$ of total steps (Pythia)
    - lr-min $0.1 \times \text{lr}$ (Pythia)
    - weight-decay $1 \times 10^{-2}$ (Pythia)
- gradient-clipping $1.0$ (Pythia)
- batch size $1024$ (Pythia, probably grad accum needed, expect multi-GPU)
- sequence length $2048$ (Pythia)
- **train-iters: $1024$ (MiniPile-specific)**
- **lr-decay-iters: same as train-iters (MiniPile-specific)**
- (won't do mixed precision for sake of most similar training conditions to Pile-trained Pythia)
- (won't do weight averaging)
- **Same GPT-NeoX-20B tokenizer as for Pythia-Pile**
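
To make the learning-rate part of this list concrete, here is a small sketch of the resulting schedule: $1\%$ warmup, cosine decay, and a floor at $0.1 \times$ the peak rate. That the warmup rises linearly from zero is an assumption on my part, and `lr_at` is a hypothetical helper:

```python
import math

def lr_at(step: int, peak_lr: float = 6e-4, total_steps: int = 1024,
          warmup_frac: float = 0.01, min_ratio: float = 0.1) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_ratio * peak_lr."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Interpolate between the floor (min_ratio) and the peak (1.0)
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)

for s in (0, 10, 517, 1024):
    print(f"step {s:4d}: lr = {lr_at(s):.2e}")
```

Building the floor into the schedule, rather than clamping afterwards, keeps the warmup phase intact: a clamp at $0.1 \times$ the peak would lift the very first steps straight to the minimum rate.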

We can start training Pythia $160\text{M}$ on MiniPile.

---

## Train Pythia $160\text{M}$ Untrained on MiniPile

In [8]:
# Load the untrained Pythia 160M tokenizer and model
# https://stackoverflow.com/questions/64001128/load-a-pre-trained-model-from-disk-with-huggingface-transformers
# Tokenizer is in fact a GPTNeoXTokenizer, only has a fast version available
tokenizer = AutoTokenizer.from_pretrained(base_path / "pythia160m_dedup_untrained", use_fast=True, local_files_only=True)
empty_model = AutoModelForCausalLM.from_pretrained(base_path / "pythia160m_dedup_untrained", local_files_only=True)

The Pythia paper states a standard configuration where individual examples consist of up to $2048$ tokens.<br>
This explains why the tokenizer doesn't contain a padding token, as the model is trained on variable-length sequences with this upper bound instead.
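
Because examples vary in length, each batch still has to be padded to its longest member when it is collated. A minimal sketch of that idea in plain Python (`pad_batch` is a hypothetical helper; the actual collator works on tensors):

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def pad_batch(sequences: List[List[int]], pad_id: int) -> Dict[str, List[List[int]]]:
    """Right-pad token id lists to the longest sequence in the batch."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(seq + [pad_id] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # Causal LM labels are the inputs themselves; padded positions are ignored
        batch["labels"].append(seq + [IGNORE_INDEX] * n_pad)
    return batch

batch = pad_batch([[5, 6, 7], [8, 9]], pad_id=0)
print(batch)
```

One difference worth knowing: as far as I can tell, `DataCollatorForLanguageModeling` with `mlm=False` masks labels by comparing against `pad_token_id` rather than by position. When EOS doubles as the pad token, genuine EOS tokens inside a document are then excluded from the loss as well.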

In [10]:
# Tokenizer doesn't have a pad token, use EOS as a substitute
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

def tokenize(example): 
    # seq_len = max_length = 2048 (as upper boundary, so not strict size -> no padding needed)
    return tokenizer(example["text"], 
                     truncation=True, 
                     max_length=2048,
                     return_special_tokens_mask=True)

if os.path.exists(base_path / "minipile_train_tokenized"):
    minipile_train_tokenized = load_dataset("arrow", data_files=str(base_path / "minipile_train_tokenized/*.arrow"), split="train")
    minipile_val_tokenized = load_dataset("arrow", data_files=str(base_path / "minipile_val_tokenized/*.arrow"), split="train")
else:
    minipile_train_tokenized = minipile_train.map(tokenize, batched=True, remove_columns=minipile_train.column_names) # retain only new fields from tokenization
    minipile_val_tokenized = minipile_val.map(tokenize, batched=True, remove_columns=minipile_val.column_names)
    minipile_train_tokenized.save_to_disk(base_path / "minipile_train_tokenized")
    minipile_val_tokenized.save_to_disk(base_path / "minipile_val_tokenized")

# Dynamic padding during training (mlm -> mask language model -> we're doing causal here)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

In [9]:
seed = 42
torch.manual_seed(seed)
np.random.seed(seed)

if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)
    device_count = torch.cuda.device_count()
    print(f"Available GPUs: {device_count}")
    for i in range(device_count):
        device = torch.device(f'cuda:{i}')
        device_properties = torch.cuda.get_device_properties(device)
        total_mem = device_properties.total_memory / (1024 ** 3)
        allocd_mem = torch.cuda.memory_allocated(device) / (1024 ** 3)
        free_mem = total_mem - allocd_mem
        print(f"\nGPU {i}:\t{device_properties.name}")
        print(f"\tTotal memory:\t\t{total_mem:.2f} GiB")
        print(f"\tAllocated memory:\t{allocd_mem:5.2f} GiB")
        print(f"\tFree memory:\t\t{free_mem:.2f} GiB")
else:
    print("No CUDA-capable GPUs available")

Available GPUs: 1

GPU 0:	NVIDIA GeForce RTX 3060
	Total memory:		11.76 GiB
	Allocated memory:	 0.00 GiB
	Free memory:		11.76 GiB


In [13]:
output_dir = str(base_path / "pythia160m_minipile_trained")
log_dir = str(base_path / "160m_minipile_logs")
os.makedirs(output_dir, exist_ok=True)
os.makedirs(log_dir, exist_ok=True)

# https://huggingface.co/docs/transformers/v4.46.0/en/main_classes/trainer#transformers.TrainingArguments
training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    num_train_epochs=1.5,            # Ignored once max_steps is set (max_steps takes precedence); kept to mirror the ~1.5 epochs on The Pile
    per_device_train_batch_size=4,   # Gives an effective batch size of 1024 after grad accum
    per_device_eval_batch_size=4,    # Same as training batch size
    gradient_accumulation_steps=256, # Achieve a batch size of 1024
    learning_rate=6e-4,              # Default Pythia 160M
    weight_decay=0.01,               # Default Pythia 160M
    max_steps=1024,                  # Adjusted for MiniPile (https://discuss.huggingface.co/t/how-does-max-steps-affect-the-number-of-samples-the-model-sees/69681)
    lr_scheduler_type="cosine",      # As per Pythia 160M paper
    warmup_steps=int(0.01 * 1024),   # 1% of total steps for warmup
    logging_dir=log_dir,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=100,     # Frequency for evaluation during training
    save_steps=1024,    # Save at the end of training
    save_total_limit=1, # Only keep the most recent checkpoint
    fp16=False,         # Not using mixed precision for comparable conditions
    report_to="none",   # Disables logging integrations (None falls back to all installed ones); maybe set to "wandb" or "tensorboard" in later iterations
    ddp_find_unused_parameters=False, # see https://discuss.pytorch.org/t/how-to-change-ddp-parameter-find-unused-parameters-true-to-false-during-training/130763
    max_grad_norm=1.0,  # As per Pythia 160M paper
)

# Move the model to the GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
empty_model = empty_model.to(device)

In [None]:
import math
from torch.optim.lr_scheduler import LambdaLR

optimizer = Adam(empty_model.parameters(), lr=training_args.learning_rate, betas=(0.9, 0.95), eps=1e-8, weight_decay=training_args.weight_decay)

# Linear warmup, then cosine decay to 0.1 * peak lr (Pythia's lr-min), expressed as a multiplier on the peak lr.
# The floor is part of the schedule itself, so the warmup phase is not clamped upwards.
min_lr_ratio = 0.1

def lr_lambda(step: int) -> float:
    warmup_steps = training_args.warmup_steps
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = min(1.0, (step - warmup_steps) / max(1, training_args.max_steps - warmup_steps))
    return min_lr_ratio + (1.0 - min_lr_ratio) * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)

# Train Pythia 160M Untrained on MiniPile
# https://huggingface.co/docs/transformers/v4.46.0/en/main_classes/trainer
# Passing the scheduler to the Trainer lets trainer.train() handle gradient accumulation,
# gradient clipping (max_grad_norm) and the max_steps limit. A manual loop that calls
# optimizer.step() after every micro-batch would bypass all three.
trainer = Trainer(model=empty_model,
                  args=training_args,
                  train_dataset=minipile_train_tokenized,
                  eval_dataset=minipile_val_tokenized,
                  data_collator=data_collator,
                  optimizers=(optimizer, scheduler))

trainer.train()

# Model weights and tokenizer are separate objects, hence the two save calls
trainer.save_model(str(base_path / "pythia160m_minipile_trained")) # This saves the model weights
tokenizer.save_pretrained(str(base_path / "pythia160m_minipile_trained")) # This saves the tokenizer (don't know if needed, better safe than sorry)

Training of Pythia $160\text{M}$ on MiniPile was done on `gruenau8` with the `02_train_160M.py` script.

---

## Evaluate Pythia $160\text{M}$ MiniPile vs. Pythia $160\text{M}$ Pile-Trained on Benchmarks

### AI2 Reasoning Challenge (ARC-Challenge)
[Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge (Clark, et al. 2018)](https://arxiv.org/abs/1803.05457#)
- grade-school science questions, require reasoning efforts beyond just information retrieval
- tests specifically for reaction to new concepts through cross-domain generalization
- requires broader domain knowledge, reasoning capabilities and combinatory skills for responses
- helps assess whether a MiniPile-trained model retains *the flexibility and cross-domain generalization capabilities* of The Pile-trained model

### Massive Multitask Language Understanding (MMLU)
[Measuring Massive Multitask Language Understanding (Hendrycks, et al. 2021)](https://arxiv.org/abs/2009.03300)
- comprehensive, inputs/examples span $57$ different tasks
- specifically testing across *all* subjects
- helps assess whether a MiniPile-trained model retains *specifically the breadth of knowledge* of The Pile-trained model

### HellaSwag
[HellaSwag: Can a Machine *Really* Finish Your Sentence (Zellers, et al. 2019)](https://aclanthology.org/P19-1472.pdf)
- presenting models with a situation and asking them to choose the most plausible continuation
- I've seen it in Karpathy's series and thus knew it was applicable to decoder-only models, and therefore to Pythia
- tests for context understanding, conversational capabilities and generalization
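
For a decoder-only model, "choosing the most plausible continuation" comes down to scoring each candidate ending by its log-likelihood given the context and taking the best one. The sketch below uses made-up log-likelihoods and a hypothetical helper `pick_choice`. The length normalization reflects my understanding of how `acc_norm` differs from `acc` in the evaluation harness (dividing by the continuation's length in characters), so treat that detail as an assumption:

```python
from typing import List

def pick_choice(log_likelihoods: List[float], choices: List[str], normalize: bool) -> int:
    """Return the index of the highest-scoring continuation."""
    scores = [
        ll / len(choice) if normalize else ll
        for ll, choice in zip(log_likelihoods, choices)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# Made-up log-likelihoods for two candidate endings of different length
choices = ["sits.", "mounts the board and rides across the lake."]
log_likelihoods = [-6.0, -20.0]

raw_pick = pick_choice(log_likelihoods, choices, normalize=False)
norm_pick = pick_choice(log_likelihoods, choices, normalize=True)
print(raw_pick, norm_pick)
```

Unnormalized scores favor short endings, because every additional token can only lower the summed log-likelihood. That is why the two picks can disagree.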

### WinoGrande
[WinoGrande: An Adversarial Winograd Schema Challenge at Scale (Sakaguchi, et al. 2019)](https://arxiv.org/abs/1907.10641)
- testing the ability to determine pronouns based on commonsense reasoning
- checks whether beyond dataset contents, the model can just as deeply perceive/learn language understanding and reasoning capabilities
- helps answer the question of whether reducing The Pile to its most relevant data with "topic-selective k-Means" (i.e. MiniPile) still retains an overall language understanding

### LAnguage Modeling Broadened to Account for Discourse Aspects (LAMBADA (OpenAI))
[The LAMBADA dataset: Word prediction requiring a broad discourse context (Paperno, et al. 2016)](https://arxiv.org/abs/1606.06031)
- evaluates a model's ability to predict a last word of some text, where generating the right word requires long-range context understanding
- tests for context understanding, conversational capabilities and generalization
- not as directly applicable to our problem set here, but helps calibrate the benchmark pipeline, because numbers were reported for Pythia $160M$ on this benchmark
- this in turn gives more credibility and a better look into the MiniPile Pythia's performance overall
- maybe reducing dataset size can disturb long range context processing, this helps us evaluate that
- using the OpenAI version for comparability with the Pythia paper
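
LAMBADA is reported with two numbers, accuracy and perplexity. A small sketch of how both can be derived from per-example results, assuming each example yields the log-likelihood of the target word and a flag for whether greedy decoding reproduces it (`lambada_metrics` and the toy values are hypothetical):

```python
import math
from typing import List, Tuple

def lambada_metrics(items: List[Tuple[float, bool]]) -> Tuple[float, float]:
    """items: (log-likelihood of the target word, greedy decoding matches the target)."""
    accuracy = sum(1 for _, is_greedy in items if is_greedy) / len(items)
    # Perplexity is the exponential of the negative mean log-likelihood
    perplexity = math.exp(-sum(ll for ll, _ in items) / len(items))
    return accuracy, perplexity

toy = [(-1.0, True), (-3.0, False), (-2.0, True), (-2.0, False)]
acc, ppl = lambada_metrics(toy)
print(f"acc = {acc:.2f}, perplexity = {ppl:.3f}")
```

Lower perplexity means the model puts more probability on the correct last word, even in cases where its single top guess is wrong.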

In [10]:
from lm_eval import utils, simple_evaluate
from lm_eval.models.huggingface import HFLM

In [16]:
## Evaluation - Pythia 160M Trained on Pile

# Showcase uses Pythia: https://colab.research.google.com/github/EleutherAI/lm-evaluation-harness/blob/main/examples/lm-eval-overview.ipynb
# Only genuine doc seems to be (a mess): https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs

# Load model and tokenizer
pythia_pile = AutoModelForCausalLM.from_pretrained(base_path / "pythia160m_dedup_pile", local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained(base_path / "pythia160m_dedup_untrained", use_fast=True, local_files_only=True)
pythia_pile = pythia_pile.to(device)

# From the HuggingFace Data Views (allenai/ai2_arc, cais/mmlu, allenai/winogrande, Rowan/hellaswag):
# MMLU: {'answer': 1, 'choices': ['0', '4', '2', '6'], 'question': 'Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.', 'subject': 'abstract_algebra'}
# ARC-C: {'id': 'Mercury_7175875', 'question': 'An astronomer observes that a planet rotates faster after a meteorite impact. Which is the most likely effect of this increase in rotation?', 'choices': {'text': ['Planetary density will decrease.', 'Planetary years will become longer.', 'Planetary days will become shorter.', 'Planetary gravity will become stronger.'], 'label': ['A', 'B', 'C', 'D']}, 'answerKey': 'C'}
# Winogrande: {'sentence': "Ian volunteered to eat Dennis's menudo after already having a bowl because _ despised eating intestine.", 'option1': 'Ian', 'option2': 'Dennis', 'answer': '2'} (answer content not contained in test set)
# HellaSwag: {'ind': 14, 'activity_label': 'Wakeboarding', 'ctx_a': 'A man is being pulled on a water ski as he floats in the water casually.', 'ctx_b': 'he', 'ctx': 'A man is being pulled on a water ski as he floats in the water casually. he', 'endings': ['mounts the water ski and tears through the water at fast speeds.', 'goes over several speeds, trying to stay upright.', 'struggles a little bit as he talks about it.', 'is seated in a boat with three other people.'], 'source_id': 'activitynet~v_-5KAycAQlC4', 'split': 'test', 'split_type': 'indomain', 'label': ''}
# Lambada (OAI): {'text': 'In my palm is a clear stone, and inside it is a small ivory statuette. A guardian angel.\n\n"Figured if you\'re going to be out at night getting hit by cars, you might as well have some backup."\n\nI look at him, feeling stunned. Like this is some sort of sign. But as I stare at Harlin, his mouth curved in a confident grin, I don\'t care about signs'}
 
batch_size_hflm = 1

# Found that in here of all things: https://github.com/pytorch/ao/blob/e2301e9dba91fa962d673fdc3b3f0002856a3ba7/torchao/_models/_eval.py#L17-L22
# https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/models/huggingface.py
pythia_pile_hflm = HFLM(pretrained=pythia_pile,
                        tokenizer=tokenizer,
                        batch_size=batch_size_hflm)

# Thankfully both MMLU and ARC are available in the lm_eval.tasks module
# https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/arc
# https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/mmlu
# Initially evaluator.evaluate looked promising, but I don't understand it and this works
# Found simple_evaluate in https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md
# Winogrande is in fact 'winogrande_xl', see https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/winogrande/default.yaml
results = simple_evaluate(model=pythia_pile_hflm,
                          tasks=["arc_challenge", "mmlu", "winogrande", "hellaswag", "lambada"], # I have no idea how to inject pre-downloaded datasets here, I gave up on that
                          num_fewshot=0,  # Pythia paper stated they used zero-shot
                          batch_size=batch_size_hflm,
                          device="cuda",
                          limit=None)

# Save for reference (for proof of table below)
with open('02_eval_160M_pretrained.txt', 'w') as f:
    f.write(str(results))

# Manually saved this to 02_eval_160M_pretrained_table.txt
# This was nicely documented. Not. https://raw.githubusercontent.com/pytorch/torchtune/main/recipes/eleuther_eval.py
print(utils.make_table(results))

2024-11-13:13:16:08,314 INFO     [huggingface.py:481] Using model type 'default'
2024-11-13:13:16:08,328 INFO     [evaluator.py:164] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2024-11-13:13:16:08,329 INFO     [evaluator.py:217] Using pre-initialized model
2024-11-13:13:16:14,701 INFO     [__init__.py:459] The tag 'arc_ca' is already registered as a group, this tag will not be registered. This may affect tasks you want to call.
2024-11-13:13:16:14,708 INFO     [__init__.py:459] The tag 'arc_ca' is already registered as a group, this tag will not be registered. This may affect tasks you want to call.
Using the latest cached version of the dataset since allenai/ai2_arc couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'ARC-Challenge' at /home/marcus/.cache/huggingface/datasets/allenai___ai2_arc/ARC-Challenge/0.0.0/210d026faf9955653af8916fad021475a3f00453 (last modifie

README.md:   0%|          | 0.00/7.32k [00:00<?, ?B/s]

train-00000-of-00002.parquet:   0%|          | 0.00/269M [00:00<?, ?B/s]

train-00001-of-00002.parquet:   0%|          | 0.00/281M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/1.14M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/1.08M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2662 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5153 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/4869 [00:00<?, ? examples/s]

0000.parquet:   0%|          | 0.00/1.16M [00:00<?, ?B/s]

Generating test split:   0%|          | 0/5153 [00:00<?, ? examples/s]

2024-11-13:13:19:53,585 INFO     [task.py:415] Building contexts for lambada_openai on rank 0...
100%|██████████| 5153/5153 [00:14<00:00, 350.22it/s]
2024-11-13:13:20:08,591 INFO     [task.py:415] Building contexts for lambada_standard on rank 0...
100%|██████████| 5153/5153 [00:14<00:00, 351.16it/s]
2024-11-13:13:20:23,603 INFO     [task.py:415] Building contexts for hellaswag on rank 0...
100%|██████████| 10042/10042 [00:06<00:00, 1632.85it/s]
2024-11-13:13:20:32,752 INFO     [task.py:415] Building contexts for winogrande on rank 0...
100%|██████████| 1267/1267 [00:00<00:00, 43889.85it/s]
2024-11-13:13:20:32,917 INFO     [task.py:415] Building contexts for mmlu_anatomy on rank 0...
100%|██████████| 135/135 [00:00<00:00, 383.77it/s]
2024-11-13:13:20:33,293 INFO     [task.py:415] Building contexts for mmlu_astronomy on rank 0...
100%|██████████| 152/152 [00:00<00:00, 381.86it/s]
2024-11-13:13:20:33,717 INFO     [task.py:415] Building contexts for mmlu_college_chemistry on rank 0...
100

bootstrapping for stddev: perplexity


100%|██████████| 100/100 [00:07<00:00, 12.74it/s]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

bootstrapping for stddev: perplexity


100%|██████████| 100/100 [00:07<00:00, 12.66it/s]
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 768)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-11): 12 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXSdpaAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=768, out_features=2304, bias=True)
          (dense): Linear(in_features=768, out_features=768, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=768, out_features=3072, bias=True)
          (dense_4h_to_h): Linear(in_features=3072, out_f

|                 Tasks                 |Version|Filter|n-shot|  Metric  |   | Value  |   |Stderr|
|---------------------------------------|------:|------|-----:|----------|---|-------:|---|-----:|
|arc_challenge                          |      1|none  |     0|acc       |↑  |  0.1997|±  |0.0117|
|                                       |       |none  |     0|acc_norm  |↑  |  0.2398|±  |0.0125|
|hellaswag                              |      1|none  |     0|acc       |↑  |  0.2903|±  |0.0045|
|                                       |       |none  |     0|acc_norm  |↑  |  0.3136|±  |0.0046|
|lambada_openai                         |      1|none  |     0|acc       |↑  |  0.3689|±  |0.0067|
|                                       |       |none  |     0|perplexity|↓  | 31.2590|±  |1.1594|
|lambada_standard                       |      1|none  |     0|acc       |↑  |  0.2333|±  |0.0059|
|                                       |       |none  |     0|perplexity|↓  |172.7634|±  |7.7266|
|mmlu     

- The advantage of this approach is that processing is standardized, and I know it works with Pythia models because the documentation uses Pythia as its de facto example
    - However, with this implementation I cannot see what is actually going on under the hood
    - That doesn't feel right, so I'll read up on it and write more about it here. But the pipeline works
- How do we know this benchmarking pipeline actually benchmarks correctly?
    - Compare our results against the Pythia benchmarks reported in the paper
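
To make the `acc` and `acc_norm` columns in the results table a little less opaque, here is a rough sketch of how zero-shot multiple-choice scoring is commonly done: the model assigns a log-likelihood to each answer option, and the prediction is the option with the highest score, either raw (`acc`) or divided by the length of the option text (`acc_norm`). This is my simplified reading, not the harness's actual code; the function name `pick_choice` and all numbers are made up for illustration.

```python
# Illustrative sketch only: simplified multiple-choice scoring,
# not the LM-Eval harness implementation.

def pick_choice(loglikelihoods, choices, normalize=False):
    """Return the index of the answer option the model prefers.

    loglikelihoods[i] is the summed log-probability of choices[i]
    given the question. With normalize=True each score is divided by
    the length of the option text, which reduces the bias toward
    short options (the idea behind acc_norm).
    """
    scores = []
    for ll, text in zip(loglikelihoods, choices):
        length = len(text) if normalize else 1
        scores.append(ll / length)
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up scores for a single question with two options
choices = ["Yes", "It depends on the temperature"]
loglikelihoods = [-4.0, -12.0]

raw_pick = pick_choice(loglikelihoods, choices)                   # basis for acc
norm_pick = pick_choice(loglikelihoods, choices, normalize=True)  # basis for acc_norm
```

Accuracy is then the fraction of questions where the picked index matches the gold answer, which is why `acc` and `acc_norm` can differ for the same model.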

![](./img/pythia_paper_dedup_benchmarks.png)

I initially reported that I would benchmark only on ARC-Challenge and MMLU.<br>
Shortly afterwards, I decided to use more benchmarks, specifically some of those mentioned in the Pythia paper, so that I could cross-check whether the evaluation pipeline itself is correct. As a side effect, we get a more information-rich benchmarking report. Who wouldn't want that?

The overlapping confidence intervals suggest that the performances on WinoGrande and ARC-Challenge are consistent with the reported results.<br>
This is narrowly not the case for Lambada (OpenAI).<br>
Note that the benchmarking approach used here also reports larger standard errors (`stderr`).<br>
I suppose the differences come from different versions of the LM-Eval harness being used, and thus from processing differences. (I further assume EleutherAI used their own LM-Eval harness to test their models, but that is speculative.) I don't regard this as a source of error, though.

Since a notable divergence only occurs for Lambada, I attribute that deviation to evaluation harness variance as well.
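
The overlap check described above can be sketched as follows, assuming a normal approximation so that a 95% interval is mean ± 1.96 · stderr. The measured value is the ARC-Challenge `acc` from the results table; the reference value is a hypothetical placeholder, not a number from the Pythia paper, and should be replaced with the value read from the figure.

```python
def confidence_interval(mean, stderr, z=1.96):
    """Return the (lower, upper) bounds of a normal-approximation interval."""
    return (mean - z * stderr, mean + z * stderr)


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# ARC-Challenge acc measured in this notebook (mean, stderr)
measured = confidence_interval(0.1997, 0.0117)

# HYPOTHETICAL placeholder, not taken from the paper: fill in the
# reported mean and stderr from the figure above before drawing conclusions.
reference = confidence_interval(0.21, 0.012)

consistent = intervals_overlap(measured, reference)
```

Overlapping intervals are only a coarse consistency check; they do not prove that two evaluation setups are identical.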

We can go ahead and evaluate Pythia $160\text{M}$ MiniPile with the same setting.

In [11]:
download_model(down_dir=base_dir, target_folder="pythia160m_minipile_trained", 
               cache_folder="pythia160m_minipile_trained_Cache",
               repo_id="Marcus2112/pythia160m_minipile")

Downloading Marcus2112/pythia160m_minipile/main...


Fetching 7 files:   0%|          | 0/7 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/3.56M [00:00<?, ?B/s]

README.md:   0%|          | 0.00/11.0 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/4.83k [00:00<?, ?B/s]

.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/441 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/819 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/495M [00:00<?, ?B/s]

In [12]:
## Evaluation - Pythia 160M Trained on MiniPile

# Load the MiniPile-trained weights from disk (no network access)
pythia_minipile = AutoModelForCausalLM.from_pretrained(base_path / "pythia160m_minipile_trained", local_files_only=True)
# Use the exact same tokenizer as for the other Pythia 160M runs
tokenizer = AutoTokenizer.from_pretrained(base_path / "pythia160m_dedup_untrained", use_fast=True, local_files_only=True)
pythia_minipile = pythia_minipile.to(device)

batch_size_hflm = 1

# Wrap the model so the LM-Eval harness can drive it
pythia_minipile_hflm = HFLM(pretrained=pythia_minipile,
                        tokenizer=tokenizer,
                        batch_size=batch_size_hflm)

# Zero-shot evaluation on the full evaluation sets (limit=None)
results = simple_evaluate(model=pythia_minipile_hflm,
                          tasks=["arc_challenge", "mmlu", "winogrande", "hellaswag", "lambada"],
                          num_fewshot=0,
                          batch_size=batch_size_hflm,
                          device="cuda",
                          limit=None)

# Keep the raw results dict for the later comparison
with open('02_eval_160M_minipile.txt', 'w') as f:
    f.write(str(results))

print(utils.make_table(results))

Some weights of GPTNeoXForCausalLM were not initialized from the model checkpoint at /mnt/data/pythia160m_minipile_trained and are newly initialized: ['embed_out.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2024-11-15:12:54:24,506 INFO     [huggingface.py:481] Using model type 'default'
2024-11-15:12:54:24,520 INFO     [evaluator.py:164] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2024-11-15:12:54:24,521 INFO     [evaluator.py:217] Using pre-initialized model
2024-11-15:12:54:29,191 INFO     [__init__.py:459] The tag 'arc_ca' is already registered as a group, this tag will not be registered. This may affect tasks you want to call.
2024-11-15:12:54:29,209 INFO     [__init__.py:459] The tag 'arc_ca' is already registered as a group, this tag will not be registered. This may affect tasks you want to call.


README.md:   0%|          | 0.00/9.00k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/190k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/204k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/55.7k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1119 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1172 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/299 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/1.11k [00:00<?, ?B/s]

mmlu_no_train.py:   0%|          | 0.00/5.86k [00:00<?, ?B/s]

data.tar:   0%|          | 0.00/166M [00:00<?, ?B/s]

Generating test split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating dev split: 0 examples [00:00, ? examples/s]

README.md:   0%|          | 0.00/9.97k [00:00<?, ?B/s]

winogrande.py:   0%|          | 0.00/5.65k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/3.40M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/40398 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1767 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1267 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/6.84k [00:00<?, ?B/s]

hellaswag.py:   0%|          | 0.00/4.36k [00:00<?, ?B/s]

dataset_infos.json:   0%|          | 0.00/2.53k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/47.5M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/11.8M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/12.2M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/39905 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/10003 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10042 [00:00<?, ? examples/s]

Map:   0%|          | 0/39905 [00:00<?, ? examples/s]

Map:   0%|          | 0/10042 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/4.99k [00:00<?, ?B/s]

lambada_openai.py:   0%|          | 0.00/4.82k [00:00<?, ?B/s]

0000.parquet:   0%|          | 0.00/1.16M [00:00<?, ?B/s]

Generating test split:   0%|          | 0/5153 [00:00<?, ? examples/s]



README.md:   0%|          | 0.00/7.32k [00:00<?, ?B/s]

train-00000-of-00002.parquet:   0%|          | 0.00/269M [00:00<?, ?B/s]

train-00001-of-00002.parquet:   0%|          | 0.00/281M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/1.14M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/1.08M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2662 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5153 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/4869 [00:00<?, ? examples/s]

2024-11-15:12:57:30,096 INFO     [task.py:415] Building contexts for lambada_standard on rank 0...
100%|██████████| 5153/5153 [00:13<00:00, 385.12it/s]
2024-11-15:12:57:43,762 INFO     [task.py:415] Building contexts for lambada_openai on rank 0...
100%|██████████| 5153/5153 [00:13<00:00, 384.10it/s]
2024-11-15:12:57:57,400 INFO     [task.py:415] Building contexts for hellaswag on rank 0...
100%|██████████| 10042/10042 [00:06<00:00, 1663.43it/s]
2024-11-15:12:58:05,943 INFO     [task.py:415] Building contexts for winogrande on rank 0...
100%|██████████| 1267/1267 [00:00<00:00, 54282.86it/s]
2024-11-15:12:58:06,078 INFO     [task.py:415] Building contexts for mmlu_machine_learning on rank 0...
100%|██████████| 112/112 [00:00<00:00, 436.44it/s]
2024-11-15:12:58:06,349 INFO     [task.py:415] Building contexts for mmlu_astronomy on rank 0...
100%|██████████| 152/152 [00:00<00:00, 439.28it/s]
2024-11-15:12:58:06,715 INFO     [task.py:415] Building contexts for mmlu_college_mathematics on ra

bootstrapping for stddev: perplexity


100%|██████████| 100/100 [00:06<00:00, 14.40it/s]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

bootstrapping for stddev: perplexity


100%|██████████| 100/100 [00:07<00:00, 14.28it/s]
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 768)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-11): 12 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXSdpaAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=768, out_features=2304, bias=True)
          (dense): Linear(in_features=768, out_features=768, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=768, out_features=3072, bias=True)
          (dense_4h_to_h): Linear(in_features=3072, out_f

|                 Tasks                 |Version|Filter|n-shot|  Metric  |   |    Value    |   |   Stderr   |
|---------------------------------------|------:|------|-----:|----------|---|------------:|---|-----------:|
|arc_challenge                          |      1|none  |     0|acc       |↑  |       0.2125|±  |      0.0120|
|                                       |       |none  |     0|acc_norm  |↑  |       0.2628|±  |      0.0129|
|hellaswag                              |      1|none  |     0|acc       |↑  |       0.2560|±  |      0.0044|
|                                       |       |none  |     0|acc_norm  |↑  |       0.2619|±  |      0.0044|
|lambada_openai                         |      1|none  |     0|acc       |↑  |       0.0000|±  |      0.0000|
|                                       |       |none  |     0|perplexity|↓  | 3138574.4432|±  | 302684.4607|
|lambada_standard                       |      1|none  |     0|acc       |↑  |       0.0000|±  |      0.0000|
|         

## What's the catch?

I calculated the "Percentage Difference of Means" and the "95% Confidence Interval" of the difference of means in the `MiniPile_Pile_Benchmark_Comparisons.ods` spreadsheet.<br>
The key results for the MiniPile-trained Pythia $160M$ model, compared against the Pile-trained Pythia $160M$ model, are:

| Benchmark        | Measure      | 160M Pile Deduplicated | 160M MiniPile              | Percentage Difference of Means (%) | 95% Confidence Interval of the Difference (upper; lower) | Interpretation                            |
| ---------------- | ------------ | ---------------------- | -------------------------- | ---------------------------------- | -------------------------------------------------------- | ----------------------------------------- |
| ARC-Challenge    | acc ↑        | 0.1997 ± 0.0117        | **0.2125 ± 0.0120**        | 6.4096                         | (0.0456; -0.0200)            | Difference not significant                |
| MMLU             | acc ↑        | 0.2299 ± 0.0035        | **0.2699 ± 0.0037**        | 17.3989                        | (0.0500; 0.0300)             | MiniPile-trained better                   |
| HellaSwag        | acc ↑        | **0.2903 ± 0.0045**    | 0.2560 ± 0.0044            | -11.8154                       | (-0.0220; -0.0466)           | Pile Deduplicated-trained better          |
| WinoGrande       | acc ↑        | **0.4964 ± 0.0141**    | 0.4720 ± 0.0140            | -4.9154                        | (0.0145; -0.0633)            | Difference not significant                |
| Lambada (OpenAI) | acc ↑        | **0.3689 ± 0.0067**    | 0.0000 ± 0.0000            | -100.00                        | (-0.3558; -0.3820)           | Pile Deduplicated-trained severely better |
| Lambada (OpenAI) | perplexity ↓ | **31.2590 ± 1.1594**   | 3138574.4432 ± 302684.4607 | 10040446.5408                  | (3731804.7272; 2545281.6412) | Pile Deduplicated-trained severely better |
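
The table's derived columns can be reproduced with a few lines of Python. This is a minimal sketch, assuming the two runs are treated as independent samples and a normal approximation (z = 1.96); the spreadsheet's exact formulas are not shown here, so the function name `compare` and its return shape are my own. Note that the table lists the interval's upper bound first, whereas the sketch returns (lower, upper).

```python
import math


def compare(baseline, candidate, z=1.96):
    """Compare two (mean, stderr) pairs.

    Returns the percentage difference of means relative to the baseline,
    the 95% confidence interval of the difference (lower, upper), and
    whether that interval excludes zero.
    """
    (m0, s0), (m1, s1) = baseline, candidate
    diff = m1 - m0
    pct = 100.0 * diff / m0
    # Standard error of a difference of two independent estimates
    half_width = z * math.sqrt(s0 ** 2 + s1 ** 2)
    lower, upper = diff - half_width, diff + half_width
    significant = lower > 0 or upper < 0
    return pct, (lower, upper), significant


# ARC-Challenge acc: Pile Deduplicated vs. MiniPile (values from the table)
arc_pct, arc_ci, arc_significant = compare((0.1997, 0.0117), (0.2125, 0.0120))

# MMLU acc: Pile Deduplicated vs. MiniPile (values from the table)
mmlu_pct, mmlu_ci, mmlu_significant = compare((0.2299, 0.0035), (0.2699, 0.0037))

print(f"ARC-Challenge: {arc_pct:.4f}% CI=({arc_ci[0]:.4f}, {arc_ci[1]:.4f}) significant={arc_significant}")
print(f"MMLU:          {mmlu_pct:.4f}% CI=({mmlu_ci[0]:.4f}, {mmlu_ci[1]:.4f}) significant={mmlu_significant}")
```

A difference is read as "not significant" when the interval contains zero, which matches the Interpretation column for ARC-Challenge and WinoGrande. Because both models are scored on the same test items, the independence assumption is an approximation.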

- Training pipeline replication seems successful, as the MiniPile-trained model displays competitive performance on multiple benchmarks.
    - Performance significantly better on MMLU
    - Comparable performance on ARC-Challenge, WinoGrande
    - Inferior performance on HellaSwag
    - Severely inferior performance on Lambada (OpenAI)

I interpret the results for the Lambada benchmark as follows:
- MiniPile seems to lack crucial linguistic signal that The Pile Deduplicated evidently provides
- Given the extremely high perplexity, the general linguistic understanding learned from MiniPile seems to be severely lacking

Therefore, while MiniPile can enable competitive performance on some tasks with roughly two orders of magnitude less data (about 6 GB versus 825 GB for The Pile, per the MiniPile paper), it seems to break down at capturing the more general aspects of language needed for benchmarks that involve next-token prediction in context-rich settings.

**In other words:** There indeed seems to be no free lunch.
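
One caveat may be worth checking before relying on the Lambada numbers: the load log for the MiniPile-trained model reports `embed_out.weight` as "newly initialized". If the output projection really was re-initialized at load time, next-token predictions would be close to random regardless of the training data. A minimal sketch of such a check follows, assuming the parameter names the model expects and the names stored in the checkpoint are available as plain lists; the helper `missing_parameters` and the short key lists are illustrative stand-ins, not the real contents of the checkpoint.

```python
def missing_parameters(expected_keys, checkpoint_keys):
    """Return parameter names the model expects but the checkpoint lacks."""
    return sorted(set(expected_keys) - set(checkpoint_keys))


# In the notebook, the two key sets could come from, for example:
#   expected_keys   = pythia_minipile.state_dict().keys()
#   with safetensors.safe_open(path, framework="pt") as f:
#       checkpoint_keys = list(f.keys())
# The lists below are a tiny hand-written stand-in so the sketch runs on its own.
expected_keys = [
    "gpt_neox.embed_in.weight",
    "gpt_neox.final_layer_norm.weight",
    "embed_out.weight",
]
checkpoint_keys = [
    "gpt_neox.embed_in.weight",
    "gpt_neox.final_layer_norm.weight",
]

missing = missing_parameters(expected_keys, checkpoint_keys)
print("Missing from checkpoint:", missing)
```

If the real check comes back empty, the interpretation above stands as written; if `embed_out.weight` is missing, the Lambada results would mainly reflect the loading step, not MiniPile.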