# Recreating and Improving MiniPile Dataset Creation

**Objectives:**
- [.] Implement and verify MiniPile’s filtering pipeline according to [Kaddour (2023)](https://arxiv.org/abs/2304.08442), but intended for decoder-only model use
- [] Evaluate and compare performances of Pythia $160\text{M}$ pretrained on The Pile vs. trained on the *newly, self-created MiniPile* on MMLU and ARC-Challenge
- [] Improve the dataset creation process, create new SuperMiniPile dataset (ideally smaller and more information-retaining)
- [] Evaluate Pythia $160\text{M}$ on SuperMiniPile on MMLU and ARC-Challenge
- [] Evaluate and compare performances of Pythia $1.4\text{B}$ pretrained on The Pile vs. trained on SuperMiniPile on the MMLU and ARC benchmarks

In [None]:
#! pip install sentence-transformers

In [1]:
import os
import torch
import numpy as np
from tqdm import tqdm
from pathlib import Path
from datasets import load_dataset
from huggingface_hub import snapshot_download
from sentence_transformers import SentenceTransformer

In [2]:
base_dir = "/mnt/data"
base_path = Path(base_dir)

def download_model(down_dir: str, target_folder: str, cache_folder: str, repo_id: str, branch: str = "main") -> None:
    down_dir = Path(down_dir)
    target_dir = down_dir / target_folder
    cache_dir = down_dir / cache_folder

    os.makedirs(target_dir, exist_ok=True)
    os.makedirs(cache_dir, exist_ok=True)

    print(f"Downloading {repo_id}/{branch}...")

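    # Retry a bounded number of times instead of looping forever on a persistent error
    max_attempts = 5
    for attempt in range(1, max_attempts + 1):
        try:
            snapshot_download(
                repo_id,
                repo_type="model",
                revision=branch,
                cache_dir=str(cache_dir),
                local_dir=str(target_dir)
            )
            break
        except Exception as e:
            print(f"Download attempt {attempt}/{max_attempts} failed: {e}")
            if attempt == max_attempts:
                raise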

---

## Recreating The MiniPile Dataset Creation Pipeline

(1) document embedding extraction,<br>
(2) clustering of embeddings, and<br>
(3) human-guided exclusion of unwanted clusters

- 22 data subset sources
- 5.91 KiB mean document size (before deduplication)

### Document Embedding Extraction

- The MiniPile paper uses the term "document". Since the entries are quite large, I assume this refers to the individual training examples of "The Pile Deduplicated"
- "The Pile Deduplicated" contains predominantly English text, as stated in the Pile paper
- `E5-Large` does not require sentence-splitting beforehand; I was misled by the example code at https://huggingface.co/intfloat/e5-large
- I will use `E5-Large` and treat each "document" as one "sentence" for MiniPile

In [3]:
# Starting point is the deduplicated The Pile
# Infer embeddings for all documents using E5-Large

# https://huggingface.co/intfloat/e5-large-v2
download_model(down_dir=base_dir, target_folder="e5-large-v2", 
               cache_folder="e5-large-v2_Cache",
               repo_id="intfloat/e5-large-v2") # Chose this because nothing beyond E5-Large was specified

e5_large = SentenceTransformer(str(base_path / "e5-large-v2"), local_files_only=True) # no .from_pretrained() here
e5_large = e5_large.eval()

# https://huggingface.co/datasets/EleutherAI/the_pile_deduplicated
pile_dedup = load_dataset("parquet",
                          data_files={
                              "train": str(base_path / "Pile_Deduplicated" / "data" / "train-*.parquet"),
                          },
                          cache_dir=str(base_path / "MiniPile_Cache"),
                          split="train",
                          streaming=True)

Downloading intfloat/e5-large-v2/main...


Fetching 13 files:   0%|          | 0/13 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/1650 [00:00<?, ?it/s]

With the model and the local stream of The Pile in place, we iterate through the dataset and extract an embedding for each document.<br>
For convenience and later processing, we assemble the results into an embedding dataset.

When creating the embedding dataset, the embedding indices must match the document indices in the original dataset.<br>
This is strictly necessary for the filtering step later on.<br>
To verify that the resulting embedding dataset is correctly aligned with the original dataset, I ran the following check on a small subset of `16384` documents/embeddings:

In [9]:
# Catch and collect mismatches
misfits = 0

embd_dir = base_path / "Pile_Deduplicated_Embedded"
pile_dedup_embd = load_dataset(
        "parquet",
        data_files=str(embd_dir / "shard_*.parquet"),
        split="train"
    )

for index in tqdm(range(len(pile_dedup_embd))):
    original_text = pile_dedup_embd[index]['text']
    # Load saved embedding as numpy float32 array
    saved_embedding = pile_dedup_embd[index]['embedding']
    # Reference embedding is freshly generated for comparison
    with torch.no_grad():
        prefixed_text = "query: " + original_text
        generated_embedding = e5_large.encode(
            prefixed_text,
            show_progress_bar=False,
            convert_to_numpy=True,
            normalize_embeddings=True
        ).astype(np.float32)

    # This e5_large instance could not be run in fp16 mode here,
    # so we compare within fp16-level tolerance even though the values themselves are more precise
    if not np.allclose(generated_embedding, saved_embedding, rtol=1e-3, atol=1e-3):
        print(f"\nMismatch found at index: {index}")
        misfits += 1

print(f"\nFound {misfits} mismatches within fp16 tolerance.")

Generating train split: 0 examples [00:00, ? examples/s]

  0%|          | 31/327680 [00:10<32:12:25,  2.83it/s]


KeyboardInterrupt: 

No mismatches were found, which means we can scale the embedding set creation to the full dataset.

Embedding "The Pile Deduplicated" with `E5-Large` will foreseeably require *a lot* of time and resources.<br>
The raw embeddings will likely be needed at multiple stages: for recreating the MiniPile dataset, and as a (potential) basis for the SuperMiniPile improvement.<br>
A combined embedding and clustering phase would be more resource-efficient, but an error during it could complicate recovery.<br>
I therefore see more flexibility in creating the embedding dataset as its own step while a separate process uses the intermediate results for clustering as they become available.<br>
This way, we can get to the human-filtering step as soon as possible.

The code I verified with the snippet above and ultimately used to build the `Pile_Deduplicated_Embedded` dataset can be found in the `03_embed_pile_dedup.py` script.

### Clustering of Embeddings

- "Batchified" $k$-means clustering is a term used only in the MiniPile paper; I take it to mean **mini-batch $k$-means clustering**
- Cosine distance between normalized embeddings
- Cluster count of $k=220$ ($10$ clusters per source)
- Batch size $16384$ (I checked once again: yes, MiniPile uses "The Pile Deduplicated"; this is slowly getting to me)
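
A quick numerical check of why normalization matters for the distance choice above: between unit-norm vectors, squared Euclidean distance is exactly twice the cosine distance. The sketch below uses random stand-in vectors (the shapes and seed are made up for illustration, not taken from the pipeline). Note that the identity holds between unit vectors only; k-means centroids are means and generally not unit-norm, so Euclidean mini-batch k-means is a close relative of cosine clustering rather than an exact equivalent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for embeddings: 8 random vectors of dimension 16, L2-normalized
x = rng.normal(size=(8, 16))
x /= np.linalg.norm(x, axis=1, keepdims=True)

# Pairwise cosine distance and pairwise squared Euclidean distance
cos_dist = 1.0 - x @ x.T
sq_euclid = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)

# For unit vectors: ||a - b||^2 = 2 - 2 * (a . b) = 2 * cosine_distance(a, b)
identity_holds = np.allclose(sq_euclid, 2.0 * cos_dist)
print(identity_holds)
```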

I implemented the code below to mirror the paper as closely as possible.<br>
Then I ran a test with a dummy dataset of $327680$ embeddings, investigating potential issues with
- Memory,
    - Memory usage remained relatively consistent
- Performance and time,
    - The notebook version took $12:54$ minutes to process the dummy dataset
    - The script version took $21:40$ minutes to process the dummy dataset in separate fitting and prediction stages
- Result layout and richness,
    - Need to be able to make informed decisions about which clusters to exclude, which is why `cluster_info_for_inspection.json` exists
    - Need to then efficiently 'mask' the original embedding dataset, which is why `cluster_masked_embeddings.json` exists to skip over the entries of unwanted clusters
    - Need to also efficiently pick out the most 'interesting' individual examples, which is why `cluster_masked_embeddings.json` contains distance info too
- Using the dataset in a streaming fashion.
    - Works.
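
To make the masking idea concrete, here is a minimal sketch of how per-example clustering records could be turned into a list of document indices to keep. It assumes one JSON record per line with `idx`, `cluster` and `distance` fields; the helper name `kept_indices` and the example records are hypothetical and only illustrate the idea.

```python
import json

def kept_indices(jsonl_lines, excluded_clusters):
    """Return document indices whose cluster is not in excluded_clusters."""
    excluded = set(excluded_clusters)
    kept = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record["cluster"] not in excluded:
            kept.append(record["idx"])
    return kept

# Made-up records in the assumed per-example layout
example_lines = [
    json.dumps({"idx": 0, "cluster": 3, "distance": 0.12}),
    json.dumps({"idx": 1, "cluster": 7, "distance": 0.40}),
    json.dumps({"idx": 2, "cluster": 3, "distance": 0.05}),
    json.dumps({"idx": 3, "cluster": 1, "distance": 0.33}),
]
kept = kept_indices(example_lines, excluded_clusters={3})
print(kept)
```

Because the records are read line by line, the same function can be fed an open file handle, so the full result file never has to be held in memory.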

In [10]:
import json
import numpy as np
from tqdm import tqdm
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics.pairwise import cosine_distances
from datasets import load_dataset
from pathlib import Path
from collections import defaultdict

base_path = Path("/mnt/data")
embd_dir = base_path / "Pile_Deduplicated_Embedded"
cluster_dir = base_path / "MiniPile_BatchKMeans"
cluster_dir.mkdir(exist_ok=True)

k_clusters = 220
batch_size = 16384
n_init = 3 # default as nothing else is specified

class CosineMiniBatchKMeans(MiniBatchKMeans):
    def _transform(self, X):
        # Makes transform() report cosine distances to the centroids.
        # Note: partial_fit() and predict() still assign clusters by Euclidean distance.
        return cosine_distances(X, self.cluster_centers_)

batchified_kmeans = CosineMiniBatchKMeans(n_clusters=k_clusters, 
                                          batch_size=batch_size, 
                                          init='k-means++', 
                                          n_init=n_init, 
                                          random_state=42)

# JSONL (one JSON object per line) is something I've seen Hugging Face use for storing results
# (https://www.atatus.com/glossary/jsonl/); its structure seems to save some disk space
cluster_info_path = cluster_dir / "cluster_info_for_inspection.json"
clustering_results_path = cluster_dir / "clustering_results.jsonl"
cluster_centers_path = cluster_dir / "cluster_centers.npy"

# Cosine distance of each embedding to its assigned (latest) centroid.
# Centroids are means of unit vectors and generally not unit-norm, so normalize both sides.
def compute_distances(embeddings, centroids, labels):
    assigned = centroids[labels]
    dots = np.sum(embeddings * assigned, axis=1)
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(assigned, axis=1)
    return 1 - dots / norms

# Get closest and farthest examples for a cluster
def get_extreme_examples(embeddings, labels, centroids, texts, n=5):
    # Row-wise distances; avoids the N x N matrix that cosine_distances(...).diagonal() would build
    distances = compute_distances(embeddings, centroids, labels)
    sorted_indices = np.argsort(distances)
    closest_indices = sorted_indices[:n]   # n closest
    farthest_indices = sorted_indices[-n:] # n farthest
    return ([{"text": texts[idx], "distance": float(distances[idx])} for idx in closest_indices],
            [{"text": texts[idx], "distance": float(distances[idx])} for idx in farthest_indices])

# Cluster tracking infos
cluster_info = {
    i: {
        'closest': [],
        'farthest': [],
        'total_examples': 0,
        'average_distance': 0.0,
        'sum_distance': 0.0
    } for i in range(k_clusters)
}

total_processed = 0  # Track number of processed examples

# Process the dataset
cluster_info_temp = defaultdict(lambda: {'closest': [], 'farthest': [], 'total_examples': 0, 'sum_distance': 0.0})

with tqdm(total=None, desc="Processing Batches") as pbar, open(clustering_results_path, 'w') as results_file:
    # Iterate over the embedded dataset (the raw Pile stream has no 'embedding' column)
    for batch in pile_dedup_embd.iter(batch_size=batch_size):
        embeddings = np.array(batch['embedding'])
        texts = batch['text']
        
        # Predict clusters and update centroids
        labels = batchified_kmeans.partial_fit(embeddings).predict(embeddings)
        distances = compute_distances(embeddings, batchified_kmeans.cluster_centers_, labels)
        
        # Write clustering results for each example
        for idx, (text, label, distance) in enumerate(zip(texts, labels, distances)):
            result = {
                'idx': total_processed + idx,
                'cluster': int(label),
                'distance': float(distance)
            }
            results_file.write(json.dumps(result) + '\n')
            
            # Update temporary cluster info
            cluster = int(label)
            cluster_info_temp[cluster]['total_examples'] += 1
            cluster_info_temp[cluster]['sum_distance'] += distance
            cluster_info_temp[cluster]['closest'].append({'text': text, 'distance': distance})
            cluster_info_temp[cluster]['farthest'].append({'text': text, 'distance': distance})
        
        total_processed += len(texts)
        pbar.update(len(texts))

        # Periodically update cluster_info
        if total_processed % (128 * batch_size) == 0:
            for cluster, info in cluster_info_temp.items():
                cluster_info[cluster]['total_examples'] += info['total_examples']
                cluster_info[cluster]['sum_distance'] += info['sum_distance']
                cluster_info[cluster]['closest'].extend(info['closest'])
                cluster_info[cluster]['farthest'].extend(info['farthest'])
                cluster_info[cluster]['closest'] = sorted(cluster_info[cluster]['closest'], key=lambda x: x['distance'])[:5]
                cluster_info[cluster]['farthest'] = sorted(cluster_info[cluster]['farthest'], key=lambda x: x['distance'], reverse=True)[:5]
            cluster_info_temp.clear()

# Final update with remaining temp info
for cluster, info in cluster_info_temp.items():
    cluster_info[cluster]['total_examples'] += info['total_examples']
    cluster_info[cluster]['sum_distance'] += info['sum_distance']
    cluster_info[cluster]['closest'].extend(info['closest'])
    cluster_info[cluster]['farthest'].extend(info['farthest'])

for cluster in cluster_info:
    if cluster_info[cluster]['total_examples'] > 0:
        cluster_info[cluster]['average_distance'] = cluster_info[cluster]['sum_distance'] / cluster_info[cluster]['total_examples']
    cluster_info[cluster]['closest'] = sorted(cluster_info[cluster]['closest'], key=lambda x: x['distance'])[:5]
    cluster_info[cluster]['farthest'] = sorted(cluster_info[cluster]['farthest'], key=lambda x: x['distance'], reverse=True)[:5]

# Save cluster information and centroids
with open(cluster_info_path, 'w') as f:
    json.dump(cluster_info, f, indent=2)
np.save(cluster_centers_path, batchified_kmeans.cluster_centers_)

print("Clustering completed.")

Processing Batches: 327680it [12:52, 424.26it/s]


Clustering completed.


I then reworked this to run concurrently with the embedding task, which is already running at this point: `partial_fit` is called as new parquet files become available, and as soon as embedding ends, clustering can start. Note that the two stages are separated here, which is not the case above, but that was fine for testing.
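
The "pick up parquet files as they become available" part can be sketched as follows. This is an illustration under assumptions, not the script itself: the helper `find_new_shards` is hypothetical, and the zero-padded `shard_*.parquet` names are assumed so that sorting by name matches creation order.

```python
import tempfile
from pathlib import Path

def find_new_shards(shard_dir, seen):
    """Return shard files not processed yet, sorted by name, and mark them as seen."""
    new = sorted(p for p in Path(shard_dir).glob("shard_*.parquet") if p.name not in seen)
    seen.update(p.name for p in new)
    return new

# Simulate an embedding job that writes shards over time (empty placeholder files)
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    seen = set()
    (tmp_path / "shard_000000.parquet").touch()
    (tmp_path / "shard_000001.parquet").touch()
    first = [p.name for p in find_new_shards(tmp_path, seen)]
    (tmp_path / "shard_000002.parquet").touch()
    second = [p.name for p in find_new_shards(tmp_path, seen)]
    third = [p.name for p in find_new_shards(tmp_path, seen)]

print(first, second, third)
```

Each returned shard would then be loaded and passed to `partial_fit`. One caveat with this kind of polling is that a shard may be picked up while it is still being written; writing to a temporary name and renaming on completion is a common way to avoid that.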

The script I implemented and ultimately ran is `03_embed_pile_dedup.py`.

---

## Evaluate Pythia $160\text{M}$ Pile vs. Pythia $160\text{M}$ MiniPile (self-created)

In [None]:
import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer

from lm_eval import utils, simple_evaluate
from lm_eval.models.huggingface import HFLM

seed = 42
torch.manual_seed(seed)
np.random.seed(seed)

if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)
    device_count = torch.cuda.device_count()
    print(f"Available GPUs: {device_count}")
    for i in range(device_count):
        device = torch.device(f'cuda:{i}')
        device_properties = torch.cuda.get_device_properties(device)
        total_mem = device_properties.total_memory / (1024 ** 3)
        allocd_mem = torch.cuda.memory_allocated(device) / (1024 ** 3)
        free_mem = total_mem - allocd_mem
        print(f"\nGPU {i}:\t{device_properties.name}")
        print(f"\tTotal memory:\t\t{total_mem:.2f} GiB")
        print(f"\tAllocated memory:\t{allocd_mem:5.2f} GiB")
        print(f"\tFree memory:\t\t{free_mem:.2f} GiB")
else:
    print("No CUDA-capable GPUs available")
    device = torch.device("cpu")  # Fall back so the later .to(device) call still works

## Evaluation - Pythia 160M Trained on self-created MiniPile

pythia_minipile_self = AutoModelForCausalLM.from_pretrained(base_path / "pythia160m_minipile_self_trained", local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained(base_path / "pythia160m_dedup_untrained", use_fast=True, local_files_only=True) # Use exact same tokenizer
pythia_minipile_self = pythia_minipile_self.to(device)
 
batch_size_hflm = 1

pythia_minipile_hflm = HFLM(pretrained=pythia_minipile_self,
                        tokenizer=tokenizer,
                        batch_size=batch_size_hflm)

results = simple_evaluate(model=pythia_minipile_hflm,
                          tasks=["arc_challenge", "mmlu", "winogrande", "hellaswag", "lambada", "blimp"],
                          num_fewshot=0,
                          batch_size=batch_size_hflm,
                          device="cuda",
                          limit=None)

with open('03_eval_160M_minipile_self.txt', 'w') as f:
    f.write(str(results))

print(utils.make_table(results))

---

## Improve the Dataset Creation Process

Ideas:

- I won't touch the embedding part; it is necessary and works well (maybe tinker with BGE-M3)
- Coverage-centered selection of documents for the SuperMiniPile dataset (larger clusters represented by more documents, smaller ones by fewer)
- Calculate an "importance value" for random examples in each (post-filter) cluster, with those examples ideally distributed across the cluster
- "Findings indicate that it is not the proportion of tokens occupied by high-utility data that aids acquisition, but rather the proportion of training steps assigned to such data" [On the effect of curriculum learning with developmental data for grammar acquisition (Opper, et al. 2023)](https://aclanthology.org/2023.conll-babylm.pdf)
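
As a first sketch of the coverage-centered idea, a fixed document budget could be split across the remaining clusters in proportion to their size. The function name `allocate_budget`, the largest-remainder rounding and the example sizes are all assumptions for illustration; nothing here is decided yet.

```python
import numpy as np

def allocate_budget(cluster_sizes, budget):
    """Split `budget` documents across clusters in proportion to cluster size."""
    sizes = np.asarray(cluster_sizes, dtype=np.float64)
    exact = budget * sizes / sizes.sum()
    counts = np.floor(exact).astype(int)
    # Hand out the documents lost to rounding down, largest fractional part first
    remainder = budget - counts.sum()
    order = np.argsort(-(exact - counts), kind="stable")
    counts[order[:remainder]] += 1
    # Never request more documents than a cluster contains (the total may then fall below budget)
    return np.minimum(counts, np.asarray(cluster_sizes, dtype=int))

# Made-up cluster sizes and budget
sizes = [500, 300, 150, 50]
allocation = allocate_budget(sizes, budget=100)
print(allocation)
```

Which documents to draw within each cluster is a separate question; the "importance value" idea above would slot in at that point.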

- https://openreview.net/pdf?id=7D5EECbOaf9
- https://arxiv.org/pdf/2402.09668
- https://arxiv.org/pdf/2406.03057
- https://arxiv.org/pdf/2210.15809
- https://arxiv.org/pdf/2204.08499
- https://arxiv.org/pdf/2303.09540
- https://arxiv.org/pdf/2308.12284

---

## Evaluate Pythia $160\text{M}$ SuperMiniPile

---

## Evaluate Pythia $1.4\text{B}$ Pretrained vs. Pythia $1.4\text{B}$ SuperMiniPile