# Recreating and Improving MiniPile Dataset Creation

**Objectives:**
- [x] Implement and verify MiniPile’s filtering pipeline according to [Kaddour (2023)](https://arxiv.org/abs/2304.08442), but intended for decoder-only model use
- [x] Evaluate and compare performances of Pythia $160\text{M}$ pretrained on The Pile vs. trained on the *newly, self-created MiniPile* on MMLU and ARC-Challenge
- [.] Evaluate and compare performances of Pythia $1.4\text{B}$ pretrained on The Pile vs. trained on the *newly, self-created MiniPile* on MMLU and ARC-Challenge
- [.] Improve the dataset creation process, create new SuperMiniPile dataset (ideally smaller and more information-retaining)
- [ ] Evaluate Pythia $160\text{M}$ on SuperMiniPile on MMLU and ARC-Challenge
- [ ] Evaluate and compare performances of Pythia $1.4\text{B}$ pretrained on The Pile vs. trained on SuperMiniPile on the MMLU and ARC benchmarks

In [None]:
#! pip install sentence-transformers

In [5]:
import os
import torch
import numpy as np
from tqdm import tqdm
from pathlib import Path
from datasets import load_dataset
from huggingface_hub import snapshot_download
from sentence_transformers import SentenceTransformer

base_dir = "/vol/tmp/koppelmm"
base_path = Path(base_dir)

In [2]:
def download_model(down_dir: str, target_folder: str, cache_folder: str, repo_id: str, branch: str = "main") -> None:
    down_dir = Path(down_dir)
    target_dir = down_dir / target_folder
    cache_dir = down_dir / cache_folder

    os.makedirs(target_dir, exist_ok=True)
    os.makedirs(cache_dir, exist_ok=True)

    print(f"Downloading {repo_id}/{branch}...")

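    max_attempts = 5  # bounded retries, so a persistent failure cannot loop forever
    for attempt in range(1, max_attempts + 1):
        try:
            snapshot_download(
                repo_id,
                repo_type="model",
                revision=branch,
                cache_dir=str(cache_dir),
                local_dir=str(target_dir)
            )
            break
        except Exception as e:
            print(f"Download attempt {attempt}/{max_attempts} failed: {e}")
    else:
        raise RuntimeError(f"Could not download {repo_id}/{branch} after {max_attempts} attempts")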

---

## Recreating The MiniPile Dataset Creation Pipeline

(1) document embedding extraction,<br>
(2) clustering of embeddings,<br>
(3) human-guided exclusion of unwanted clusters, and<br>
(4) MiniPile distillation

- 22 data subset sources
- 5.91 KiB mean document size (before deduplication)

### Document Embedding Extraction

- The MiniPile paper uses the term "document" for individual training examples from "The Pile Deduplicated".
- "The Pile Deduplicated" predominantly contains English text, as stated in the Pile paper.
- `E5-Large` does not require sentence-splitting beforehand; I was misguided by the example code at https://huggingface.co/intfloat/e5-large
- `E5-Large` scales poorly to the dataset size under the conditions imposed by the HU Berlin cluster, so I use `E5-Base-4k` instead.
- `E5-Base-4k` performs slightly worse than `E5-Large`, but has roughly half the parameter count and is therefore more efficient to use.
- We attempt to mitigate the reported/expected performance loss by using a larger text window of $1024$ tokens instead of the $512$ tokens used by `E5-Large` by default.
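
As a minimal sketch of what "extracting an embedding" amounts to for E5-style encoders: the token vectors of a (truncated) document are averaged over the non-padding positions and then L2-normalized, so that cosine distance can be used later on. The function name and the toy arrays below are illustrative assumptions; they are not taken from `03_embed_pile_dedup.py`, which may delegate this step to the embedding library.

```python
import numpy as np

def mean_pool_and_normalize(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors over non-padding positions, then L2-normalize.

    token_embeddings: (batch, seq_len, dim) hidden states from the encoder.
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding.
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against empty rows
    pooled = summed / counts
    norms = np.clip(np.linalg.norm(pooled, axis=1, keepdims=True), 1e-12, None)
    return pooled / norms

# Toy batch: 2 documents, 3 token positions, 2 embedding dimensions.
hidden = np.array([[[3.0, 0.0], [0.0, 4.0], [9.0, 9.0]],
                   [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]])
mask = np.array([[1, 1, 0],   # third position is padding and is ignored
                 [1, 1, 1]])
doc_embeddings = mean_pool_and_normalize(hidden, mask)
print(doc_embeddings)
```

Every row of `doc_embeddings` has unit length, which is the property the clustering step relies on.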

Given the smaller model and The Pile, we iterate through the dataset and extract the embedding for each document.<br>
The script I initially implemented for the embedding step is `03_embed_pile_dedup.py`.<br>
The approach laid out therein worked conceptually, but it had to be thoroughly memory-optimized to run for as long as needed for our Pile dataset.<br>
The optimized script I ultimately ran for this step is `03_embed_pile_dedup_turbo.py`.

The embedding step produces as artifact a copy of the original dataset with the embeddings added as a column.<br>
The embedding process is resumable, results are persisted in multiple parquet files, one after another, in the folder `Pile_Deduplicated_Embd`.<br>
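
The resumable, shard-by-shard persistence can be sketched as follows. This is a simplified stand-in, not the logic of `03_embed_pile_dedup_turbo.py`: it writes JSON files instead of parquet to stay dependency-free, and `embed_in_shards`, the shard naming scheme and `fake_embed` are hypothetical names chosen for illustration.

```python
import json
import tempfile
from pathlib import Path

def embed_in_shards(documents, embed_fn, out_dir: Path, shard_size: int) -> int:
    """Write one shard per chunk; chunks whose shard already exists are skipped."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    for start in range(0, len(documents), shard_size):
        shard_path = out_dir / f"shard_{start // shard_size:09d}.json"
        if shard_path.exists():
            continue  # already finished in an earlier run
        chunk = documents[start:start + shard_size]
        # Text and embedding are stored side by side, so alignment never depends on order.
        rows = [{"text": text, "embedding": embed_fn(text)} for text in chunk]
        tmp_path = shard_path.with_suffix(".tmp")
        tmp_path.write_text(json.dumps(rows))
        tmp_path.replace(shard_path)  # rename last: a shard is either complete or absent
        written += 1
    return written

docs = [f"document {i}" for i in range(10)]
fake_embed = lambda text: [float(len(text)), 0.0]  # stand-in for the real encoder

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "Pile_Deduplicated_Embd"
    first_run = embed_in_shards(docs[:6], fake_embed, out, shard_size=3)  # "interrupted" run
    second_run = embed_in_shards(docs, fake_embed, out, shard_size=3)     # resumed run
    shard_names = sorted(p.name for p in out.iterdir())

print(first_run, second_run, shard_names)
```

Writing to a temporary file and renaming it afterwards means an interrupted run never leaves a half-written shard behind that a later run would mistake for a finished one.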

**Note that I intended to upload this embedded version of the Pile to HuggingFace for strict reproducibility.**<br>
**This idea was cut short by a change in HuggingFace's pricing policy, effective January 2025, prohibiting the free sharing of datasets >500GB - a threshold which this new dataset crosses (801 GB) due to the added embeddings.**

Furthermore, note that the embedding step, through its parallel processing, is in no way guaranteed to maintain the original order of the documents.<br>
In fact, this is why I elected to build the dataset copy with the embeddings as a column in the first place: it ensures the correct alignment of the embeddings with the documents while still allowing parallel processing, and time was the deciding factor.<br>
The shuffling is therefore not a problem as such, but we have to base the clustering and therefore all following processes on this new, shuffled dataset.

In other words, all artifacts produced after the embedding step will relate not to the original dataset, but the embedded dataset, e.g. when referring to entries by index.

### Clustering of Embeddings

- "Batchified $k$-means clustering" is a term only used in the MiniPile paper; it must stand for **mini-batch $k$-means clustering**
- Cosine distance between normalized embeddings
- Cluster Count of $k=220$ ($10$ clusters per source)
- Batch size $16384$
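
To make the combination of mini-batch $k$-means and cosine distance concrete, here is a small NumPy sketch of one partial-fit step with per-centroid learning rates. It is an illustration under assumptions, not the code of `03_cluster_pile_embed.py`: the function names, the toy data and the tiny values of `k` and `batch_size` are all hypothetical.

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.clip(np.linalg.norm(x, axis=1, keepdims=True), 1e-12, None)

def cosine_assign(centroids: np.ndarray, points: np.ndarray):
    """Return the nearest centroid per point and the cosine distance to it."""
    # For unit-length vectors, cosine distance is 1 - dot product.
    distances = 1.0 - normalize(points) @ normalize(centroids).T
    labels = distances.argmin(axis=1)
    return labels, distances[np.arange(len(points)), labels]

def partial_fit(centroids: np.ndarray, counts: np.ndarray, batch: np.ndarray) -> None:
    """One mini-batch step: move each assigned centroid towards its new points (in place)."""
    batch = normalize(batch)
    labels, _ = cosine_assign(centroids, batch)
    for x, label in zip(batch, labels):
        counts[label] += 1
        lr = 1.0 / counts[label]  # per-centroid learning rate shrinks over time
        centroids[label] = (1.0 - lr) * centroids[label] + lr * x

rng = np.random.default_rng(0)
blob_a = rng.normal([1.0, 0.0], 0.05, size=(200, 2))
blob_b = rng.normal([0.0, 1.0], 0.05, size=(200, 2))
data = rng.permutation(np.vstack([blob_a, blob_b]))

k, batch_size = 2, 64  # the pipeline above uses k=220 and a batch size of 16384
centroids = normalize(np.array([blob_a[0], blob_b[0]]))  # one seed point per blob
counts = np.zeros(k)
for start in range(0, len(data), batch_size):
    partial_fit(centroids, counts, data[start:start + batch_size])

labels_a, dist_a = cosine_assign(centroids, blob_a)
labels_b, dist_b = cosine_assign(centroids, blob_b)
print(centroids, dist_a.mean(), dist_b.mean())
```

Because every step only needs the current batch, fitting can start while embeddings are still being produced.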

Architecturally, I built the clustering step to be fully independent of the embedding step, so as to be able to run the partial fitting concurrently with the latter, saving ~4 days of processing time total.

The clustering step is implemented in `03_cluster_pile_embed.py`. As soon as the embedding step finishes, a text file is produced, signaling the clustering step to conclude model fitting and start predicting. The centroids are saved and can be found in `MiniPile_BatchKMeans/cluster_centers.npy`. Intermediary centroid results have been omitted from this repository.

Each embedding from the newly created dataset is assigned to one of the $220$ clusters. This cluster information as well as the distance of the data point to the centroid are stored in JSONL files, with the entries in order of appearance in the embedded Pile dataset.<br>
Each clustering result looks like this: `{"idx": 0, "cluster": 5, "distance": 0.20949329195756128}`.<br>
Each Pile-Embedded document is referred to only by its index, crunching the cluster results' memory requirements down.

Beyond the JSONL files containing the cluster assignments, the clustering step produces a verification file `MiniPile_BatchKMeans/cluster_results_metadata.json` to indicate whether all chunks and all data points therein have been processed and how results have been saved. We can see from this file that the original dataset size of $134,318,121$ documents has been captured and thus processed, lending more credibility to the clustering results.

Additionally, the clustering step produces a file `MiniPile_BatchKMeans/cluster_info_for_inspection.json`.<br>
Per cluster index, this file contains the following information:
- `closest`: The top 5 closest documents to the cluster centroid
    - `text`: The associated text excerpt (for memory reasons)
    - `distance`: The cosine distance to the cluster centroid
- `farthest`: The top 5 farthest documents from the cluster centroid (again, excerpts, in same format)
- `total_examples`: The number of documents assigned to this cluster
- `average_distance`: The average cosine distance of all documents to the cluster centroid
- `sum_distance`: The sum of all cosine distances of all documents to the cluster centroid

The latter three values are intended to help with the human-guided exclusion of unwanted clusters. Moreover, they may help in improving the dataset creation process later on, as they provide insight into the spread of the data.
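
The per-cluster summary described above could be derived from the assignment records roughly like this. The sketch assumes records in the `{"idx", "cluster", "distance"}` shape shown earlier and reports document indices instead of text excerpts; `summarize_clusters` is a hypothetical helper name.

```python
import heapq
from collections import defaultdict

def summarize_clusters(records, top_n=5):
    """Collect closest/farthest documents and distance statistics per cluster."""
    by_cluster = defaultdict(list)
    for rec in records:
        by_cluster[rec["cluster"]].append(rec)

    info = {}
    for cluster, recs in by_cluster.items():
        total = sum(r["distance"] for r in recs)
        info[cluster] = {
            "closest": heapq.nsmallest(top_n, recs, key=lambda r: r["distance"]),
            "farthest": heapq.nlargest(top_n, recs, key=lambda r: r["distance"]),
            "total_examples": len(recs),
            "average_distance": total / len(recs),
            "sum_distance": total,
        }
    return info

records = [
    {"idx": 0, "cluster": 5, "distance": 0.2},
    {"idx": 1, "cluster": 5, "distance": 0.4},
    {"idx": 2, "cluster": 7, "distance": 0.1},
]
cluster_info = summarize_clusters(records, top_n=1)
print(cluster_info[5])
```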

After these files were obtained, I ran `03_sort_pile_clusters.py` to save the cluster assignment entries in one file per cluster. This facilitates a more effective cluster exclusion process during the later MiniPile creation steps.
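
Regrouping the assignment entries into one file per cluster can be sketched in a few lines. This is an assumption-laden illustration rather than the content of `03_sort_pile_clusters.py`; the output file naming (`cluster_005.jsonl`) and the helper name are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def split_by_cluster(jsonl_paths, out_dir: Path) -> dict:
    """Stream assignment records and append each one to its cluster's file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    handles, counts = {}, {}
    try:
        for path in jsonl_paths:
            with open(path) as src:
                for line in src:
                    if not line.strip():
                        continue
                    rec = json.loads(line)
                    cluster = rec["cluster"]
                    if cluster not in handles:
                        # One handle per cluster stays open for the whole pass.
                        handles[cluster] = open(out_dir / f"cluster_{cluster:03d}.jsonl", "w")
                        counts[cluster] = 0
                    handles[cluster].write(json.dumps(rec) + "\n")
                    counts[cluster] += 1
    finally:
        for handle in handles.values():
            handle.close()
    return counts

records = [
    {"idx": 0, "cluster": 5, "distance": 0.20949329195756128},
    {"idx": 1, "cluster": 7, "distance": 0.31},
    {"idx": 2, "cluster": 5, "distance": 0.12},
]
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "assignments.jsonl"
    source.write_text("\n".join(json.dumps(r) for r in records) + "\n")
    cluster_counts = split_by_cluster([source], Path(tmp) / "sorted")
    cluster_5 = [json.loads(l) for l in (Path(tmp) / "sorted" / "cluster_005.jsonl").read_text().splitlines()]

print(cluster_counts, cluster_5)
```

Since the input is streamed in order, each per-cluster file keeps the entries in their order of appearance in the embedded dataset.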

For now, the clustering step is concluded, producing one JSONL file per cluster with information about each cluster assignment per document.<br>
This intermediary dataset can be found here: [https://huggingface.co/datasets/Marcus2112/pile_dedup_embeddings_clusters](https://huggingface.co/datasets/Marcus2112/pile_dedup_embeddings_clusters).

Note that while the entry count of the recreated MiniPile is exactly identical to that of the original, the size is not. Where the original MiniPile occupied $3.14$ GB, we now require $3.71$ GB.
Several factors play into this:
- We use a slightly different embedding model, which may cause (at least occasional) deviations in the embeddings and thus may lead to differently shaped clusters
- We select clusters by hand, which leaves the interpretation of the selection categories open. I thus may have selected $38$ clusters according to the paper's categories, but my understanding of enforcing them during selection may differ from the author's.
- The random sampling happened to select different examples per cluster.

Again, note that these indices are not per se applicable to the original dataset, but to the embedded dataset.

### Human-Guided Cluster Exclusion

At this point, especially due to the `MiniPile_BatchKMeans/cluster_info_for_inspection.json` file, we can start the human-guided exclusion of unwanted clusters.<br>
I strictly adhered to the paper and only sorted out clusters belonging to the categories laid out there, which I found to be well identifiable through the 10 examples per cluster.

The categories with the clusters I sorted out are as follows:
- Near Duplicates ($10, 15, 16, 22, 26, 28, 35, 37, 46, 51, 57, 64, 86, 87, 64, 102, 111, 114, 152, 163, 166, 218$)
- Pornography ($167$)
- Navigation Bars ($39, 88, 101, 155$)
- Product Specifications ($61, 200$)
- Long lists of named entities ($40, 44, 78, 90, 99, 103, 181, 196, 219$)

### MiniPile Distillation

With the cluster analysis concluded, we can now proceed to the distillation of the MiniPile dataset.<br>
This dataset touches on all the artifacts produced in the previous steps:<br>
- The embedded Pile dataset, from which we extract the documents
- The cluster assignments, from which we extract the documents assigned to the remaining clusters
- The cluster exclusion list, from which we exclude the documents assigned to the unwanted clusters

The distillation step is implemented in `03_distill_pile_embed.py`.<br>
The script is written to extract the correct documents as exactly and efficiently as possible (we have loads of I/O to perform), according to our random sampling across the remaining clusters.
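
The sampling logic itself, stripped of all I/O, can be sketched as equal-sized random draws from every non-excluded cluster. How the actual script distributes a remainder when the target does not divide evenly is not stated here, so the remainder handling below is an assumption, as are the function name, the seed and the toy data.

```python
import random

def uniform_cluster_sample(cluster_to_indices, excluded, total, seed=42):
    """Draw (almost) equally many document indices from each kept cluster."""
    rng = random.Random(seed)
    kept = sorted(c for c in cluster_to_indices if c not in excluded)
    base, remainder = divmod(total, len(kept))
    selected = []
    for position, cluster in enumerate(kept):
        # Assumption: the first `remainder` clusters contribute one extra document.
        quota = base + (1 if position < remainder else 0)
        members = cluster_to_indices[cluster]
        selected.extend(rng.sample(members, min(quota, len(members))))
    rng.shuffle(selected)  # spread cluster entries evenly across the result
    return selected

# Toy setup: 4 clusters with 100 documents each, cluster 2 is excluded.
clusters = {c: list(range(c * 100, c * 100 + 100)) for c in range(4)}
sample = uniform_cluster_sample(clusters, excluded={2}, total=10)
print(sample)
```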

The distillation step produces a new dataset, the MiniPile, which is a subset of the original Pile dataset.<br>
The created dataset contains exactly $1,010,500$ documents, as intended. The dataset is shuffled to spread cluster entries evenly across the dataset's splits.<br>
Additionally, I added a column `pile_idx` to each entry, denoting the original index of the document in the Pile dataset.<br>
This helped in making sure that the dataset actually captures a subset derived from across the entire embedded Pile.

The resulting self-created MiniPile can be found here: [https://huggingface.co/datasets/Marcus2112/minipile_recreation](https://huggingface.co/datasets/Marcus2112/minipile_recreation).

I now went on and adapted the `02_train_160M.py` script to train Pythia $160\text{M}$ on the newly created MiniPile.<br>
The adapted script is `03_train_160M_recreation.py`.

---

## Evaluate Pythia $160\text{M}$ Pile vs. Pythia $160\text{M}$ MiniPile (recreated)

We will use an exact copy of the training and the test setup previously used for benchmarking Pythia $160\text{M}$-Pile and $160\text{M}$-MiniPile-Original.<br>
Training is performed with the script `03_train_160M_recreation.py`.

The trained model can be found here: [https://huggingface.co/Marcus2112/pythia-160m-minipile_recreation](https://huggingface.co/Marcus2112/pythia-160m-minipile_recreation)

In [3]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from lm_eval import utils, simple_evaluate
from lm_eval.models.huggingface import HFLM

In [None]:
## Evaluation - Pythia 160M Trained on Self-Created MiniPile

device = "cuda" if torch.cuda.is_available() else "cpu"
pythia_minipile = AutoModelForCausalLM.from_pretrained(base_path / "pythia160m_minipile_Recreation_trained", local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained(base_path / "pythia160m_dedup_untrained", use_fast=True, local_files_only=True) # Use exact same tokenizer
pythia_minipile = pythia_minipile.to(device)
 
batch_size_hflm = 1

pythia_minipile_hflm = HFLM(pretrained=pythia_minipile,
                        tokenizer=tokenizer,
                        batch_size=batch_size_hflm)

results = simple_evaluate(model=pythia_minipile_hflm,
                          tasks=["arc_challenge", "mmlu", "winogrande", "hellaswag", "lambada", "blimp"],
                          num_fewshot=0,
                          batch_size=batch_size_hflm,
                          device=device,  # reuse the device selected above instead of hard-coding "cuda"
                          limit=None)

with open('03_eval_160M_minipile_recreation.txt', 'w') as f:
    f.write(str(results))

print(utils.make_table(results))

One can consider this point a moment of truth for the study project.

If benchmark scores indicate significant deviation, a multitude of causes could be the reason, e.g.:
- Embedding process may be logically flawed
- Switch to `e5-base-4k` may not pay off when looking at embedding accuracy (our gamble for performance didn't work)
- Clustering process may be logically flawed
- Hand-selected clusters do not match the categories as laid out by the paper (interpretative differences amount to fundamentally different dataset characteristics)
- The index-based logical system used for saving memory and processing time may be inconsistent, leading us to sample wildly inaccurately
- The paper may not have been clear enough: clusters may actually have been sampled in proportion to their size rather than with equal sample counts across all of them
- All or some of the above.

However, if benchmark scores *do not* indicate significant deviation, this implies:
- The approach as laid out by the paper is reproducible
- The pipeline built until now works and can be safely improved upon
- The project from here on out may shift towards optimization and generalization, contributing to furthering the working approach

I split the results comparison into two parts. The first table compares the newly trained `160M Recreation` against `160M Pile Deduplicated`, while the second table compares `160M Recreation` against `160M MiniPile`.
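
The "Percentage Difference of Means" and "95% Confidence Interval" columns are consistent with the following computation, which treats the `±` values as standard errors of two independent runs and uses a normal approximation with $z = 1.96$. The function name is hypothetical; the example call uses the ARC-Challenge row of the first table and reproduces its entries of $-5.1577$ and $(0.0219; -0.0425)$.

```python
import math

def compare_runs(mean_a, se_a, mean_b, se_b, z=1.96):
    """Compare run B against baseline A.

    Returns the percentage difference of means, the confidence interval of the
    difference as (upper, lower), and whether the interval excludes zero.
    """
    diff = mean_b - mean_a
    pct = diff / mean_a * 100.0
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)  # standard error of a difference
    lower, upper = diff - z * se_diff, diff + z * se_diff
    significant = not (lower <= 0.0 <= upper)
    return pct, (upper, lower), significant

# ARC-Challenge: 160M Pile Deduplicated (A) vs. 160M Recreation (B)
pct, (upper, lower), significant = compare_runs(0.1997, 0.0117, 0.1894, 0.0115)
print(f"{pct:.4f} ({upper:.4f}; {lower:.4f}) significant={significant}")
```

A difference is reported as not significant whenever the interval contains zero.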

### 160M Pile-Deduplicated vs. 160M Recreation

| Benchmark        | Measure    |     | 160M Pile Deduplicated | 160M Recreation             | Percentage Difference of Means | 95% Confidence Interval       | Interpretation                            |
| ---------------- | ---------- | --- | ---------------------- | --------------------------- | ------------------------------ | ----------------------------- | ----------------------------------------- |
| ARC-Challenge    | acc        | ↑   | **0.1997 ± 0.0117**    | 0.1894 ± 0.0115             | -5.1577                        | (0.0219; -0.0425)             | **Difference not significant**                |
| MMLU             | acc        | ↑   | **0.2299 ± 0.0035**    | 0.2295 ± 0.0035             | -0.1740                        | (0.0093; -0.0101)             | **Difference not significant**                |
| HellaSwag        | acc        | ↑   | **0.2903 ± 0.0045**    | 0.2604 ± 0.0044             | -10.2997                       | (-0.0176; -0.0422)            | Pile Deduplicated-trained better          |
| WinoGrande       | acc        | ↑   | 0.4964 ± 0.0141        | **0.5122 ± 0.0140**         | 3.1829                         | (0.0547; -0.0231)             | **Difference not significant**                |
| Lambada (OpenAI) | acc        | ↑   | **0.3689 ± 0.0067**    | 0.0000 ± 0.0000             | -100.0                         | (-0.3558; -0.3820)            | Pile Deduplicated-trained severely better |
| Lambada (OpenAI) | perplexity | ↓   | **31.2589 ± 1.1594**   | 1854408.3999 ± 148101.5978  | 5932317.3272                   | (2144656.2727; 1564098.0093)  | Pile Deduplicated-trained severely better |
| Lambada (Std)    | acc        | ↑   | **0.2335 ± 0.0059**    | 0.0000 ± 0.0000             | -100.0                         | (-0.2219; -0.2451)            | Pile Deduplicated-trained severely better |
| Lambada (Std)    | perplexity | ↓   | **172.7619 ± 7.7265**  | 11927123.2514 ± 1063672.928 | 6903692.5905                   | (14011749.4290; 9842151.5500) | Pile Deduplicated-trained severely better |
| BLiMP            | acc        | ↑   | **0.7294 ± 0.0015**    | 0.5481 ± 0.0017             | -24.8560                       | (-0.1769; -0.1857)            | Pile Deduplicated-trained better          |

### 160M MiniPile vs. 160M Recreation

| Benchmark        | Measure    |     | 160M MiniPile               | 160M Recreation                 | Percentage Difference of Means | 95% Confidence Interval         | Interpretation             |
| ---------------- | ---------- | --- | --------------------------- | ------------------------------- | ------------------------------ | ------------------------------- | -------------------------- |
| ARC-Challenge    | acc        | ↑   | **0.2125 ± 0.0120**         | 0.1894 ± 0.0115                 | -10.8706                       | (0.0095; -0.0577)               | **Difference not significant** |
| MMLU             | acc        | ↑   | **0.2699 ± 0.0037**         | 0.2295 ± 0.0035                 | -14.9685                       | (-0.0304; -0.0504)              | MiniPile-trained better    |
| HellaSwag        | acc        | ↑   | 0.2560 ± 0.0044             | **0.2604 ± 0.0044**             | 1.7188                         | (0.0166; -0.0078)               | **Difference not significant** |
| WinoGrande       | acc        | ↑   | 0.4720 ± 0.0140             | **0.5122 ± 0.0140**             | 8.5169                         | (0.0790; 0.0014)                | **Recreation better**          |
| Lambada (OpenAI) | acc        | ↑   | 0.0000 ± 0.0000             | 0.0000 ± 0.0000                 | -                              | -                               | -                          |
| Lambada (OpenAI) | perplexity | ↓   | 3033175.2693 ± 288926.5827  | **1854408.3999 ± 148101.5978**  | -38.8625                       | (-542407.4980; -1815126.2408)   | **Recreation severely better** |
| Lambada (Std)    | acc        | ↑   | 0.0000 ± 0.0000             | 0.0000 ± 0.0000                 | -                              | -                               | -                          |
| Lambada (Std)    | perplexity | ↓   | 27067951.3460 ± 2710040.191 | **11927123.2514 ± 1063672.928** | -55.9364                       | (-9434663.1814; -20846993.0080) | **Recreation severely better** |
| BLiMP            | acc        | ↑   | 0.5194 ± 0.0018             | **0.5481 ± 0.0017**             | 5.5256                         | (0.0336; 0.0238)                | **Recreation better**          |

Not only do we see a match with the original MiniPile on all benchmarks but one (MMLU), we also substantially improved the model's perplexity on both Lambada benchmarks and achieved better results on WinoGrande and BLiMP.

This is interesting, because we composed `MiniPile_Recreation` with the exact same amount of samples.<br>
While this is the case, the dataset is ~500MB (in text, rest is additional idx column values + metadata) larger, but this difference, which itself can be explained by sampling and embedding expression deviations, has brought down perplexity by ~39% and ~56% respectively.

**What does this noticeable improvement in perplexity imply?**
I consider Lambada (OpenAI) and Lambada (Std) useful for checking whether dataset reduction impacts the capability to model general English text, as both test whether the model can predict the final word of a passage from its broader context. The evaluation of grammatical knowledge across $67$ distinct linguistic phenomena/subtasks in English is what BLiMP provides, and the Recreation improved there as well.
It seems that the sampled data represents nuances of the English language better, in such a way that the trained model could generalize grammatical rules from across the identified clusters much more easily.

I conclude that the selection and assembly of `MiniPile_Recreation` has been successful to the extent that it can be tried for training Pythia $1.4\text{B}$.

---

## Evaluate Pythia $1.4\text{B}$ Pile vs. Pythia $1.4\text{B}$ MiniPile (recreated)

Once again, we will use an exact copy of the training and the test setup previously used for benchmarking Pythia $160\text{M}$-Pile and $160\text{M}$-MiniPile-Original.<br>
Critically, parameters like the size-specific learning rate and the batch size accommodated to the system have been adjusted, while the step count is kept the same.<br>
Training is performed with the script `03_train_1.4B_recreation.py`.

| Benchmark        | Measure    |     | 1.4B Pile Pretrained | 1.4B MiniPile Recreation | Percentage Difference of Means | 95% Confidence Interval | Interpretation |
| ---------------- | ---------- | --- | -------------------- | ------------------------ | ------------------------------ | ----------------------- | -------------- |
| ARC-Challenge    | acc        | ↑   | 0.2600 ± 0.0130      |                          |                                |                         |                |
| MMLU             | acc        | ↑   | 0.2388 ± 0.0036      |                          |                                |                         |                |
| HellaSwag        | acc        | ↑   | 0.4177 ± 0.0049      |                          |                                |                         |                |
| WinoGrande       | acc        | ↑   | 0.5730 ± 0.0140      |                          |                                |                         |                |
| Lambada (OpenAI) | acc        | ↑   | 0.6202 ± 0.0068      |                          |                                |                         |                |
| Lambada (OpenAI) | perplexity | ↓   | 6.1041 ± 0.1531      |                          |                                |                         |                |
| Lambada (Std)    | acc        | ↑   | 0.4898 ± 0.0070      |                          |                                |                         |                |
| Lambada (Std)    | perplexity | ↓   | 11.2448 ± 0.3305     |                          |                                |                         |                |
| BLiMP            | acc        | ↑   | 0.8154 ± 0.0013      |                          |                                |                         |                |

---

## Improve the Dataset Creation Process

The results and implementation specifics from the reproduction directly affect the investigation for improvements.<br>
**The aim is to investigate different ideas for creating a new SuperMiniPile dataset that is ideally smaller and/or more information-retaining.**

### Idea 1 - Cluster-Proportionate Sampling

The original MiniPile dataset was created by sampling *equal* amounts of documents from each of the non-excluded clusters. This results in a MiniPile that cannot represent the original dataset's cluster distribution anymore, but rather imposes a uniform distribution across the clusters, no matter their size or importance/'weight'.

For a first improvement attempt, named 'Proportionate', we keep as close to the reproduction code as possible. But, instead of sampling equal amounts of documents from each remaining cluster, we sample a proportionate amount of documents based on cluster sizes (by document count). This requires making the number of data points an upper bound rather than a fixed requirement, as we may not be able to sample the exact number of documents from each cluster in proportion to its size. We just don't want to exceed the MiniPile document count.
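
A minimal sketch of such a size-proportionate allocation is shown below. Rounding each quota down is an assumption made here to guarantee the upper bound; the actual rounding in `03_distill_pile_embed_idea_1_proportionate.py` may differ, and the function name and toy numbers are hypothetical.

```python
def proportionate_quotas(cluster_sizes, excluded, max_total):
    """Allocate per-cluster sample counts in proportion to cluster size.

    Quotas are floored, so their sum never exceeds max_total and falls short
    of it by less than one document per kept cluster.
    """
    kept = {c: n for c, n in cluster_sizes.items() if c not in excluded}
    kept_total = sum(kept.values())
    return {c: (max_total * n) // kept_total for c, n in kept.items()}

# Toy setup: cluster 2 is excluded; the remaining clusters hold 1000 documents.
sizes = {0: 500, 1: 300, 2: 1000, 3: 200}
quotas = proportionate_quotas(sizes, excluded={2}, max_total=101)
print(quotas, sum(quotas.values()))
```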

As a side note, the original script was observed as being relatively memory-demanding and, moreover, cache-guzzling.<br>
The parquet data-retrieval process was to blame for that and was thoroughly improved: Pandas was replaced with NumPy and explicit cache management, which in turn removed the need for deep copying.

This idea's distillation script is implemented in `03_distill_pile_embed_idea_1_proportionate.py`.<br>
Based on the thus assembled $1,010,409$ documents, a Pythia $160\text{M}$ was trained for evaluation.<br>
This idea's trained model can be found here: [https://huggingface.co/Marcus2112/pythia-160m-minipile_proportionate](https://huggingface.co/Marcus2112/pythia-160m-minipile_proportionate)

The benchmark results are as follows, compared to Pile, MiniPile and MiniPile Recreation:

| Benchmark        | Measure    |     | 160M Pile Deduplicated | 160M MiniPile               | 160M Recreation             | 160M Proportionate           |
| ---------------- | ---------- | --- | ---------------------- | --------------------------- | --------------------------- | ---------------------------- |
| ARC-Challenge    | acc        | ↑   | 0.1997 ± 0.0117        | **0.2125 ± 0.0120**         | 0.1894 ± 0.0115             | 0.1928 ± 0.0115              |
| MMLU             | acc        | ↑   | 0.2299 ± 0.0035        | **0.2699 ± 0.0037**         | 0.2295 ± 0.0035             | 0.2295 ± 0.0035              |
| HellaSwag        | acc        | ↑   | **0.2903 ± 0.0045**    | 0.2560 ± 0.0044             | 0.2604 ± 0.0044             | 0.2613 ± 0.0044              |
| WinoGrande       | acc        | ↑   | 0.4964 ± 0.0141        | 0.4720 ± 0.0140             | **0.5122 ± 0.0140**         | 0.5051 ± 0.0141              |
| Lambada (OpenAI) | acc        | ↑   | **0.3689 ± 0.0067**    | 0.0000 ± 0.0000             | 0.0000 ± 0.0000             | 0.0000 ± 0.0000              |
| Lambada (OpenAI) | perplexity | ↓   | **31.2589 ± 1.1594**   | 3033175.2693 ± 288926.5827  | 1854408.3999 ± 148101.5978  | 2214257.4651 ± 184064.6008   |
| Lambada (Std)    | acc        | ↑   | **0.2335 ± 0.0059**    | 0.0000 ± 0.0000             | 0.0000 ± 0.0000             | 0.0000 ± 0.0000              |
| Lambada (Std)    | perplexity | ↓   | **172.7619 ± 7.7265**  | 27067951.3460 ± 2710040.191 | 11927123.2514 ± 1063672.928 | 15143084.5983 ± 1387627.8650 |
| BLiMP            | acc        | ↑   | **0.7294 ± 0.0015**    | 0.5194 ± 0.0018             | 0.5481 ± 0.0017             | 0.5452 ± 0.0017              |

The proportional sampling approach yields a dataset that is ~300 MB smaller than the Recreation dataset, while also containing $91$ fewer entries.<br>
At the same time, proportional sampling yields close to equal results in all but the perplexity benchmarks, where it underperforms compared to the Recreation dataset, yet still beats the original MiniPile by a large margin.

> We seem to have lost some highly informative examples from smaller clusters when scaling down their representation to match cluster proportions.

*The proportional sampling approach is not a clear improvement over the recreation*, but rather a compromise between the original MiniPile and the Recreation dataset (tending towards the latter in results though). This is interesting, as it implies the uniform sampling approach may have been more effective in capturing the dataset's most telling examples than the proportional sampling approach. I do not think that is a fault of the proportional sampling approach, but rather a symptom arising from the original dataset's composition.

> Even though we incorporate the original dataset topology more directly, this doesn't necessarily come along with a better focus on the most informative examples.

Proportional sampling is neither a failure nor a universal solution, and neither is uniform sampling, but the dismissal of either approach would be premature.<br>
Instead, sampling styles should be context-dependent.<br>
I assume that when performance on specific downstream tasks, such as language modeling, is stated as the primary goal, incorporating adaptive weighting mechanisms that emphasize the contribution of critical clusters might yield superior results.

I conclude that, while itself not a clear improvement, the proportional sampling idea is valid for e.g. approaches where it is important to capture the original dataset's cluster distribution more closely. If the main goal is retaining performance, however, future ideas should investigate how to find a content-based weighting factor per cluster.

### Idea 2 - Hybrid Sampling (Lossi)

In order to sample documents considered most representative and informative, the original MiniPile uses a one-shot proxy-based geometric sampling strategy. After all, the pipeline doesn't really 'select' documents by their content, but by their embedding's position in the cluster space, relative to other documents by proxy of the centroid.<br>
Beyond that, once clusters have been determined and selected, the pipeline samples randomly across each cluster.<br>
The comparison of different subset assembly techniques performed by [(Guo et al. 2022)](https://arxiv.org/abs/2204.08499) concludes that random sampling can be considered a very robust baseline for custom subset selection efforts, adaptable to various tasks.

I deduce that cluster-wise random sampling, as already performed, is effective, fast and versatile, but cannot explicitly consider the actual, point-by-point degree of informativeness. To an extent, we rely on luck and sampling spread for finding a best MiniPile distillate.

Instead, what if we could build on the heavy lifting done by the embedding and clustering steps, while adding a component that guides both the per-cluster sampling weights and the selection of specific documents by their informativeness, thereby addressing the findings of Idea 1 (Proportionate Sampling)?

**Idea 2, called Lossi (Loss-informed Sampling)**, utilizes much of the existing pipeline for the sake of time and efficiency.

Lossi as a whole consists of several adaptations.<br>
The main idea is to use a small proxy model to determine the informativeness of each cluster and then sample documents from each cluster in proportion to its informativeness. We can do this at two points during dataset assembly, which is why Idea 2.1 investigates the first point, and Idea 2.2 the first combined with the second.

What do I mean by that?

**Idea 2.1** covers these adaptations:
- Per cluster: Uniformly sample $n$ (e.g. $1,000$) documents and determine their loss with a small Pythia $70\text{M}$ proxy model
- Use the mean loss as a heuristic for the cluster's informativeness and weight the cluster's representation in the final dataset by this value
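
One way to read these two steps is sketched below: each cluster's quota is proportional to the mean proxy loss of its sampled documents. Whether the actual scripts combine the loss with the cluster size, or normalize it differently, is not stated here, so this weighting is an assumption; the per-document losses are given as plain numbers instead of being computed with the Pythia $70\text{M}$ proxy, and all names are hypothetical.

```python
from statistics import fmean

def loss_weighted_quotas(sampled_losses, max_total):
    """Turn per-cluster proxy losses into per-cluster sample quotas.

    sampled_losses maps a cluster id to the losses of the n documents that were
    uniformly sampled from it. Quotas are floored to stay within max_total.
    """
    mean_loss = {c: fmean(losses) for c, losses in sampled_losses.items()}
    total_loss = sum(mean_loss.values())
    return {c: int(max_total * loss / total_loss) for c, loss in mean_loss.items()}

# Toy setup: cluster 0 is "harder" for the proxy (mean loss 3.0) than cluster 1 (1.0).
sampled = {0: [2.0, 4.0], 1: [1.0, 1.0]}
lossi_quotas = loss_weighted_quotas(sampled, max_total=100)
print(lossi_quotas)
```

Under this reading, clusters on which the proxy model performs worse receive a larger share of the final dataset.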

**Idea 2.2** covers these adaptations:
- The loss-proportional sampling information from Idea 2.1 is used to guide the cluster-wise random sampling process
- Per cluster: Randomly sample $1.5\times$ the amount of documents we want to end up with from each non-excluded cluster
- Per cluster: Calculate the loss for each sampled document with a small Pythia $70\text{M}$ proxy model that was itself pretrained halfway through The Pile Deduped (`step72000`).
- Per cluster: Sort the documents by their loss and select the top half of the documents with the highest loss for the final dataset
- We continue with the dataset assembly as before after that.
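
The per-cluster selection of Idea 2.2 can be sketched as follows, taking the numbers in the list literally: oversample by $1.5\times$, rank by proxy loss, keep the half with the highest loss. Taken literally, this keeps $0.75\times$ the per-cluster quota; how the final count is reconciled with the target is not stated here, so both factors are exposed as parameters. The loss function is a toy stand-in for the Pythia $70\text{M}$ proxy, and all names are hypothetical.

```python
import random

def lossi_select(members, quota, loss_fn, oversample=1.5, keep_fraction=0.5, seed=42):
    """Oversample a cluster at random, then keep its highest-loss documents."""
    rng = random.Random(seed)
    pool_size = min(len(members), int(quota * oversample))
    pool = rng.sample(members, pool_size)
    ranked = sorted(pool, key=loss_fn, reverse=True)  # highest proxy loss first
    keep = int(len(ranked) * keep_fraction)
    return ranked[:keep], ranked[keep:]

# Toy cluster of 100 documents; the "loss" simply grows with the document index.
toy_members = list(range(100))
toy_loss = lambda idx: idx / 100
kept_docs, dropped_docs = lossi_select(toy_members, quota=20, loss_fn=toy_loss)
print(len(kept_docs), len(dropped_docs))
```

The slicing happens inside each cluster, so every non-excluded cluster stays represented regardless of how its losses compare to those of other clusters.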

Note that we do this information-sorted slicing per cluster and *not* globally, so as to still represent the cluster distribution in the final dataset. Otherwise, some clusters might not be represented at all anymore, because other clusters' examples simply induced more perplexity in the proxy.

Note also that we select the smallest Pythia model trained halfway through The Pile, which assumes that The Pile is shuffled with regard to cluster assignments.<br>
We can verify that this is the case by looking at the unsorted cluster assignment results, which are written in the order in which documents appear in the embedded Pile dataset.
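
One possible way to make that check quantitative, as a rough sketch: compare the cluster distribution of the first half of the assignment sequence with that of the second half. The helper `half_split_distance` and the toy sequences are hypothetical. A value near 0 is consistent with a well-shuffled order, while a value near 1 would indicate that clusters are concentrated in one half.

```python
from collections import Counter

def half_split_distance(assignments):
    """Total variation distance between the cluster distributions of both halves."""
    mid = len(assignments) // 2
    first, second = Counter(assignments[:mid]), Counter(assignments[mid:])
    n_first, n_second = mid, len(assignments) - mid
    clusters = set(first) | set(second)
    return 0.5 * sum(abs(first[c] / n_first - second[c] / n_second) for c in clusters)

# Toy example: an interleaved sequence vs. the same sequence sorted by cluster
shuffled = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
ordered = sorted(shuffled)
print(half_split_distance(shuffled), half_split_distance(ordered))
```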

Also, I deliberately chose a small proxy model. MiniPile was intended for use in constrained academic settings, so we have to make do with small models that are as widely trainable as possible, while not being so small that they become useless as a proxy. The same reasoning explains why the $70\text{M}$ model is only trained on half of the full Pile dataset.

**Effectively, Lossi is a one-shot proxy-based geometric sampling approach that is guided by a loss-based importance heuristic.**

### Idea 2.1 - Loss-informed Proportionate Cluster Sampling

This idea's distillation scripts are implemented in `03_distill_pile_embed_idea_2.1_lossi_1.py` and `03_distill_pile_embed_idea_2.1_lossi_2.py`.<br>
Based on the $$ documents large dataset, a Pythia $160\text{M}$ was trained for evaluation.<br>
This idea's trained model can be found here: [https://huggingface.co/Marcus2112/pythia-160m-minipile_loss-sampled](https://huggingface.co/Marcus2112/pythia-160m-minipile_loss-sampled)

The benchmark results are as follows, compared to the original MiniPile and the Recreation:

| Benchmark        | Measure    |     | 160M MiniPile                | 160M Lossi 1                     | Percentage Difference of Means | 95% Confidence Interval         | Interpretation             |
| ---------------- | ---------- | --- | ---------------------------- | -------------------------------- | ------------------------------ | ------------------------------- | -------------------------- |
| ARC-Challenge    | acc        | ↑   | **0.2125 ± 0.0120**          | 0.1980 ± 0.0116                  | -6.8235                        | (0.0182; -0.0472)               | Difference not significant |
| MMLU             | acc        | ↑   | **0.2699 ± 0.0037**          | 0.2295 ± 0.0035                  | -14.9685                       | (-0.0304; -0.0504)              | MiniPile-trained better    |
| HellaSwag        | acc        | ↑   | 0.2560 ± 0.0044              | **0.2599 ± 0.0044**              | 1.5234                         | (0.0161; -0.0083)               | Difference not significant |
| WinoGrande       | acc        | ↑   | 0.4720 ± 0.0140              | **0.5107 ± 0.0140**              | 8.1992                         | (0.0775; -0.0001)               | Difference not significant |
| Lambada (OpenAI) | acc        | ↑   | 0.0000 ± 0.0000              | 0.0000 ± 0.0000                  | -                              | -                               | -                          |
| Lambada (OpenAI) | perplexity | ↓   | 3033175.2693 ± 288926.5827   | **2116445.1732 ± 175403.0579**   | -30.2234                       | (-254247.7681; -1579212.4241)   | Lossi 1 severely better    |
| Lambada (Std)    | acc        | ↑   | 0.0000 ± 0.0000              | 0.0000 ± 0.0000                  | -                              | -                               | -                          |
| Lambada (Std)    | perplexity | ↓   | 27067951.3461 ± 2710040.1910 | **14896599.9251 ± 1366937.5470** | -44.9659                       | (-6222231.2223; -18120471.6197) | Lossi 1 severely better    |
| BLiMP            | acc        | ↑   | 0.5194 ± 0.0018              | **0.5492 ± 0.0017**              | 5.7374                         | (0.0347; 0.0249)                | Lossi 1 better             |

| Benchmark        | Measure    |     | 160M Recreation                  | 160M Lossi 1                 | Percentage Difference of Means | 95% Confidence Interval      | Interpretation             |
| ---------------- | ---------- | --- | -------------------------------- | ---------------------------- | ------------------------------ | ---------------------------- | -------------------------- |
| ARC-Challenge    | acc        | ↑   | 0.1894 ± 0.0115                  | **0.1980 ± 0.0116**          | 4.5407                         | (0.0406; -0.0234)            | Difference not significant |
| MMLU             | acc        | ↑   | 0.2295 ± 0.0035                  | 0.2295 ± 0.0035              | 0.0000                         | (0.0097; -0.0097)            | Difference not significant |
| HellaSwag        | acc        | ↑   | **0.2604 ± 0.0044**              | 0.2599 ± 0.0044              | -0.1920                        | (0.0117; -0.0127)            | Difference not significant |
| WinoGrande       | acc        | ↑   | **0.5122 ± 0.0140**              | 0.5107 ± 0.0140              | -0.2929                        | (0.0373; -0.0403)            | Difference not significant |
| Lambada (OpenAI) | acc        | ↑   | 0.0000 ± 0.0000                  | 0.0000 ± 0.0000              | -                              | -                            | -                          |
| Lambada (OpenAI) | perplexity | ↓   | **1854408.3999 ± 148101.5978**   | 2116445.1732 ± 175403.0579   | 14.1305                        | (711985.1414; -187911.5948)  | Difference not significant |
| Lambada (Std)    | acc        | ↑   | 0.0000 ± 0.0000                  | 0.0000 ± 0.0000              | -                              | -                            | -                          |
| Lambada (Std)    | perplexity | ↓   | **11927123.2514 ± 1063672.9280** | 14896599.9251 ± 1366937.5470 | 24.8968                        | (6364250.0618; -425296.7144) | Difference not significant |
| BLiMP            | acc        | ↑   | 0.5481 ± 0.0017                  | **0.5492 ± 0.0017**          | 0.2007                         | (0.0058; -0.0036)            | Difference not significant |
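
The last three columns of both tables appear consistent with a normal-approximation interval on the difference of means, with the two standard errors combined in quadrature and $z = 1.96$. This is inferred from the reported numbers, not stated by the evaluation code. Under that assumption, a row can be reproduced as follows (`compare` is a hypothetical helper):

```python
import math

def compare(mean_a, se_a, mean_b, se_b, z=1.96):
    """Compare model B against baseline A, given means and standard errors."""
    diff = mean_b - mean_a
    pct = 100.0 * diff / mean_a               # percentage difference of means
    se = math.sqrt(se_a ** 2 + se_b ** 2)     # standard error of the difference
    upper, lower = diff + z * se, diff - z * se
    significant = not (lower <= 0.0 <= upper)  # interval excludes zero
    return pct, (upper, lower), significant

# ARC-Challenge row: 160M MiniPile vs. 160M Lossi 1
pct, ci, significant = compare(0.2125, 0.0120, 0.1980, 0.0116)
print(f"{pct:.4f} ({ci[0]:.4f}; {ci[1]:.4f}) significant={significant}")
```

A difference is read as significant only when the interval does not contain zero, which is why the ARC-Challenge row is labelled "Difference not significant".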

Compared to the original MiniPile, Lossi 1 significantly improves Lambada perplexity and BLiMP accuracy, while MMLU accuracy is significantly worse. However, we can't attribute the gains to the loss-informed sampling approach, as the Recreation performs almost identically: none of its differences to Lossi 1 are significant. This implies that the changes made unintentionally during the recreation process are more effective than the loss-informed sampling itself. These changes include:

- The embedding process, which during the Recreation uses a slightly worse model than the original MiniPile, but processes larger context sizes for the embeddings (1024 instead of 512)
- The clustering process, which may induce different cluster formations based on the different embeddings
- The interpretability inherent in the cluster exclusion process, which may have led to a more effective dataset

This suggests that changes in the embedding and clustering process may be more effective than the loss-informed sampling approach itself.<br>
I find it quite surprising that scaling the cluster proportions by perceived informativeness doesn't yield a clear improvement over uniform sampling. One possible reason is that the proxy model cannot capture the full complexity of the dataset, so its loss is a poor guide to the most informative examples. In that case, loss-informed sampling is at a disadvantage compared to uniform sampling, because it adds computational cost without a corresponding gain.

### Idea 3 - Density-based Proportionate Cluster Sampling

An inherent danger with loss-based informativeness approximation is the potential for skewing the dataset towards harder examples, which could make the dataset less representative of general tasks. While loss can be considered a related indicator, it is not a direct measure of informativeness.

Taking a step back with this insight, we should recall that MiniPile's goal is to capture the most representative subset of the 'insights' that can be gained from the original Pile dataset. To achieve that, we can look at diversity, which could improve generalization, especially when broad informational coverage is required.

Specifically, sampling inversely proportional to cluster density prioritizes sparse regions of the dataset and thus broad coverage of its information. However, as lunch is never free, over-representing sparse clusters will likely introduce noise, especially when cluster sparsity correlates with low-quality examples. One could argue that we already excluded the least interesting and thus most noise-inducing clusters, but that does not mitigate the potential for noise itself, and it is unclear whether it can be mitigated at all. Density-based sampling therefore has to be understood as a trade-off between spread for representation and accidentally capturing noise.

Still, density-based sampling could help capture a maximally diverse set of examples, which in turn could benefit generalization.
However, we have to find a way to mitigate noise and over-representation at least to some extent.

I therefore propose a density-based sampling approach that calculates cluster contribution proportions like so:
$$\text{Cluster Proportion} = \frac{|C_i|}{|\bigcup_{j} C_j|} \cdot (1 - \omega \cdot \rho(C_i))$$

where $|C_i|$ is the number of documents in cluster $i$, $|\bigcup_{j} C_j|$ is the total number of documents in all clusters, and $\rho(C_i)$ is the density of cluster $i$. The impact of the density is scaled by the hyperparameter $\omega$, which limits how strongly very sparse clusters are over-represented.<br>
I set $\omega = 0.5$ with the intention of having neither cluster size nor density completely dominate the proportion calculation.
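
A small sketch of the formula, assuming that $\rho(C_i)$ has already been normalised to $[0, 1]$ (how the density is computed is not specified here). The name `density_proportions` and the toy values are hypothetical. The renormalisation at the end is an extra step that the formula itself does not include.

```python
def density_proportions(cluster_sizes, cluster_densities, omega=0.5):
    """Cluster proportion = |C_i| / total * (1 - omega * rho(C_i)).

    Assumption: densities are normalised to [0, 1].
    """
    total = sum(cluster_sizes.values())
    return {
        c: (cluster_sizes[c] / total) * (1.0 - omega * cluster_densities[c])
        for c in cluster_sizes
    }

# Toy example: a large, maximally dense cluster and a smaller, maximally sparse one
sizes = {"dense": 600, "sparse": 400}
densities = {"dense": 1.0, "sparse": 0.0}
raw = density_proportions(sizes, densities)

# Optional renormalisation so that the proportions sum to 1 (not part of the formula)
normalised = {c: p / sum(raw.values()) for c, p in raw.items()}
print(raw, normalised)
```

With $\omega = 0.5$, the densest possible cluster keeps half of its size-based share, so cluster size still matters alongside density.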

### Idea 4 - Hierarchical Embedding through Sparse Sampling (HESS)

A fourth idea, named HESS, aims to improve the dataset creation process entirely without the need for a proxy model.

---

## Evaluate Pythia $160\text{M}$ SuperMiniPile

---

## Evaluate Pythia $1.4\text{B}$ Pretrained vs. Pythia $1.4\text{B}$ SuperMiniPile

In [None]:
import torch
import numpy as np
from pathlib import Path
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel
import torch.nn.functional as F
from tqdm import tqdm

def average_pool(last_hidden_states, attention_mask):
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

@torch.no_grad()
def get_embedding(text, tokenizer, model):
    inputs = tokenizer(f"query: {text}", max_length=1024, padding="max_length", truncation=True, return_tensors='pt')
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    outputs = model(**inputs)
    embeddings = average_pool(outputs.last_hidden_state, inputs['attention_mask'])
    embeddings = F.normalize(embeddings, p=2, dim=1)
    return embeddings.cpu().numpy()[0]

# Load model and tokenizer
model_path = "/mnt/data/e5-base-4k"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = AutoModel.from_pretrained(model_path, attn_implementation="sdpa")
model.eval()
model.to('cuda' if torch.cuda.is_available() else 'cpu')

# Directory containing parquet files
parquet_dir = Path("/mnt/data/Pile_Deduplicated_Embd")

for i in range(3):
    parquet_file = parquet_dir / f"shard_{i:09d}.parquet"
    if not parquet_file.exists():
        print(f"File {parquet_file} not found. Skipping.")
        continue

    dataset = load_dataset("parquet", data_files=str(parquet_file))["train"]
    print(f"Iterating over shard {i}....")
    for entry in tqdm(dataset):
        stored_embedding = np.array(entry['embedding'])
        text = entry['text']
        # Generate new embedding
        new_embedding = get_embedding(text, tokenizer, model)
        # Convert both for fp16 precision
        stored_embedding_fp16 = stored_embedding.astype(np.float16)
        new_embedding_fp16 = new_embedding.astype(np.float16)
        cosine_similarity = np.dot(stored_embedding_fp16, new_embedding_fp16) / (np.linalg.norm(stored_embedding_fp16) * np.linalg.norm(new_embedding_fp16))
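        # 1.0 means identical direction, 0.0 means orthogonal. fp16 rounding makes an exact
        # comparison unreliable, so allow a small tolerance (1e-2 is an assumed value)
        if not np.isclose(cosine_similarity, 1.0, atol=1e-2):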
            print(f"Text: {text}")
            print(f"Cosine similarity: {cosine_similarity}")

In [None]:
from datasets import load_dataset

# Specify the path to your Parquet file
data_files_1 = {"train": "/vol/tmp/koppelmm/Pile_Deduplicated_Embd/shard_000000001.parquet"}
data_files_2 = {"train": "/vol/tmp/koppelmm/Pile_Deduplicated_Embd/shard_000000002.parquet"}

# Load the dataset
dataset_1 = load_dataset("parquet", data_files=data_files_1, split="train")
dataset_2 = load_dataset("parquet", data_files=data_files_2, split="train")

# Print the number of entries in each shard, next to 128*8192 for comparison
print(f"Number of entries in shard 1: {len(dataset_1)} ({128*8192})")
print(f"Number of entries in shard 2: {len(dataset_2)} ({128*8192})")

# Print the first entry from each dataset
first_entry_1 = dataset_1[0]
first_entry_2 = dataset_2[0]

print("First entry in shard 1:")
print(first_entry_1)

print("First entry in shard 2:")
print(first_entry_2)

# Check if the first entries are different
if first_entry_1 == first_entry_2:
    print("The first entries are identical.")
else:
    print("The first entries are different.")

In [11]:
import heapq

def update_heap(heap, item, key_func, max_size=5, reverse=False):
    key = key_func(item)
    heapq.heappush(heap, (key if not reverse else -key, item))
    if len(heap) > max_size:
        heapq.heappop(heap)

def test_update_heap():
    # Test ascending order heap (keep largest items based on `value`)
    heap = []
    items = [{'value': i} for i in range(10)]  # Items with values from 0 to 9

    for item in items:
        update_heap(heap, item, key_func=lambda x: x['value'], max_size=5, reverse=False)

    # Extract the final items from the heap, sorted by `value`
    result = [entry[1] for entry in sorted(heap, key=lambda x: x[0])]
    expected = [{'value': i} for i in range(5, 10)]

    assert result == expected, f"Unexpected heap contents: {result} vs. expected {expected}"

    # Test descending order heap (keep smallest items based on `value`)
    heap = []
    for item in items:
        update_heap(heap, item, key_func=lambda x: x['value'], max_size=5, reverse=True)

    # Keys are stored negated, so sorting them in reverse yields ascending `value`
    result = [entry[1] for entry in sorted(heap, key=lambda x: x[0], reverse=True)]
    expected = [{'value': i} for i in range(5)]

    assert result == expected, f"Unexpected heap contents: {result} vs. expected {expected}"

    print("All tests passed!")

test_update_heap()

All tests passed!

In [3]:
from multiprocessing import cpu_count

print(cpu_count() // 2)

36
