# Recreating and Improving MiniPile Dataset Creation

**Objectives:**
- [x] Implement and verify MiniPile’s filtering pipeline according to [Kaddour (2023)](https://arxiv.org/abs/2304.08442), but intended for decoder-only model use
- [x] Evaluate and compare performances of Pythia $160\text{M}$ pretrained on The Pile vs. trained on the *newly, self-created MiniPile* on MMLU and ARC-Challenge
- [.] Evaluate and compare performances of Pythia $1.4\text{B}$ pretrained on The Pile vs. trained on the *newly, self-created MiniPile* on MMLU and ARC-Challenge
- [.] Improve the dataset creation process, create new SuperMiniPile dataset (ideally smaller and more information-retaining)
- [] Evaluate Pythia $160\text{M}$ on SuperMiniPile on MMLU and ARC-Challenge
- [] Evaluate and compare performances of Pythia $1.4\text{B}$ pretrained on The Pile vs. trained on SuperMiniPile on the MMLU and ARC benchmarks

In [None]:
#! pip install sentence-transformers

In [5]:
import os
import torch
import numpy as np
from tqdm import tqdm
from pathlib import Path
from datasets import load_dataset
from huggingface_hub import snapshot_download
from sentence_transformers import SentenceTransformer

base_dir = "/vol/tmp/koppelmm"
base_path = Path(base_dir)

In [2]:
def download_model(down_dir: str, target_folder: str, cache_folder: str, repo_id: str, branch: str = "main") -> None:
    """Download a model snapshot from the HuggingFace Hub into `down_dir / target_folder`."""
    down_dir = Path(down_dir)
    target_dir = down_dir / target_folder
    cache_dir = down_dir / cache_folder

    os.makedirs(target_dir, exist_ok=True)
    os.makedirs(cache_dir, exist_ok=True)

    print(f"Downloading {repo_id}/{branch}...")

    # Retry until the download succeeds (meant for flaky connections;
    # interrupt manually if the error is permanent, e.g. a wrong repo_id).
    while True:
        try:
            snapshot_download(
                repo_id,
                repo_type="model",
                revision=branch,
                cache_dir=str(cache_dir),
                local_dir=str(target_dir)
            )
            break
        except Exception as e:
            print(f"Download attempt failed: {e}")

---

## Recreating The MiniPile Dataset Creation Pipeline

(1) document embedding extraction,<br>
(2) clustering of embeddings,<br>
(3) human-guided exclusion of unwanted clusters, and<br>
(4) MiniPile distillation

- 22 data subset sources
- 5.91 KiB mean document size (before deduplication)

### Document Embedding Extraction

- The MiniPile paper uses the term "document": this refers to individual training examples from "The Pile-Deduplicated"
- "The Pile-Deduplicated" predominantly contains English text, as stated in the Pile paper
- `E5-Large` does not require sentence-splitting beforehand; I was misled by the example code at https://huggingface.co/intfloat/e5-large
- `E5-Large` scales poorly to the dataset size under the conditions imposed by the HU Berlin cluster, so I use `E5-Base-4k` instead
- `E5-Base-4k` performs slightly worse than `E5-Large`, but has roughly half the parameter count and is therefore cheaper to run
- To mitigate the expected performance loss, we use a larger text window of $1024$ tokens instead of the $512$ tokens that `E5-Large` uses by default

Given the smaller model and The Pile, we iterate through the dataset and extract the embedding for each document.<br>
The script I initially implemented for the embedding step is `03_embed_pile_dedup.py`.<br>
The approach laid out therein worked conceptually, but it had to be thoroughly memory-optimized to run for as long as our Pile dataset requires.<br>
The optimized script I ultimately ran for this step is `03_embed_pile_dedup_turbo.py`.

The embedding step produces, as its artifact, a copy of the original dataset with the embeddings added as a column.<br>
The embedding process is resumable: results are persisted one after another in multiple parquet files in the folder `Pile_Deduplicated_Embd`.<br>

**Note that I intended to upload this embedded version of the Pile to HuggingFace for strict reproducibility.**<br>
**This idea was cut short by a change in HuggingFace's pricing policy, effective January 2025, prohibiting the free sharing of datasets >500GB - a threshold which this new dataset crosses (801 GB) due to the added embeddings.**

Furthermore, note that the embedding step, because of its parallel processing, is not guaranteed to preserve the original order of the documents.<br>
This is the reason I chose to build a copy of the dataset with the embeddings as a column in the first place: it keeps each embedding aligned with its document while still allowing parallel processing, and time was the deciding factor.<br>
The shuffling is therefore not a problem as such, but the clustering and all subsequent steps have to be based on this new, shuffled dataset.

In other words, all artifacts produced after the embedding step will relate not to the original dataset, but the embedded dataset, e.g. when referring to entries by index.
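
The resumable, order-independent persistence described above can be sketched roughly as follows. This is not the code from `03_embed_pile_dedup_turbo.py`: the function name, the shard naming scheme, and the toy embedding function are hypothetical, and JSONL stands in for parquet to keep the sketch dependency-free.

```python
import json
import tempfile
from pathlib import Path


def embed_in_shards(documents, embed_fn, out_dir, shard_size):
    """Write one shard file per `shard_size` documents, skipping finished shards."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for start in range(0, len(documents), shard_size):
        shard_path = out_dir / f"shard_{start // shard_size:05d}.jsonl"
        if shard_path.exists():
            continue  # finished in an earlier run, so resume past it
        tmp_path = shard_path.with_suffix(".tmp")
        with open(tmp_path, "w", encoding="utf-8") as f:
            for text in documents[start:start + shard_size]:
                # The embedding is stored next to its text, so the order in
                # which shards are produced no longer matters for alignment.
                row = {"text": text, "embedding": embed_fn(text)}
                f.write(json.dumps(row) + "\n")
        tmp_path.replace(shard_path)  # only complete shards get the final name
        written.append(shard_path.name)
    return written


docs = ["alpha", "beta", "gamma", "delta", "epsilon"]
toy_embed = lambda text: [float(len(text)), float(text.count("a"))]

with tempfile.TemporaryDirectory() as tmp:
    first_run = embed_in_shards(docs, toy_embed, tmp, shard_size=2)
    second_run = embed_in_shards(docs, toy_embed, tmp, shard_size=2)

print(first_run)   # shards written by the first run
print(second_run)  # nothing left to do on the second run
```

Writing to a temporary name first means an interrupted run never leaves a half-written shard that a later run would mistake for a finished one.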

### Clustering of Embeddings

- Batchified $k$-means clustering, a term only used in the MiniPile paper: This must stand for **mini-batch k-means clustering**
- Cosine distance between normalized embeddings
- Cluster Count of $k=220$ ($10$ clusters per source)
- Batch size $16384$
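
A minimal NumPy sketch of mini-batch k-means on normalized embeddings, intended only to illustrate the idea and not taken from `03_cluster_pile_embed.py`. The function names and toy data are assumptions, and re-normalizing the centroids after each update (a spherical k-means variant) is my simplification so that the dot product stays equal to cosine similarity; the real run uses $k=220$ and batches of $16384$.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def cosine_distance(x, centroids):
    """Cosine distance (1 - cosine similarity) between unit rows and unit centroids."""
    return 1.0 - x @ centroids.T


def partial_fit(centroids, counts, batch):
    """One mini-batch step: pull each centroid toward the mean of its new members."""
    assignments = cosine_distance(batch, centroids).argmin(axis=1)
    for k in np.unique(assignments):
        members = batch[assignments == k]
        counts[k] += len(members)
        lr = len(members) / counts[k]  # per-centroid learning rate (running mean)
        centroids[k] = (1.0 - lr) * centroids[k] + lr * members.mean(axis=0)
    # Assumption: keep centroids on the unit sphere (spherical variant).
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    return centroids, counts


# Toy data: two well-separated directions in 2D instead of real embeddings.
rng = np.random.default_rng(0)
group_a = rng.normal([1.0, 0.0], 0.05, size=(200, 2))
group_b = rng.normal([0.0, 1.0], 0.05, size=(200, 2))
data = l2_normalize(np.vstack([group_a, group_b]))

centroids = data[[0, 200]].copy()  # one seed point per toy group
counts = np.zeros(2)
order = rng.permutation(len(data))
for start in range(0, len(data), 64):  # toy batch size of 64
    centroids, counts = partial_fit(centroids, counts, data[order[start:start + 64]])

labels = cosine_distance(data, centroids).argmin(axis=1)
print(np.bincount(labels))
```

Because each call to `partial_fit` only needs the current batch, the fit can consume embeddings as they become available instead of waiting for the full dataset.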

Architecturally, I built the clustering step to be fully independent of the embedding step, so as to be able to run the partial fitting concurrently with the latter, saving ~4 days of processing time total.

The clustering step is implemented in `03_cluster_pile_embed.py`. As soon as the embedding step finishes, a text file is produced, signaling the clustering step to conclude model fitting and start predicting. The centroids are saved and can be found in `MiniPile_BatchKMeans/cluster_centers.npy`. Intermediary centroid results have been omitted from this repository.

Each embedding from the newly created dataset is assigned to one of the $220$ clusters. The cluster assignment and the distance of the data point to its centroid are stored in JSONL files, with the entries in order of appearance in the embedded Pile dataset.<br>
Each clustering result looks like this: `{"idx": 0, "cluster": 5, "distance": 0.20949329195756128}`.<br>
Each Pile-Embedded document is referred to only by its index, which keeps the memory footprint of the cluster results small.
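
As an illustration of this record format, the following sketch turns a small matrix of document-to-centroid distances into JSONL lines. The function name and the toy distances are hypothetical; only the three field names are taken from the example record above.

```python
import io
import json


def write_cluster_results(distances, out):
    """Write one JSON line per document: its index, nearest cluster, and distance."""
    for idx, row in enumerate(distances):
        cluster = min(range(len(row)), key=row.__getitem__)  # index of the nearest centroid
        record = {"idx": idx, "cluster": cluster, "distance": row[cluster]}
        out.write(json.dumps(record) + "\n")


# Two toy documents, three toy centroids.
toy_distances = [[0.9, 0.2, 0.5], [0.1, 0.7, 0.3]]
buffer = io.StringIO()
write_cluster_results(toy_distances, buffer)

records = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(records)
```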

Beyond the JSONL files containing the cluster assignments, the clustering step produces a verification file `MiniPile_BatchKMeans/cluster_results_metadata.json` to indicate whether all chunks and all data points therein have been processed and how results have been saved. We can see from this file that the original dataset size of $134,318,121$ documents has been captured and thus processed, lending more credibility to the clustering results.

Additionally, the clustering step produces a file `MiniPile_BatchKMeans/cluster_info_for_inspection.json`.<br>
Per cluster index, this file contains the following information:
- `closest`: The top 5 closest documents to the cluster centroid
    - `text`: The associated text excerpt (for memory reasons)
    - `distance`: The cosine distance to the cluster centroid
- `farthest`: The top 5 farthest documents from the cluster centroid (again, excerpts, in same format)
- `total_examples`: The number of documents assigned to this cluster
- `average_distance`: The average cosine distance of all documents to the cluster centroid
- `sum_distance`: The sum of all cosine distances of all documents to the cluster centroid

The last three fields are intended to help with the human-guided exclusion of unwanted clusters. They may also help in improving the dataset creation process later on, as they provide insight into the spread of the data.
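
A per-cluster summary with these fields could be computed along the following lines. This is a sketch under assumptions, not the code that produced `cluster_info_for_inspection.json`: the function name, the excerpt length, and the toy records are hypothetical.

```python
import heapq
from collections import defaultdict


def summarize_clusters(records, texts, top_n=5, excerpt_chars=200):
    """Build closest/farthest excerpts and distance statistics per cluster."""
    by_cluster = defaultdict(list)
    for rec in records:
        by_cluster[rec["cluster"]].append(rec)

    def excerpt(rec):
        # Store only the beginning of each text to keep the file small.
        return {"text": texts[rec["idx"]][:excerpt_chars], "distance": rec["distance"]}

    by_distance = lambda rec: rec["distance"]
    summary = {}
    for cluster, recs in by_cluster.items():
        total = sum(rec["distance"] for rec in recs)
        summary[cluster] = {
            "closest": [excerpt(r) for r in heapq.nsmallest(top_n, recs, key=by_distance)],
            "farthest": [excerpt(r) for r in heapq.nlargest(top_n, recs, key=by_distance)],
            "total_examples": len(recs),
            "average_distance": total / len(recs),
            "sum_distance": total,
        }
    return summary


toy_records = [
    {"idx": 0, "cluster": 0, "distance": 0.25},
    {"idx": 1, "cluster": 0, "distance": 0.75},
    {"idx": 2, "cluster": 0, "distance": 0.5},
    {"idx": 3, "cluster": 1, "distance": 0.125},
]
toy_texts = ["doc zero", "doc one", "doc two", "doc three"]
summary = summarize_clusters(toy_records, toy_texts, top_n=2)
print(summary[0]["total_examples"], summary[0]["average_distance"])
```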

After these files were obtained, I ran `03_sort_pile_clusters.py` to save the cluster assignment entries separated by cluster, one file per cluster. This makes the cluster exclusion during the later MiniPile creation steps more efficient.
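
The regrouping into one file per cluster can be sketched as below. The function name and the `cluster_XXX.jsonl` naming scheme are hypothetical and not necessarily what `03_sort_pile_clusters.py` uses.

```python
import json
import tempfile
from pathlib import Path


def split_by_cluster(records, out_dir):
    """Append every assignment record to the JSONL file of its cluster."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles = {}
    try:
        for rec in records:
            cluster = rec["cluster"]
            if cluster not in handles:
                # One open handle per cluster; a few hundred handles is unproblematic.
                path = out_dir / f"cluster_{cluster:03d}.jsonl"
                handles[cluster] = open(path, "w", encoding="utf-8")
            handles[cluster].write(json.dumps(rec) + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(path.name for path in out_dir.glob("cluster_*.jsonl"))


toy_records = [
    {"idx": 0, "cluster": 5, "distance": 0.2},
    {"idx": 1, "cluster": 17, "distance": 0.4},
    {"idx": 2, "cluster": 5, "distance": 0.3},
]

with tempfile.TemporaryDirectory() as tmp:
    files = split_by_cluster(toy_records, tmp)
    cluster_5 = (Path(tmp) / "cluster_005.jsonl").read_text(encoding="utf-8").splitlines()

print(files)
```

Within each cluster file the records keep their original relative order, since the input is streamed once from start to end.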

For now, the clustering step is concluded, producing one JSONL file per cluster with information about each cluster assignment per document.<br>
This intermediary dataset can be found here: [https://huggingface.co/datasets/Marcus2112/pile_dedup_embeddings_clusters](https://huggingface.co/datasets/Marcus2112/pile_dedup_embeddings_clusters).

Note that while the entry count is exactly identical to that of the original MiniPile, the size is not. Where the original MiniPile occupied $3.14\,\text{GB}$, the recreation requires $3.71\,\text{GB}$.
Several factors play into this:
- We use a slightly different embedding model, which may cause (at least occasional) deviations in the embeddings and thus lead to differently shaped clusters
- We select clusters by hand, which leaves room for interpreting the selection categories. I may have selected $38$ clusters according to the paper's categories, but my understanding of how to enforce them during selection may differ from the author's.
- The random sampling happened to select different examples per cluster.

Again, note that these indices are not per se applicable to the original dataset, but only to the embedded dataset.

### Human-Guided Cluster Exclusion

At this point, thanks in particular to the `MiniPile_BatchKMeans/cluster_info_for_inspection.json` file, we can start the human-guided exclusion of unwanted clusters.<br>
I strictly adhered to the paper and only sorted out clusters belonging to the categories laid out there, which I found to be well identifiable from the 10 examples per cluster.

The categories with the clusters I sorted out are as follows:
- Near Duplicates ($10, 15, 16, 22, 26, 28, 35, 37, 46, 51, 57, 64, 86, 87, 64, 102, 111, 114, 152, 163, 166, 218$)
- Pornography ($167$)
- Navigation Bars ($39, 88, 101, 155$)
- Product Specifications ($61, 200$)
- Long lists of named entities ($40, 44, 78, 90, 99, 103, 181, 196, 219$)
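
In code, such an exclusion list boils down to a set lookup. The sketch below copies the cluster IDs from the categories above into sets (so each ID counts once); the variable and function names are hypothetical.

```python
N_CLUSTERS = 220

EXCLUDED_BY_CATEGORY = {
    "near_duplicates": {10, 15, 16, 22, 26, 28, 35, 37, 46, 51, 57, 64, 86, 87,
                        102, 111, 114, 152, 163, 166, 218},
    "pornography": {167},
    "navigation_bars": {39, 88, 101, 155},
    "product_specifications": {61, 200},
    "named_entity_lists": {40, 44, 78, 90, 99, 103, 181, 196, 219},
}

# Union over all categories gives the full set of cluster IDs to drop.
excluded = set().union(*EXCLUDED_BY_CATEGORY.values())
kept = [c for c in range(N_CLUSTERS) if c not in excluded]


def is_kept(record):
    """True if a cluster-assignment record belongs to a cluster we keep."""
    return record["cluster"] not in excluded


print(is_kept({"idx": 0, "cluster": 5, "distance": 0.21}))
print(is_kept({"idx": 1, "cluster": 167, "distance": 0.18}))
```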

### MiniPile Distillation

With the cluster analysis concluded, we can now proceed to the distillation of the MiniPile dataset.<br>
This step draws on all the artifacts produced in the previous steps:<br>
- The embedded Pile dataset, from which we extract the documents
- The cluster assignments, from which we extract the documents assigned to the remaining clusters
- The cluster exclusion list, from which we exclude the documents assigned to the unwanted clusters

The distillation step is implemented in `03_distill_pile_embed.py`.<br>
Since this step involves a large amount of I/O, the script is written to extract exactly the documents chosen by our random sampling across the remaining clusters, and to do so as efficiently as possible.

The distillation step produces a new dataset, the MiniPile, which is a subset of the original Pile dataset.<br>
The created dataset contains exactly $1,010,500$ documents, as intended. The dataset is shuffled to spread cluster entries evenly across the dataset's splits.<br>
Additionally, I added a column `pile_idx` to each entry, denoting the original index of the document in the Pile dataset.<br>
This helped in making sure that the dataset actually captures a subset derived from across the entire embedded Pile.
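
The selection logic of this step can be sketched as follows. This is an illustration under assumptions, not the contents of `03_distill_pile_embed.py`: the function name, the fixed seed, and in particular the equal per-cluster sample size are hypothetical choices made to keep the example small.

```python
import random


def distill(cluster_to_indices, excluded, per_cluster, seed=42):
    """Sample documents from every kept cluster, then shuffle the selection."""
    rng = random.Random(seed)
    selected = []
    for cluster in sorted(cluster_to_indices):
        if cluster in excluded:
            continue  # documents of unwanted clusters are never sampled
        indices = cluster_to_indices[cluster]
        for idx in rng.sample(indices, min(per_cluster, len(indices))):
            # `pile_idx` records where the document sits in the embedded Pile.
            selected.append({"pile_idx": idx, "cluster": cluster})
    rng.shuffle(selected)  # spread clusters evenly before splitting
    return selected


toy_clusters = {
    0: list(range(0, 10)),
    1: list(range(10, 20)),
    2: list(range(20, 30)),
}
subset = distill(toy_clusters, excluded={1}, per_cluster=3)
print(len(subset), sorted({row["cluster"] for row in subset}))
```

Keeping `pile_idx` on every row makes it easy to check afterwards that the selection is drawn from across the whole embedded dataset.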

The resulting self-created MiniPile can be found here: [https://huggingface.co/datasets/Marcus2112/minipile_recreation](https://huggingface.co/datasets/Marcus2112/minipile_recreation).

I then adapted the `02_train_160M.py` script to train Pythia $160\text{M}$ on the newly created MiniPile.<br>
The adapted script is `03_train_160M_recreation.py`.

---

## Evaluate Pythia $160\text{M}$ Pile vs. Pythia $160\text{M}$ MiniPile (recreated)

We will use an exact copy of the training and the test setup previously used for benchmarking Pythia $160\text{M}$-Pile and $160\text{M}$-MiniPile-Original.<br>
Training is performed with the script `03_train_160M_recreation.py`.

The trained model can be found here: [https://huggingface.co/Marcus2112/pythia-160m-minipile_recreation](https://huggingface.co/Marcus2112/pythia-160m-minipile_recreation)

In [3]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from lm_eval import utils, simple_evaluate
from lm_eval.models.huggingface import HFLM

In [6]:
## Evaluation - Pythia 160M Trained on Self-Created MiniPile

device = "cuda" if torch.cuda.is_available() else "cpu"
pythia_minipile = AutoModelForCausalLM.from_pretrained(base_path / "pythia160m_minipile_Recreation_trained", local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained(base_path / "pythia160m_dedup_untrained", use_fast=True, local_files_only=True) # Use exact same tokenizer
pythia_minipile = pythia_minipile.to(device)
 
batch_size_hflm = 1

pythia_minipile_hflm = HFLM(pretrained=pythia_minipile,
                        tokenizer=tokenizer,
                        batch_size=batch_size_hflm)

results = simple_evaluate(model=pythia_minipile_hflm,
                          tasks=["arc_challenge", "mmlu", "winogrande", "hellaswag", "lambada", "blimp"],
                          num_fewshot=0,
                          batch_size=batch_size_hflm,
                          device=device,
                          limit=None)

with open('03_eval_160M_minipile_recreation.txt', 'w') as f:
    f.write(str(results))

print(utils.make_table(results))

2024-12-05:15:20:16,151 INFO     [huggingface.py:481] Using model type 'default'
2024-12-05:15:20:16,177 INFO     [evaluator.py:164] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2024-12-05:15:20:16,178 INFO     [evaluator.py:217] Using pre-initialized model





train-00000-of-00001.parquet:   0%|          | 0.00/54.4k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]



train-00000-of-00001.parquet:   0%|          | 0.00/56.3k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]



train-00000-of-00001.parquet:   0%|          | 0.00/47.4k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]



train-00000-of-00001.parquet:   0%|          | 0.00/47.3k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]



train-00000-of-00001.parquet:   0%|          | 0.00/49.5k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]



train-00000-of-00001.parquet:   0%|          | 0.00/49.9k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]



train-00000-of-00001.parquet:   0%|          | 0.00/51.6k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]



train-00000-of-00001.parquet:   0%|          | 0.00/67.9k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]



train-00000-of-00001.parquet:   0%|          | 0.00/78.2k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]



train-00000-of-00001.parquet:   0%|          | 0.00/49.0k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]



train-00000-of-00001.parquet:   0%|          | 0.00/49.7k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]



train-00000-of-00001.parquet:   0%|          | 0.00/47.3k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]



train-00000-of-00001.parquet:   0%|          | 0.00/41.5k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]



train-00000-of-00001.parquet:   0%|          | 0.00/39.2k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]



train-00000-of-00001.parquet:   0%|          | 0.00/62.2k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]

2024-12-05:15:25:43,876 INFO     [evaluator.py:266] num_fewshot has been set to 0 for blimp_adjunct_island in its config. Manual configuration will be ignored.
2024-12-05:15:25:43,877 INFO     [evaluator.py:266] num_fewshot has been set to 0 for blimp_anaphor_gender_agreement in its config. Manual configuration will be ignored.
2024-12-05:15:25:43,878 INFO     [evaluator.py:266] num_fewshot has been set to 0 for blimp_anaphor_number_agreement in its config. Manual configuration will be ignored.
2024-12-05:15:25:43,878 INFO     [evaluator.py:266] num_fewshot has been set to 0 for blimp_animate_subject_passive in its config. Manual configuration will be ignored.
2024-12-05:15:25:43,879 INFO     [evaluator.py:266] num_fewshot has been set to 0 for blimp_animate_subject_trans in its config. Manual configuration will be ignored.
2024-12-05:15:25:43,880 INFO     [evaluator.py:266] num_fewshot has been set to 0 for blimp_causative in its config. Manual configuration will be ignored.
2024-12-0

bootstrapping for stddev: perplexity

100%|██████████| 100/100 [00:01<00:00, 57.70it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 768)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-11): 12 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace

|                           Tasks                            |Version|Filter|n-shot|  Metric  |   |    Value    |   |   Stderr   |
|------------------------------------------------------------|------:|------|-----:|----------|---|------------:|---|-----------:|
|arc_challenge                                               |      1|none  |     0|acc       |↑  |       0.1894|±  |      0.0115|
|                                                            |       |none  |     0|acc_norm  |↑  |       0.2355|±  |      0.0124|
|blimp                                                       |      2|none  |      |acc       |↑  |       0.5481|±  |      0.0017|
| - blimp_adjunct_island                                     |      1|none  |     0|acc       |↑  |       0.6390|±  |      0.0152|
| - blimp_anaphor_gender_agreement                           |      1|none  |     0|acc       |↑  |       0.1920|±  |      0.0125|
| - blimp_anaphor_number_agreement                           |      1|none  |     0

One can consider this point a moment of truth for the study project.

If benchmark scores indicate significant deviation, any of several causes could be responsible, e.g.:
- The embedding process may be logically flawed
- The switch to `e5-base-4k` may not pay off in terms of embedding accuracy (our gamble for performance didn't work)
- The clustering process may be logically flawed
- The hand-selected clusters may not match the categories as laid out by the paper (interpretative differences amount to fundamentally different dataset characteristics)
- The index-based logical system used for saving memory and processing time may be inconsistent, leading us to sample wildly inaccurately
- The paper may not have been clear enough, and clusters were actually sampled in proportion to their size rather than with flat, equal sample counts across all of them
- All or some of the above.

However, if benchmark scores *do not* indicate significant deviation, this implies:
- The approach as laid out by the paper is reproducible
- The pipeline built until now works and can be safely improved upon
- The project from here on out may shift towards optimization and generalization, contributing to furthering the working approach

I split the results comparison into two parts. The first table compares the newly trained `160M Recreation` against `160M Pile Deduplicated`, while the second table compares `160M Recreation` against `160M MiniPile`.

### 160M Pile-Deduplicated vs. 160M Recreation

| Benchmark        | Measure    |     | 160M Pile Deduplicated | 160M Recreation             | Percentage Difference of Means | 95% Confidence Interval       | Interpretation                            |
| ---------------- | ---------- | --- | ---------------------- | --------------------------- | ------------------------------ | ----------------------------- | ----------------------------------------- |
| ARC-Challenge    | acc        | ↑   | **0.1997 ± 0.0117**    | 0.1894 ± 0.0115             | -5.1577                        | (0.0219; -0.0425)             | **Difference not significant**                |
| MMLU             | acc        | ↑   | **0.2299 ± 0.0035**    | 0.2295 ± 0.0035             | -0.1740                        | (0.0093; -0.0101)             | **Difference not significant**                |
| HellaSwag        | acc        | ↑   | **0.2903 ± 0.0045**    | 0.2604 ± 0.0044             | -10.2997                       | (-0.0176; -0.0422)            | Pile Deduplicated-trained better          |
| WinoGrande       | acc        | ↑   | 0.4964 ± 0.0141        | **0.5122 ± 0.0140**         | 3.1829                         | (0.0547; -0.0231)             | **Difference not significant**                |
| Lambada (OpenAI) | acc        | ↑   | **0.3689 ± 0.0067**    | 0.0000 ± 0.0000             | -100.0                         | (-0.3558; -0.3820)            | Pile Deduplicated-trained severely better |
| Lambada (OpenAI) | perplexity | ↓   | **31.2589 ± 1.1594**   | 1854408.3999 ± 148101.5978  | 5932317.3272                   | (2144656.2727; 1564098.0093)  | Pile Deduplicated-trained severely better |
| Lambada (Std)    | acc        | ↑   | **0.2335 ± 0.0059**    | 0.0000 ± 0.0000             | -100.0                         | (-0.2219; -0.2451)            | Pile Deduplicated-trained severely better |
| Lambada (Std)    | perplexity | ↓   | **172.7619 ± 7.7265**  | 11927123.2514 ± 1063672.928 | 6903692.5905                   | (14011749.4290; 9842151.5500) | Pile Deduplicated-trained severely better |
| BLiMP            | acc        | ↑   | **0.7294 ± 0.0015**    | 0.5481 ± 0.0017             | -24.8560                       | (-0.1769; -0.1857)            | Pile Deduplicated-trained better          |
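
The derived columns of these tables can be reproduced from the reported `mean ± stderr` pairs. The following is a minimal sketch under the assumption that the interval is a normal-approximation interval for the difference of two independent estimates, with $z = 1.96$; the function name `compare_scores` is hypothetical and not part of the project scripts. With these assumptions, the ARC-Challenge row above comes out as listed in the table.

```python
import math

Z_95 = 1.96  # two-sided 95% normal quantile (assumption, consistent with the table values)

def compare_scores(base_mean, base_stderr, new_mean, new_stderr, z=Z_95):
    """Compare two benchmark scores given as mean and standard error."""
    diff = new_mean - base_mean
    pct_diff = 100.0 * diff / base_mean
    # Standard error of the difference of two independent estimates
    se_diff = math.sqrt(base_stderr ** 2 + new_stderr ** 2)
    upper, lower = diff + z * se_diff, diff - z * se_diff
    # The difference counts as significant if the interval excludes zero
    significant = lower > 0 or upper < 0
    return pct_diff, (upper, lower), significant

# ARC-Challenge (acc): 160M Pile Deduplicated vs. 160M Recreation
pct, (upper, lower), significant = compare_scores(0.1997, 0.0117, 0.1894, 0.0115)
print(f"{pct:.4f} ({upper:.4f}; {lower:.4f}) significant={significant}")
```

The interval is returned as `(upper; lower)` to match the column layout used in the tables.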

### 160M MiniPile vs. 160M Recreation

| Benchmark        | Measure    |     | 160M MiniPile               | 160M Recreation                 | Percentage Difference of Means | 95% Confidence Interval         | Interpretation             |
| ---------------- | ---------- | --- | --------------------------- | ------------------------------- | ------------------------------ | ------------------------------- | -------------------------- |
| ARC-Challenge    | acc        | ↑   | **0.2125 ± 0.0120**         | 0.1894 ± 0.0115                 | -10.8706                       | (0.0095; -0.0577)               | **Difference not significant** |
| MMLU             | acc        | ↑   | **0.2699 ± 0.0037**         | 0.2295 ± 0.0035                 | -14.9685                       | (-0.0304; -0.0504)              | MiniPile-trained better    |
| HellaSwag        | acc        | ↑   | 0.2560 ± 0.0044             | **0.2604 ± 0.0044**             | 1.7188                         | (0.0166; -0.0078)               | **Difference not significant** |
| WinoGrande       | acc        | ↑   | 0.4720 ± 0.0140             | **0.5122 ± 0.0140**             | 8.5169                         | (0.0790; 0.0014)                | **Recreation better**          |
| Lambada (OpenAI) | acc        | ↑   | 0.0000 ± 0.0000             | 0.0000 ± 0.0000                 | -                              | -                               | -                          |
| Lambada (OpenAI) | perplexity | ↓   | 3033175.2693 ± 288926.5827  | **1854408.3999 ± 148101.5978**  | -38.8625                       | (-542407.4980; -1815126.2408)   | **Recreation severely better** |
| Lambada (Std)    | acc        | ↑   | 0.0000 ± 0.0000             | 0.0000 ± 0.0000                 | -                              | -                               | -                          |
| Lambada (Std)    | perplexity | ↓   | 27067951.3460 ± 2710040.191 | **11927123.2514 ± 1063672.928** | -55.9364                       | (-9434663.1814; -20846993.0080) | **Recreation severely better** |
| BLiMP            | acc        | ↑   | 0.5194 ± 0.0018             | **0.5481 ± 0.0017**             | 5.5256                         | (0.0336; 0.0238)                | **Recreation better**          |

Not only do the results match the original MiniPile on all benchmarks but one (MMLU); the recreation also substantially lowers perplexity on both Lambada benchmarks and achieves better results on WinoGrande and BLiMP.

This is interesting, because we composed `MiniPile_Recreation` with the exact same number of samples.<br>
Even so, the dataset is ~500MB larger (in text; the rest is additional idx column values and metadata). This difference, which can itself be explained by deviations in sampling and embedding, has brought down perplexity by ~39% and ~56%, respectively.

**What does this noticeable drop in perplexity imply?**
Lambada (OpenAI) and Lambada (Std) test whether a model can predict the final word of a passage from its broader context, so I consider them useful for checking whether dataset reduction impacts the capability to model longer-range dependencies. BLiMP, which improved as well, complements this by evaluating grammatical knowledge across $67$ distinct linguistic phenomena/subtasks in English.
It seems that the sampled data represents nuances of the English language better, and in such a way that the trained model could generalize grammatical rules from across the identified clusters much more easily.

I conclude that the selection and assembly of `MiniPile_Recreation` has been successful to the extent that it can be tried out for training Pythia $1.4\text{B}$.

---

## Evaluate Pythia $1.4\text{B}$ Pile vs. Pythia $1.4\text{B}$ MiniPile (recreated)

Once again, we will use an exact copy of the training and the test setup previously used for benchmarking Pythia $160\text{M}$-Pile and $160\text{M}$-MiniPile-Original.<br>
Training is performed with the script `03_train_1.4B_recreation.py`.

---

## Improve the Dataset Creation Process

The results from the recreation process directly influence the improvement.<br>
**The aim is to create a new SuperMiniPile dataset that is ideally smaller and more information-retaining, while leaving less of the data point selection to chance.**

Ideas:

- I won't touch the embedding part; it is necessary and works well
- Coverage-centered selection of documents for the SuperMiniPile dataset (larger clusters represented by more documents, smaller ones by fewer)
- Calculate an "importance value" for random examples in each (post-filter) cluster, with those examples ideally distributed across the cluster
- "Findings indicate that it is not the proportion of tokens occupied by high-utility data that aids acquisition, but rather the proportion of training steps assigned to such data" [On the effect of curriculum learning with developmental data for grammar acquisition (Opper, et al. 2023)](https://aclanthology.org/2023.conll-babylm.pdf)

- https://openreview.net/pdf?id=7D5EECbOaf9
- https://arxiv.org/pdf/2406.03057
- https://arxiv.org/pdf/2210.15809
- https://arxiv.org/pdf/2204.08499
- https://arxiv.org/pdf/2303.09540
- https://arxiv.org/pdf/2308.12284
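
To make the coverage-centered selection idea from the list above more concrete, here is a minimal sketch of how a fixed document budget could be split across clusters in proportion to their size. This is only one possible reading of the idea, not the final SuperMiniPile method; the function name `allocate_samples` and the largest-remainder rounding are assumptions made for illustration.

```python
def allocate_samples(cluster_sizes, budget):
    """Split `budget` documents across clusters in proportion to cluster size."""
    total = sum(cluster_sizes)
    if not 0 <= budget <= total:
        raise ValueError("budget must be between 0 and the total number of documents")
    if total == 0:
        return [0] * len(cluster_sizes)
    # Ideal (fractional) share of the budget per cluster
    raw = [budget * size / total for size in cluster_sizes]
    counts = [int(share) for share in raw]
    # Largest-remainder rounding: hand the leftover documents to the
    # clusters whose fractional share was cut off the most
    leftover = budget - sum(counts)
    by_fraction = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts

# Three hypothetical clusters and a budget of 10 documents
print(allocate_samples([100, 300, 600], 10))  # → [1, 3, 6]
```

The per-cluster counts always sum to the budget, so the total dataset size stays fixed while larger clusters contribute more documents.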

---

## Evaluate Pythia $160\text{M}$ SuperMiniPile

---

## Evaluate Pythia $1.4\text{B}$ Pretrained vs. Pythia $1.4\text{B}$ SuperMiniPile

In [None]:
import torch
import numpy as np
from pathlib import Path
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel
import torch.nn.functional as F
from tqdm import tqdm

def average_pool(last_hidden_states, attention_mask):
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

@torch.no_grad()
def get_embedding(text, tokenizer, model):
    inputs = tokenizer(f"query: {text}", max_length=1024, padding="max_length", truncation=True, return_tensors='pt')
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    outputs = model(**inputs)
    embeddings = average_pool(outputs.last_hidden_state, inputs['attention_mask'])
    embeddings = F.normalize(embeddings, p=2, dim=1)
    return embeddings.cpu().numpy()[0]

# Load model and tokenizer
model_path = "/mnt/data/e5-base-4k"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = AutoModel.from_pretrained(model_path, attn_implementation="sdpa")
model.eval()
model.to('cuda' if torch.cuda.is_available() else 'cpu')

# Directory containing parquet files
parquet_dir = Path("/mnt/data/Pile_Deduplicated_Embd")

for i in range(3):
    parquet_file = parquet_dir / f"shard_{i:09d}.parquet"
    if not parquet_file.exists():
        print(f"File {parquet_file} not found. Skipping.")
        continue

    dataset = load_dataset("parquet", data_files=str(parquet_file))["train"]
    print(f"Iterating over shard {i}....")
    for entry in tqdm(dataset):
        stored_embedding = np.array(entry['embedding'])
        text = entry['text']
        # Generate new embedding
        new_embedding = get_embedding(text, tokenizer, model)
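        # Round both to fp16 precision, then compute in fp32 to limit accumulation error
        stored_embedding_fp16 = stored_embedding.astype(np.float16).astype(np.float32)
        new_embedding_fp16 = new_embedding.astype(np.float16).astype(np.float32)
        cosine_similarity = np.dot(stored_embedding_fp16, new_embedding_fp16) / (np.linalg.norm(stored_embedding_fp16) * np.linalg.norm(new_embedding_fp16))
        # 1.0 as in identical, 0.0 as in orthogonal
        # Exact float equality is brittle, so compare with a tolerance (1e-3 is an assumed threshold)
        if not np.isclose(cosine_similarity, 1.0, atol=1e-3):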
            print(f"Text: {text}")
            print(f"Cosine similarity: {cosine_similarity}")

In [None]:
from datasets import load_dataset

# Specify the path to your Parquet file
data_files_1 = {"train": "/vol/tmp/koppelmm/Pile_Deduplicated_Embd/shard_000000001.parquet"}
data_files_2 = {"train": "/vol/tmp/koppelmm/Pile_Deduplicated_Embd/shard_000000002.parquet"}

# Load the dataset
dataset_1 = load_dataset("parquet", data_files=data_files_1, split="train")
dataset_2 = load_dataset("parquet", data_files=data_files_2, split="train")

# Print the number of entries in the dataset
print(f"Number of entries in shard 1: {len(dataset_1)} ({128*8192})")
print(f"Number of entries in shard 2: {len(dataset_2)} ({128*8192})")

# Print the first entry from each dataset
first_entry_1 = dataset_1[0]
first_entry_2 = dataset_2[0]

print("First entry in shard 1:")
print(first_entry_1)

print("First entry in shard 2:")
print(first_entry_2)

# Check if the first entries are different
if first_entry_1 == first_entry_2:
    print("The first entries are identical.")
else:
    print("The first entries are different.")

In [11]:
import heapq

def update_heap(heap, item, key_func, max_size=5, reverse=False):
    key = key_func(item)
    heapq.heappush(heap, (key if not reverse else -key, item))
    if len(heap) > max_size:
        heapq.heappop(heap)

def test_update_heap():
    # Test ascending order heap (keep largest items based on `value`)
    heap = []
    items = [{'value': i} for i in range(10)]  # Items with values from 0 to 9

    for item in items:
        update_heap(heap, item, key_func=lambda x: x['value'], max_size=5, reverse=False)

    # Extract the final items from the heap, sorted by `value`
    result = [entry[1] for entry in sorted(heap, key=lambda x: x[0])]
    expected = [{'value': i} for i in range(5, 10)]

    assert result == expected, f"Unexpected heap contents: {result} vs. expected {expected}"

    # Test descending order heap (keep smallest items based on `value`)
    heap = []
    for item in items:
        update_heap(heap, item, key_func=lambda x: x['value'], max_size=5, reverse=True)

    # Keys are stored negated, so sort them in descending order to list values ascending
    result = [entry[1] for entry in sorted(heap, key=lambda x: x[0], reverse=True)]
    expected = [{'value': i} for i in range(5)]

    assert result == expected, f"Unexpected heap contents: {result} vs. expected {expected}"

    print("All tests passed!")

test_update_heap()

All tests passed!

In [3]:
from multiprocessing import cpu_count

print(cpu_count() // 2)

36
