# Improve the MiniPile Pipeline

**Objectives:**
- [x] Improve the dataset creation process, create new SuperMiniPile dataset (ideally smaller and more information-retaining)
- [x] Evaluate Pythia $160\text{M}$ on SuperMiniPile on MMLU and ARC-Challenge
- [.] Evaluate and compare performances of Pythia $1.4\text{B}$ Pile Deduplicated vs. trained on SuperMiniPile on the MMLU and ARC benchmarks

---

The results and implementation specifics from the reproduction directly affect the investigation for improvements.<br>
**The aim is to investigate different ideas for creating a new SuperMiniPile dataset that is ideally smaller and/or more information-retaining.**

### Idea 1 - Cluster-Proportionate Sampling

The original MiniPile dataset was created by sampling *equal* amounts of documents from each of the non-excluded clusters. This results in a MiniPile that cannot represent the original dataset's cluster distribution anymore, but rather imposes a uniform distribution across the clusters, no matter their size or importance/'weight'.

For a first improvement attempt, named 'Proportionate', we stay as close to the reproduction code as possible. However, instead of sampling equal numbers of documents from each remaining cluster, we sample a number of documents proportionate to each cluster's size (by document count). This requires treating the total document count as an upper bound rather than a fixed requirement, as the rounded per-cluster shares may not add up to the exact target. We just don't want to exceed the MiniPile document count.
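
The allocation step can be sketched as follows. This is a minimal illustration, not the code from the distillation script: the function name, the cluster sizes and the budget are hypothetical, and flooring each share is one assumed way to keep the total at or below the upper bound.

```python
def proportional_allocation(cluster_sizes, budget):
    """Allocate per-cluster sample counts proportional to cluster size.

    Flooring every share guarantees that the total never exceeds `budget`,
    which is why the budget acts as an upper bound rather than an exact target.
    """
    total = sum(cluster_sizes.values())
    return {
        cluster_id: (size * budget) // total
        for cluster_id, size in cluster_sizes.items()
    }


# Hypothetical document counts per cluster (not taken from the Pile).
sizes = {0: 50, 1: 30, 2: 20}
allocation = proportional_allocation(sizes, budget=7)
print(allocation)  # {0: 3, 1: 2, 2: 1}
```

In this toy case only 6 of the 7 budgeted documents are sampled, which mirrors why the resulting dataset can end up slightly below the target count.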

As a side note, the original script turned out to be relatively memory-demanding and, moreover, cache-hungry.<br>
The parquet data-retrieval process was to blame for that and was thoroughly reworked: Pandas was replaced with NumPy and explicit cache management, which in turn removed the need for deep copying.

This idea's distillation script is implemented in `03_distill_pile_embed_idea_1_proportionate.py`.<br>
Based on the $1,010,409$ documents assembled this way, a Pythia $160\text{M}$ was trained for evaluation.

This idea's dataset can be found here: [https://huggingface.co/datasets/Marcus2112/minipile_cluster-proportioned](https://huggingface.co/datasets/Marcus2112/minipile_cluster-proportioned)<br>
This idea's trained model can be found here: [https://huggingface.co/Marcus2112/pythia-160m-minipile_cluster-proportioned](https://huggingface.co/Marcus2112/pythia-160m-minipile_cluster-proportioned)

The benchmark results are as follows, compared to Pile, MiniPile and MiniPile Reproduction:

| Benchmark        | Measure    |     | 160M Pile Deduplicated | 160M MiniPile               | 160M Reproduction           | 160M Cluster-Proportionate   |
| ---------------- | ---------- | --- | ---------------------- | --------------------------- | --------------------------- | ---------------------------- |
| ARC-Challenge    | acc        | ↑   | 0.1997 ± 0.0117        | **0.2125 ± 0.0120**         | 0.1894 ± 0.0115             | 0.1928 ± 0.0115              |
| MMLU             | acc        | ↑   | 0.2299 ± 0.0035        | **0.2699 ± 0.0037**         | 0.2295 ± 0.0035             | 0.2295 ± 0.0035              |
| HellaSwag        | acc        | ↑   | **0.2903 ± 0.0045**    | 0.2560 ± 0.0044             | 0.2604 ± 0.0044             | 0.2613 ± 0.0044              |
| WinoGrande       | acc        | ↑   | 0.4964 ± 0.0141        | 0.4720 ± 0.0140             | **0.5122 ± 0.0140**         | 0.5051 ± 0.0141              |
| Lambada (OpenAI) | acc        | ↑   | **0.3689 ± 0.0067**    | 0.0000 ± 0.0000             | 0.0000 ± 0.0000             | 0.0000 ± 0.0000              |
| Lambada (OpenAI) | perplexity | ↓   | **31.2589 ± 1.1594**   | 3033175.2693 ± 288926.5827  | 1854408.3999 ± 148101.5978  | 2214257.4651 ± 184064.6008   |
| Lambada (Std)    | acc        | ↑   | **0.2335 ± 0.0059**    | 0.0000 ± 0.0000             | 0.0000 ± 0.0000             | 0.0000 ± 0.0000              |
| Lambada (Std)    | perplexity | ↓   | **172.7619 ± 7.7265**  | 27067951.3460 ± 2710040.191 | 11927123.2514 ± 1063672.928 | 15143084.5983 ± 1387627.8650 |
| BLiMP            | acc        | ↑   | **0.7294 ± 0.0015**    | 0.5194 ± 0.0018             | 0.5481 ± 0.0017             | 0.5452 ± 0.0017              |

The proportional sampling approach yields a dataset that is ~300 MB smaller than the Reproduction dataset, while also containing $91$ fewer entries.<br>
At the same time, proportional sampling yields close to equal results in all but the perplexity benchmarks, where it underperforms compared to the Reproduction dataset, yet still beats the original MiniPile by a large margin.

> We seem to have lost some highly informative examples from smaller clusters when scaling down their representation to match cluster proportions.

*The proportional sampling approach is not a clear improvement over the reproduction*, but rather a compromise between the original MiniPile and the Reproduction dataset (though tending towards the latter in results). This is interesting, as it implies the uniform sampling approach may have been more effective in capturing the dataset's most informative examples than the proportional sampling approach. I do not think that is a fault of the proportional sampling approach, but rather a symptom arising from the original dataset's composition.

> Even though we incorporate the original dataset topology more directly, this doesn't necessarily come along with a better focus on the most informative examples.

Proportional sampling is neither a failure nor a universal solution, and neither is uniform sampling, but the dismissal of either approach would be premature.<br>
Instead, sampling styles should be context-dependent.<br>
I assume that when performance on specific downstream tasks, such as language modeling, is stated as the primary goal, incorporating adaptive weighting mechanisms that emphasize the contribution of critical clusters might yield superior results.

I conclude that, while itself not a clear improvement, the proportional sampling idea is valid for e.g. approaches where it is important to capture the original dataset's cluster distribution more closely. If the main goal is retaining performance, however, future ideas should investigate how to find a content-based weighting factor per cluster.

### Idea 2 - Hybrid Loss-based Sampling

In order to sample documents considered most representative and informative, the original MiniPile uses a one-shot proxy-based geometric sampling strategy. After all, the pipeline doesn't really 'select' documents by their content, but by their embedding's position in the cluster space, relative to other documents by proxy of the centroid.<br>
Beyond that, once clusters have been determined and selected, the pipeline samples randomly across each cluster.<br>
The comparison of different subset assembly techniques performed by [(Guo et al. 2022)](https://arxiv.org/abs/2204.08499) concludes that random sampling can be considered a very robust baseline for custom subset selection efforts, adaptable to various tasks.

I deduce that cluster-wise random sampling, as already performed, is effective, fast and versatile, but cannot explicitly consider how informative each individual document actually is. To an extent, we rely on luck and sampling spread to find the best MiniPile distillate.

Instead, what if we could build on the heavy lifting done by the embedding and clustering steps, while adding an information-based signal that guides both the per-cluster sampling weights and the selection of specific documents, and thus addresses the findings of Idea 1 (Cluster-Proportionate Sampling)?

**Idea 2, called Lossi (Loss-informed Sampling)**, reuses much of the existing pipeline for the sake of time and efficiency.

Lossi as a whole consists of several adaptations.<br>
The main idea is to use a small proxy model to determine the informativeness of each cluster and then sample documents from each cluster proportionate to their informativeness. We can do this at two points during dataset assembly. 

What do I mean by that?

- Per cluster: Uniformly sample $n$ (e.g. $1,000$) documents and determine their loss with a small Pythia $70\text{M}$ proxy model
- Use the mean loss as a heuristic for the cluster's informativeness and weight the cluster's representation in the final dataset by this value
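
The two steps above can be sketched as follows. This is a hedged illustration only: the function name, the toy loss values and the normalization of mean losses into budget shares are assumptions, and in the real pipeline the per-document losses would come from the Pythia $70\text{M}$ proxy model rather than being hard-coded.

```python
from statistics import fmean


def loss_weighted_allocation(sampled_losses, cluster_sizes, budget):
    """Turn per-cluster mean proxy losses into per-cluster sample counts.

    Assumption: each cluster's share of the budget is its mean loss divided
    by the sum of all mean losses, capped at the cluster's document count.
    """
    mean_loss = {c: fmean(losses) for c, losses in sampled_losses.items()}
    total_loss = sum(mean_loss.values())
    allocation = {}
    for cluster_id, loss in mean_loss.items():
        share = int(budget * loss / total_loss)  # floor keeps us within budget
        allocation[cluster_id] = min(share, cluster_sizes[cluster_id])
    return allocation


# Hypothetical losses of the documents sampled from three clusters.
losses = {"a": [2.0, 4.0], "b": [1.0, 1.0], "c": [6.0]}
sizes = {"a": 1_000, "b": 1_000, "c": 50}
allocation = loss_weighted_allocation(losses, sizes, budget=100)
print(allocation)  # {'a': 30, 'b': 10, 'c': 50}
```

Cluster `c` has the highest mean loss and would receive 60 documents, but it only contains 50, so the cap applies and the total stays below the budget.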

Note that we select the smallest Pythia model, trained half-way through the Pile, which assumes that the Pile is shuffled with regard to cluster assignments. I deliberately chose a small proxy model for this: MiniPile was intended for use in constrained academic settings, so we have to make do with small models that are as universally trainable as possible, while not being so small that they become unusable as a proxy. The same reasoning applies to why the $70\text{M}$ model is only trained on half of the full Pile dataset.

The distillation scripts are implemented in `03_distill_pile_embed_idea_2_lossi_1.py` and `03_distill_pile_embed_idea_2_lossi_2.py`.<br>
Based on the resulting dataset, a Pythia $160\text{M}$ was trained for evaluation.

The dataset can be found here: [https://huggingface.co/datasets/Marcus2112/minipile_loss-sampled](https://huggingface.co/datasets/Marcus2112/minipile_loss-sampled)<br>
This idea's trained model can be found here: [https://huggingface.co/Marcus2112/pythia-160m-minipile_loss-sampled](https://huggingface.co/Marcus2112/pythia-160m-minipile_loss-sampled)

The benchmark results are as follows, compared to the original MiniPile and the Reproduction:

| Benchmark        | Measure    |     | 160M MiniPile                | 160M Lossi                       | Percentage Difference of Means | 95% Confidence Interval         | Interpretation             |
| ---------------- | ---------- | --- | ---------------------------- | -------------------------------- | ------------------------------ | ------------------------------- | -------------------------- |
| ARC-Challenge    | acc        | ↑   | **0.2125 ± 0.0120**          | 0.1980 ± 0.0116                  | -6.8235                        | (0.0182; -0.0472)               | Difference not significant |
| MMLU             | acc        | ↑   | **0.2699 ± 0.0037**          | 0.2295 ± 0.0035                  | -14.9685                       | (-0.0304; -0.0504)              | MiniPile-trained better    |
| HellaSwag        | acc        | ↑   | 0.2560 ± 0.0044              | **0.2599 ± 0.0044**              | 1.5234                         | (0.0161; -0.0083)               | Difference not significant |
| WinoGrande       | acc        | ↑   | 0.4720 ± 0.0140              | **0.5107 ± 0.0140**              | 8.1992                         | (0.0775; -0.0001)               | Difference not significant |
| Lambada (OpenAI) | acc        | ↑   | 0.0000 ± 0.0000              | 0.0000 ± 0.0000                  | -                              | -                               | -                          |
| Lambada (OpenAI) | perplexity | ↓   | 3033175.2693 ± 288926.5827   | **2116445.1732 ± 175403.0579**   | -30.2234                       | (-254247.7681; -1579212.4241)   | Lossi 1 severely better    |
| Lambada (Std)    | acc        | ↑   | 0.0000 ± 0.0000              | 0.0000 ± 0.0000                  | -                              | -                               | -                          |
| Lambada (Std)    | perplexity | ↓   | 27067951.3461 ± 2710040.1910 | **14896599.9251 ± 1366937.5470** | -44.9659                       | (-6222231.2223; -18120471.6197) | Lossi 1 severely better    |
| BLiMP            | acc        | ↑   | 0.5194 ± 0.0018              | **0.5492 ± 0.0017**              | 5.7374                         | (0.0347; 0.0249)                | Lossi 1 better             |

<br>
<br>

| Benchmark        | Measure    |     | 160M Reproduction                | 160M Lossi                   | Percentage Difference of Means | 95% Confidence Interval      | Interpretation             |
| ---------------- | ---------- | --- | -------------------------------- | ---------------------------- | ------------------------------ | ---------------------------- | -------------------------- |
| ARC-Challenge    | acc        | ↑   | 0.1894 ± 0.0115                  | **0.1980 ± 0.0116**          | 4.5407                         | (0.0406; -0.0234)            | Difference not significant |
| MMLU             | acc        | ↑   | 0.2295 ± 0.0035                  | 0.2295 ± 0.0035              | 0.0000                         | (0.0097; -0.0097)            | Difference not significant |
| HellaSwag        | acc        | ↑   | **0.2604 ± 0.0044**              | 0.2599 ± 0.0044              | -0.1920                        | (0.0117; -0.0127)            | Difference not significant |
| WinoGrande       | acc        | ↑   | **0.5122 ± 0.0140**              | 0.5107 ± 0.0140              | -0.2929                        | (0.0373; -0.0403)            | Difference not significant |
| Lambada (OpenAI) | acc        | ↑   | 0.0000 ± 0.0000                  | 0.0000 ± 0.0000              | -                              | -                            | -                          |
| Lambada (OpenAI) | perplexity | ↓   | **1854408.3999 ± 148101.5978**   | 2116445.1732 ± 175403.0579   | 14.1305                        | (711985.1414; -187911.5948)  | Difference not significant |
| Lambada (Std)    | acc        | ↑   | 0.0000 ± 0.0000                  | 0.0000 ± 0.0000              | -                              | -                            | -                          |
| Lambada (Std)    | perplexity | ↓   | **11927123.2514 ± 1063672.9280** | 14896599.9251 ± 1366937.5470 | 24.8968                        | (6364250.0618; -425296.7144) | Difference not significant |
| BLiMP            | acc        | ↑   | 0.5481 ± 0.0017                  | **0.5492 ± 0.0017**          | 0.2007                         | (0.0058; -0.0036)            | Difference not significant |
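
The 'Percentage Difference of Means' and '95% Confidence Interval' columns are consistent with a normal approximation on the difference of two independent means, using the reported standard errors. The source does not state the exact procedure, so the following is a sketch under that assumption; the function name is hypothetical. Note that the tables list each interval as (upper; lower).

```python
import math


def compare_means(mean_a, se_a, mean_b, se_b, z=1.96):
    """Compare model B against baseline A using reported means and std errors."""
    diff = mean_b - mean_a
    # Percentage difference relative to the baseline; undefined for a zero mean.
    pct = 100.0 * diff / mean_a if mean_a != 0 else None
    # Standard error of the difference, assuming independent estimates.
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    low, high = diff - z * se_diff, diff + z * se_diff
    # The difference counts as significant if the interval excludes zero.
    significant = low > 0 or high < 0
    return pct, (low, high), significant


# ARC-Challenge, MiniPile (A) vs. Lossi (B), values taken from the table above.
pct, (low, high), significant = compare_means(0.2125, 0.0120, 0.1980, 0.0116)
print(round(pct, 4), round(low, 4), round(high, 4), significant)
# -6.8235 -0.0472 0.0182 False
```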

A clear improvement over the original MiniPile can be observed. However, this can't be attributed to the loss-informed sampling approach, as the Reproduction dataset performs nearly identically. This implies that the changes introduced unintentionally during the reproduction contribute more than loss-informed sampling adds on top, if it adds anything at all. These changes include:

- The embedding process, which during the Reproduction uses a slightly worse model than the original MiniPile, but processes larger context sizes for the embeddings (1024 instead of 512)
- The clustering process, which may induce different cluster formations based on the different embeddings
- The interpretability inherent in the cluster exclusion process, which may have led to a more effective dataset

This can imply that changes in the embedding and clustering process may be more effective than the loss-informed sampling approach itself.<br>
The proxy model may not be able to capture the full complexity of the dataset, and thus the loss-informed sampling approach may not be able to capture the most informative examples. This in turn would put the loss-informed sampling approach at a disadvantage compared to the uniform sampling approach due to decreased efficiency.

### Idea 3 - Density-Proportionate Cluster Sampling

An inherent danger with loss-based informativeness approximation is the potential for skewing the dataset towards harder examples, which could make the dataset less representative of general tasks. While loss can be considered a related indicator, it is not a direct measure of informativeness.

Taking a step back with this insight, we should recall that MiniPile's goal is to capture the most representative subset of the 'insights' that can be gained from the original Pile dataset. To achieve that, we can look at diversity, which could improve generalization, especially when broad informational coverage is required.

Specifically, sampling inversely proportional to the cluster density prioritizes sparse regions of the dataset and thus broad coverage of the dataset's information. However, as there is no free lunch, over-representing sparse clusters will likely introduce noise, especially when cluster sparsity correlates with low-quality examples. One could argue that we already excluded the least interesting, and thus most noise-inducing, clusters, but this does not mitigate the potential for noise itself (and it may not be fully mitigable). Density-based sampling therefore has to be understood as a trade-off between breadth of representation and accidentally capturing noise.

And still, density-based sampling could be helpful in capturing a most diverse set of examples, which in turn could be beneficial for generalization.
But, we have to find a way to mitigate noise and over-representation at least to some extent.

I therefore propose a density-based sampling approach that calculates cluster contribution proportions like so:
$$\text{Cluster Proportion} = \frac{|C_i|}{|\bigcup_{j} C_j|} \cdot (1 - \omega \cdot \rho(C_i))$$

where $|C_i|$ is the number of documents in cluster $i$, $|\bigcup_{j} C_j|$ is the total number of documents in all clusters, and $\rho(C_i)$ is the density of cluster $i$. The impact of the density is scaled by the hyperparameter $\omega$, reducing the factor of over-representation of thoroughly sparse clusters.<br>
I set $\omega = 0.5$ with the intention of having neither cluster size nor density completely dominate the proportion calculation.
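
A minimal sketch of this proportion calculation follows. It assumes that $\rho(C_i)$ is already normalized to $[0, 1]$, which the text above does not specify; the function name and the toy clusters are hypothetical.

```python
def density_proportions(cluster_sizes, densities, omega=0.5):
    """Size share of each cluster, scaled down by its omega-weighted density.

    Assumes densities are normalized to [0, 1], so each factor
    (1 - omega * rho) lies between 1 - omega and 1.
    """
    total = sum(cluster_sizes.values())
    return {
        cluster_id: (size / total) * (1.0 - omega * densities[cluster_id])
        for cluster_id, size in cluster_sizes.items()
    }


# Hypothetical clusters: a large dense one and a smaller sparse one.
sizes = {"dense": 600, "sparse": 400}
rho = {"dense": 0.8, "sparse": 0.2}
proportions = density_proportions(sizes, rho, omega=0.5)
# dense:  0.6 * (1 - 0.5 * 0.8) = 0.36
# sparse: 0.4 * (1 - 0.5 * 0.2) = 0.36
```

In this toy case the density term exactly offsets the size advantage of the dense cluster. Under the normalization assumption the proportions sum to less than 1 whenever any density is positive, so the sampled dataset stays below the document budget.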

The dataset can be found here: [https://huggingface.co/datasets/Marcus2112/minipile_density-proportioned](https://huggingface.co/datasets/Marcus2112/minipile_density-proportioned)<br>
The trained model can be found here: [https://huggingface.co/Marcus2112/pythia-160m-minipile_density-proportioned](https://huggingface.co/Marcus2112/pythia-160m-minipile_density-proportioned)

These results emerge when comparing benchmarks against Pythia $160\text{M}$ MiniPile and Pythia $160\text{M}$ MiniPile Reproduction:

| Benchmark        | Measure    |     | 160M MiniPile                | 160M Density                     | Percentage Difference of Means | 95% Confidence Interval         | Interpretation              |
| ---------------- | ---------- | --- | ---------------------------- | -------------------------------- | ------------------------------ | ------------------------------- | --------------------------- |
| ARC-Challenge    | acc        | ↑   | **0.2125 ± 0.0120**          | 0.1920 ± 0.0115                  | -9.6471                        | (0.0121; -0.0531)               | Difference not significant  |
| MMLU             | acc        | ↑   | **0.2699 ± 0.0037**          | 0.2295 ± 0.0035                  | -14.9685                       | (-0.0304; -0.0504)              | MiniPile better             |
| HellaSwag        | acc        | ↑   | 0.2560 ± 0.0044              | **0.2604 ± 0.0044**              | 1.7188                         | (0.0166; -0.0078)               | Difference not significant  |
| WinoGrande       | acc        | ↑   | 0.4720 ± 0.0140              | **0.5201 ± 0.0140**              | 10.1907                        | (0.0869; 0.0093)                | **Density better**          |
| Lambada (OpenAI) | acc        | ↑   | 0.0000 ± 0.0000              | 0.0000 ± 0.0000                  | -                              | -                               | -                           |
| Lambada (OpenAI) | perplexity | ↓   | 3033175.2693 ± 288926.5827   | **2099002.0912 ± 170652.6222**   | -30.7985                       | (-276474.4857; -1591871.8705)   | **Density severely better** |
| Lambada (Std)    | acc        | ↑   | 0.0000 ± 0.0000              | 0.0000 ± 0.0000                  | -                              | -                               | -                           |
| Lambada (Std)    | perplexity | ↓   | 27067951.3461 ± 2710040.1910 | **13347273.6076 ± 1997894.6360** | -50.6898                       | (-7121587.1522; -20319768.3248) | **Density severely better** |
| BLiMP            | acc        | ↑   | 0.5194 ± 0.0018              | **0.5501 ± 0.0017**              | 5.9107                         | (0.0356; 0.0258)                | **Density better**          |

<br>
<br>

| Benchmark        | Measure    |     | 160M Reproduction                | 160M Density                 | Percentage Difference of Means | 95% Confidence Interval       | Interpretation             |
| ---------------- | ---------- | --- | -------------------------------- | ---------------------------- | ------------------------------ | ----------------------------- | -------------------------- |
| ARC-Challenge    | acc        | ↑   | 0.1894 ± 0.0115                  | **0.1920 ± 0.0115**          | 1.3728                         | (0.0345; -0.0293)             | Difference not significant |
| MMLU             | acc        | ↑   | 0.2295 ± 0.0035                  | 0.2295 ± 0.0035              | 0.0000                         | (0.0097; -0.0097)             | Difference not significant |
| HellaSwag        | acc        | ↑   | 0.2604 ± 0.0044                  | 0.2604 ± 0.0044              | 0.0000                         | (0.0122; -0.0122)             | Difference not significant |
| WinoGrande       | acc        | ↑   | 0.5122 ± 0.0140                  | **0.5201 ± 0.0140**          | 1.5424                         | (0.0467; -0.0309)             | Difference not significant |
| Lambada (OpenAI) | acc        | ↑   | 0.0000 ± 0.0000                  | 0.0000 ± 0.0000              | -                              | -                             | -                          |
| Lambada (OpenAI) | perplexity | ↓   | **1854408.3999 ± 148101.5978**   | 2099002.0912 ± 170652.6222   | 13.1899                        | (687468.6952; -198281.3126)   | Difference not significant |
| Lambada (Std)    | acc        | ↑   | 0.0000 ± 0.0000                  | 0.0000 ± 0.0000              | -                              | -                             | -                          |
| Lambada (Std)    | perplexity | ↓   | **11927123.2514 ± 1063672.9280** | 13347273.6076 ± 1997894.6360 | 11.9069                        | (5856415.8784; -3016115.1660) | Difference not significant |
| BLiMP            | acc        | ↑   | 0.5481 ± 0.0017                  | **0.5501 ± 0.0017**          | 0.3649                         | (0.0067; -0.0027)             | Difference not significant |

Unlike the loss-based approach, and except for the perplexity benchmarks, the density-based sampling approach is on par with or marginally better than the Reproduction and thus noticeably better than the original MiniPile. This implies that the density-based sampling approach is a valid improvement over the original MiniPile, but not necessarily over the Reproduction. This in turn implies that the Reproduction pipeline itself, as was the case with the loss-based sampling approach, is the largest contributor to the improvements seen in the benchmarks. Still, albeit statistically insignificant, the density-based sampling approach brings a slight improvement over the Reproduction in some areas.

**However**, despite the benchmarks showing only marginal improvements or equal results, I conclude that this approach is a real improvement.<br>
This is because, while the benchmarks show nearly equal results between the Reproduction and the Density-based approach, the datasets themselves are not of the same size:
- MiniPile Reproduction: $1,010,500$ documents at $3.76$ GB,
- MiniPile Density-Proportioned: $946,465$ documents at $3.25$ GB.

The Density-Proportioned dataset contains $64,035$ fewer documents (~6.34%) and takes up $0.51$ GB (~13.56%) less disk space.<br>
(For the training split specifically, this means a reduction from $1,000,000$ to $936,630$ documents, i.e. $63,370$ fewer documents, or ~$6.34\%$.)
Still, it produces slightly better or equal results (except for perplexity) compared to the Reproduction and much better results (except on MMLU) than the original MiniPile. Therefore, I conclude that applying weighted density-based sampling is indeed a valid improvement over the Reproduction and the original distilled subset, at least with The Pile as the underlying dataset.

### Idea 3.1 - Changing $\omega$

The above described density-based sampling approach uses a hyperparameter $\omega$ to scale the impact of the density on the cluster proportion calculation like so:
$$\text{Cluster Proportion} = \frac{|C_i|}{|\bigcup_{j} C_j|} \cdot (1 - \omega \cdot \rho(C_i))$$

In the above attempt, $\omega$ was set to $0.5$ with an intent of balancing the impact of cluster size and density. However, this value was chosen arbitrarily and may not be optimal. Thus, the impact of choosing a different value for $\omega$ should be investigated for potential improvements and further insights.

The higher we set $\omega$, the more we increase the impact of the density on the cluster proportion calculation. In theory, this could lead to a more diverse dataset, but also to a higher risk of over-representing sparse clusters and thus introducing noise.

We've seen the results of cluster-proportionate sampling, i.e. $\omega = 0$, and density-based sampling, i.e. $\omega = 0.5$. Now, we will investigate the impact of setting $\omega = 0.75$ on the dataset creation process. By increasing $\omega$, the above formula emphasizes sampling from clusters in lower-density regions more strongly. With a higher $\omega$, the term $1 - \omega \cdot \rho(C_i)$ will reach 0 more easily, thus only leaving room effectively for clusters with lower $\rho(C_i)$ to be sampled from.
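
As a worked example of this effect, assume (hypothetically) that $\rho$ is normalized to $[0, 1]$ and compare a sparse cluster with $\rho = 0.2$ to a dense cluster with $\rho = 0.8$. These values are illustrative and not taken from the actual clustering:

$$\omega = 0.5: \quad 1 - 0.5 \cdot 0.2 = 0.90, \quad 1 - 0.5 \cdot 0.8 = 0.60, \quad \tfrac{0.90}{0.60} = 1.5$$

$$\omega = 0.75: \quad 1 - 0.75 \cdot 0.2 = 0.85, \quad 1 - 0.75 \cdot 0.8 = 0.40, \quad \tfrac{0.85}{0.40} = 2.125$$

Relative to its size, the sparse cluster is thus weighted $1.5$ times as strongly as the dense one at $\omega = 0.5$, and about $2.1$ times as strongly at $\omega = 0.75$.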

The dataset can be found here: [https://huggingface.co/datasets/Marcus2112/minipile_low-density-proportioned](https://huggingface.co/datasets/Marcus2112/minipile_low-density-proportioned)<br>
The trained model can be found here: [https://huggingface.co/Marcus2112/pythia-160m-minipile_low-density](https://huggingface.co/Marcus2112/pythia-160m-minipile_low-density)

The benchmark results are as follows, where $\omega = 0.5$ is called 'Density' and $\omega = 0.75$ is called 'Low Density':

| Benchmark        | Measure    |     | 160M Density                     | 160M Low Density             | Percentage Difference of Means | 95% Confidence Interval       | Interpretation             |
| ---------------- | ---------- | --- | -------------------------------- | ---------------------------- | ------------------------------ | ----------------------------- | -------------------------- |
| ARC-Challenge    | acc        | ↑   | **0.1920 ± 0.0115**              | 0.1886 ± 0.0114              | -1.7708                        | (0.0283; -0.0351)             | Difference not significant |
| MMLU             | acc        | ↑   | 0.2295 ± 0.0035                  | 0.2295 ± 0.0035              | 0.0000                         | -                             | -                          |
| HellaSwag        | acc        | ↑   | **0.2604 ± 0.0044**              | 0.2508 ± 0.0044              | -3.6866                        | (0.0026; -0.0218)             | Difference not significant |
| WinoGrande       | acc        | ↑   | **0.5201 ± 0.0140**              | 0.5067 ± 0.0141              | -2.5764                        | (0.0255; -0.0523)             | Difference not significant |
| Lambada (OpenAI) | acc        | ↑   | 0.0000 ± 0.0000                  | 0.0000 ± 0.0000              | 0.0000                         | -                             | -                          |
| Lambada (OpenAI) | perplexity | ↓   | **2099002.0912 ± 170652.6222**   | 2287598.5548 ± 192724.6151   | 8.8951                         | (693139.8095; -315946.8823)   | Difference not significant |
| Lambada (Std)    | acc        | ↑   | 0.0000 ± 0.0000                  | 0.0000 ± 0.0000              | 0.0000                         | -                             | -                          |
| Lambada (Std)    | perplexity | ↓   | **13347273.6076 ± 1997894.6360** | 16223747.0588 ± 1503858.3054 | 21.5510                        | (7777717.0232; -2024770.1208) | Difference not significant |
| BLiMP            | acc        | ↑   | 0.5501 ± 0.0017                  | **0.5504 ± 0.0170**          | 0.0545                         | (0.0338; -0.0332)             | Difference not significant |

Apart from BLiMP, all benchmarks report results for the 'Low Density' dataset that are equal to or insignificantly worse than those for the 'Density' dataset.<br>
This implies that overly enforcing sampling from low-density clusters may hurt overall dataset representation and thus generalization abilities. The 'Density' dataset, with $\omega = 0.5$, is therefore the better choice of these two approaches for the dataset creation process.

### Idea 4 - Increasing $k$ for Clustering

Given the results of ideas 1 to 3, we can see that the reproduction in itself already is a strong improvement over the original MiniPile. It is such a strong improvement in fact, that the sampling and arrangement ideas seemingly only marginally improve the dataset over the reproduction, if at all. This implies that the clustering process itself should be investigated for potential improvements.

Revisiting the K-Means clustering process, it became evident that the number of clusters $k$ was chosen somewhat arbitrarily by the paper. While the Pile consists of $22$ data subsets and the paper sets $k = 220$ (ten clusters per subset), this doesn't mean that $k = 220$ is the optimal choice for the clustering process, as it neglects e.g. the potential for capturing finer group structures within subsets, which would in turn enable more detailed document selection.

To investigate if a higher number of clusters could lead to a more representative dataset, $k$ was increased from $220$ to $440$.
However, we can't couple the doubling of $k$ with the expectation that the number of excluded clusters doubles as well. Clusters may only be excluded if they meet the paper's original criteria. By these measures, a total of $70$ clusters were excluded from the dataset. Other parameters were kept unchanged.

For dataset distillation, the reproduction code with an increased $k$ value was used.

The dataset can be found here: [https://huggingface.co/datasets/Marcus2112/minipile_k440](https://huggingface.co/datasets/Marcus2112/minipile_k440)<br>
The trained model can be found here: [https://huggingface.co/Marcus2112/pythia-160m-minipile_k440](https://huggingface.co/Marcus2112/pythia-160m-minipile_k440)

The benchmark results are as follows, comparing the 'Density' dataset ($\omega = 0.5$, $k = 220$) against the dataset with $k = 440$, called 'k440':

| Benchmark        | Measure    |     | 160M Density                 | 160M k440                        | Percentage Difference of Means | 95% Confidence Interval       | Interpretation             |
| ---------------- | ---------- | --- | ---------------------------- | -------------------------------- | ------------------------------ | ----------------------------- | -------------------------- |
| ARC-Challenge    | acc        | ↑   | 0.1920 ± 0.0115              | **0.1971 ± 0.0116**              | 2.6563                         | (0.0371; -0.0269)             | Difference not significant |
| MMLU             | acc        | ↑   | 0.2295 ± 0.0035              | 0.2295 ± 0.0035                  | -                              | -                             | -                          |
| HellaSwag        | acc        | ↑   | 0.2604 ± 0.0044              | **0.2615 ± 0.0044**              | 0.4224                         | (0.0133; -0.0111)             | Difference not significant |
| WinoGrande       | acc        | ↑   | **0.5201 ± 0.0140**          | 0.5107 ± 0.0140                  | -1.8073                        | (0.0294; -0.0482)             | Difference not significant |
| Lambada (OpenAI) | acc        | ↑   | 0.0000 ± 0.0000              | 0.0000 ± 0.0000                  | -                              | -                             | -                          |
| Lambada (OpenAI) | perplexity | ↓   | 2099002.0912 ± 170652.6222   | **1854900.7910 ± 147593.4812**   | -11.6294                       | (198121.5825; -686324.1829)   | Difference not significant |
| Lambada (Std)    | acc        | ↑   | 0.0000 ± 0.0000              | 0.0000 ± 0.0000                  | -                              | -                             | -                          |
| Lambada (Std)    | perplexity | ↓   | 13347273.6076 ± 1997894.6360 | **11658172.4311 ± 1033012.4141** | -12.6550                       | (2719242.3669; -6097444.7199) | Difference not significant |
| BLiMP            | acc        | ↑   | **0.5501 ± 0.0017**          | 0.5466 ± 0.0017                  | -0.6362                        | (0.0012; -0.0082)             | Difference not significant |
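
The two derived columns can be recomputed from the reported values. The following is a minimal sketch, assuming that the ± values are standard errors, that the two runs are independent, and that the interval is a normal approximation with $z = 1.96$. These assumptions are not stated in the text, but they reproduce the ARC-Challenge and HellaSwag rows of the table above. `compare_means` is a hypothetical helper name.

```python
import math

def compare_means(mean_a, se_a, mean_b, se_b, z=1.96):
    """Compare two benchmark scores reported as mean ± standard error."""
    diff = mean_b - mean_a
    # Percentage difference relative to the first column; undefined for a zero mean.
    pct_diff = 100.0 * diff / mean_a if mean_a else float("nan")
    # Standard error of the difference, assuming independent runs.
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    # Reported as (upper; lower), matching the table layout.
    upper, lower = diff + z * se_diff, diff - z * se_diff
    # The difference is significant only if the interval excludes zero.
    significant = not (lower <= 0.0 <= upper)
    return pct_diff, (upper, lower), significant

# ARC-Challenge row: 160M Density vs. 160M k440
pct, (upper, lower), significant = compare_means(0.1920, 0.0115, 0.1971, 0.0116)
print(f"{pct:.4f} ({upper:.4f}; {lower:.4f}) significant={significant}")
```

Rows with a zero mean (the Lambada accuracies) have no defined percentage difference, which is why the sketch returns `nan` there.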

Density-based sampling with $\omega=0.5$ on $k=220$ clusters scores slightly worse on most benchmarks than random sampling across the finer $k=440$ clusters, but none of the differences are significant. The increased $k$ value therefore does not seem to hurt the dataset's generalization abilities. I conclude that the clustering process itself can be improved by increasing the number of clusters, but only ever so slightly. Given this insight and the fact that cluster selection is based on human judgement, and is therefore more error- and bias-prone, the increased $k$ alone should not be considered a reliable improvement over the original MiniPile.

It has to be noted that "MiniPile Density", compared to "MiniPile k440", contains $6.33\%$ fewer documents and takes up $\sim7.41\%$ less disk space.<br>
Given that "MiniPile Density" requires less human judgement, while providing a smaller dataset at comparable performance to "MiniPile k440", I conclude that overall "MiniPile Density" is the better choice of the two.

### Idea 5 - Increasing $k$ for Clustering with Density-based Sampling

Density-based sampling and the increased $k$ value for clustering have been combined to investigate if the two approaches could lead to a more representative dataset when combined. The increased $k$ value is set to $440$ and the density-based sampling approach uses $\omega = 0.5$.

The dataset can be found here: [https://huggingface.co/datasets/Marcus2112/minipile_k440_density-proportioned](https://huggingface.co/datasets/Marcus2112/minipile_k440_density-proportioned)<br>
The trained model is available here: [https://huggingface.co/Marcus2112/pythia-160m-minipile_k440_density-proportioned](https://huggingface.co/Marcus2112/pythia-160m-minipile_k440_density-proportioned)

The benchmark results are as follows, where $k = 220,\ \omega=0.5$ is called 'Density' and $k = 440,\ \omega = 0.5$ is called 'k440 Density':

| Benchmark        | Measure    |     | 160M Density                 | 160M k440 Density                | Percentage Difference of Means | 95% Confidence Interval       | Interpretation             |
| ---------------- | ---------- | --- | ---------------------------- | -------------------------------- | ------------------------------ | ----------------------------- | -------------------------- |
| ARC-Challenge    | acc        | ↑   | 0.1920 ± 0.0115              | **0.1928 ± 0.0115**              | 0.4167                         | (0.0327; -0.0311)             | Difference not significant |
| MMLU             | acc        | ↑   | 0.2295 ± 0.0035              | 0.2295 ± 0.0035                  | 0.0000                         | -                             | -                          |
| HellaSwag        | acc        | ↑   | **0.2604 ± 0.0044**          | 0.2603 ± 0.0044                  | -0.0384                        | (0.0121; -0.0123)             | Difference not significant |
| WinoGrande       | acc        | ↑   | **0.5201 ± 0.0140**          | 0.4941 ± 0.0141                  | -4.9990                        | (0.0129; -0.0649)             | Difference not significant |
| Lambada (OpenAI) | acc        | ↑   | 0.0000 ± 0.0000              | 0.0000 ± 0.0000                  | 0.0000                         | -                             | -                          |
| Lambada (OpenAI) | perplexity | ↓   | 2099002.0912 ± 170652.6222   | **2025523.7766 ± 164221.8893**   | -3.5006                        | (390719.6474; -537676.2766)   | Difference not significant |
| Lambada (Std)    | acc        | ↑   | 0.0000 ± 0.0000              | 0.0000 ± 0.0000                  | 0.0000                         | -                             | -                          |
| Lambada (Std)    | perplexity | ↓   | 13347273.6076 ± 1997894.6360 | **12959844.9407 ± 1160155.0647** | -2.9027                        | (4140783.3681; -4915640.7019) | Difference not significant |
| BLiMP            | acc        | ↑   | 0.5501 ± 0.0017              | **0.5520 ± 0.0017**              | 0.3454                         | (0.0066; -0.0028)             | Difference not significant |

Differences across benchmarks are not significant. The results are about equal to the 'Density' dataset, which implies that the increased $k$ value does not significantly improve the dataset when combined with density-based sampling. Given this result and the fact that higher $k$ implies more human judgement, the increased $k$ value should not be considered a reliable/preferable improvement over the density-based lower $k$ MiniPile, even though this newer dataset is smaller in size (by $\sim3.46\%$, equal in memory footprint to the original MiniPile).

- if an approach with a minimal human involvement is preferred, choose the 'Density' dataset
- if an approach with a more detailed cluster structure is preferred, reducing a little more in size while generally reaching the same performance, choose the 'k440 Density' dataset

### Idea 6 - Inter-Intra Cluster Sampling with $k=440$

Most of the above approaches consider clusters and their inner workings and topology. This proved fruitful, e.g. with the density-based sampling approach. We could extend this idea to also consider (at least primitive) inter-cluster relationships. Specifically, we could involve the following measures:
- Cluster density
- Cluster size
- Inter-cluster diversity (based on cosine distance of centroids)

Increasing $k$ to $k=440$ didn't prove to be too effective as such, yet it provides a more detailed cluster landscape, which could be leveraged more effectively through the inter-cluster diversity measure than would be possible for $k=220$ clusters.
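
A per-cluster diversity score based on centroid cosine distances could be sketched as follows. How the pairwise distances are aggregated is not specified here, so taking the mean distance to all other centroids is an assumption, as is the helper name `inter_cluster_diversity`; the random centroids are toy stand-ins for the actual K-Means centroids.

```python
import numpy as np

def inter_cluster_diversity(centroids: np.ndarray) -> np.ndarray:
    """Mean cosine distance of each centroid to all other centroids (assumed aggregation)."""
    # Normalize rows so that dot products equal cosine similarities.
    unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    cos_dist = 1.0 - unit @ unit.T        # (k, k) pairwise cosine distances
    np.fill_diagonal(cos_dist, 0.0)       # ignore each centroid's distance to itself
    return cos_dist.sum(axis=1) / (len(centroids) - 1)

# Toy stand-in for k = 440 centroids; the embedding dimension is arbitrary here.
rng = np.random.default_rng(0)
centroids = rng.normal(size=(440, 32))
delta = inter_cluster_diversity(centroids)
print(delta.shape, float(delta.min()), float(delta.max()))
```

With more centroids, each score averages over more neighbours, which is one way a finer cluster landscape could feed into such a measure.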

Adapting the formula from Idea 3, the cluster sample count calculation is based on the following score:
$$\text{Cluster Score} = \omega_s \cdot \frac{|C_i|}{|\bigcup_{j} C_j|} + \omega_d \cdot (1 - \rho(C_i)) + \omega_v \cdot \frac{\delta(C_i)}{\max_{j} \delta(C_j)}$$

Here, the additional $\delta(C_i)$ is the inter-cluster diversity score, and the single $\omega$ had to evolve into more distinct $\omega_s$, $\omega_d$, $\omega_v$.
As for their values, I elected to set all $\omega$ to $0.33$ to equally consider all three factors.<br>
This makes the comparison against the density-based sampling approach clearer: where the original approach gave equal weight ($0.5$) to size and density-adjusted sampling, the new approach maintains this balance by allocating equal thirds ($0.33$) across all factors, effectively preserving their relative influence. My idea is that if we maintain this balance of factors, we can see the impact of the inter-cluster diversity measure more clearly.
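
The score above can be sketched in a few lines. This is a minimal illustration, assuming that $\rho(C_i)$ is already available as a density value in $[0, 1]$ and $\delta(C_i)$ as a non-negative diversity value. The step from scores to sample counts (normalizing the scores and flooring against a document budget that acts as an upper bound) is also an assumption; `cluster_scores` and `allocate_samples` are hypothetical names, and the input values are made up.

```python
import numpy as np

def cluster_scores(sizes, density, diversity, w_s=0.33, w_d=0.33, w_v=0.33):
    """Weighted sum of size share, inverse density and normalized diversity."""
    sizes = np.asarray(sizes, dtype=float)
    density = np.asarray(density, dtype=float)
    diversity = np.asarray(diversity, dtype=float)
    size_term = sizes / sizes.sum()                # |C_i| / |U_j C_j|
    density_term = 1.0 - density                   # 1 - rho(C_i)
    diversity_term = diversity / diversity.max()   # delta(C_i) / max_j delta(C_j)
    return w_s * size_term + w_d * density_term + w_v * diversity_term

def allocate_samples(scores, sizes, budget):
    """Turn scores into per-cluster counts; `budget` is an upper bound (assumed scheme)."""
    share = scores / scores.sum()
    counts = np.floor(share * budget).astype(int)
    # A cluster cannot contribute more documents than it contains.
    return np.minimum(counts, np.asarray(sizes, dtype=int))

sizes = [5000, 3000, 2000]        # made-up cluster sizes
density = [0.9, 0.5, 0.2]         # made-up rho values
diversity = [0.4, 0.8, 0.6]       # made-up delta values
scores = cluster_scores(sizes, density, diversity)
counts = allocate_samples(scores, sizes, budget=1000)
print(scores, counts, counts.sum())
```

Passing `w_s=0.25, w_d=0.25, w_v=0.5` to `cluster_scores` gives a variant that weights diversity more heavily.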

The dataset can be found here: [Marcus2112/minipile_k440_inter-density-proportioned](https://huggingface.co/datasets/Marcus2112/minipile_k440_inter-density-proportioned)<br>
The trained model can be found here: [Marcus2112/pythia-160m-minipile_k440_inter-density-proportioned](https://huggingface.co/Marcus2112/pythia-160m-minipile_k440_inter-density-proportioned)

The benchmark results compared to "160M Density" are as follows:


### Idea 6.1 - Changing $\omega_v$

The results indicate that the inter-cluster diversity measure does not significantly improve the dataset over the density-based sampling approach, especially when considering that the density-based sampling dataset is smaller. However, the impact of inter-cluster diversity could not be determined conclusively either, as the benchmark results deviate only slightly from those of "160M k440 Density". To investigate the importance of inter-cluster diversity further, the effect of increasing $\omega_v$ should be examined. We set $\omega_s = 0.25$, $\omega_d = 0.25$, and $\omega_v = 0.5$.


### Idea 7 - Downsizing

Several approaches have been taken to try and improve knowledge retention and generalization capabilities of a MiniPile dataset. Factors affecting the dataset potential include:
- Selection of excluded clusters based on human judgement
- Clustering kept at $k=220$
- Cluster-wise density- and size-weighted sampling with $\omega=0.5$

How low can we now go with the dataset example count while still maintaining scores at least on par with the original MiniPile?

### Cancelled Ideas

The project's constraints for time and resource availability needed to be considered during the pursuit of different improvement ideas.
As MiniPile itself is designed purposefully to be applied in academic settings, ideally, its creation should be feasible within this same setting as well.
However, techniques like HDBSCAN clustering, which could potentially improve the clustering process, require more time and resources than could be deemed feasible for this project. The following ideas have been cancelled due to these constraints, but have been added in the `cancelled_ideas` folder and may be worth investigating in future projects:

- **HDBSCAN clustering:** Replacing K-Means with HDBSCAN could help in identifying complex-shaped clusters, as well as outliers
- **Lossi 3:** The loss-based sampling approach could have been improved further by using the proxy model in a second loss-based selection step. Where the cluster proportions were governed by losses already, the samples themselves could have been, too. For that, we would sample $1.5\times$ the proportion per cluster and measure each example's loss with the proxy. Then, from a list sorted by loss in descending order, we would pick the top as $0.8\times$ of the proportion and the bottom as $0.2\times$ of the proportion. This would have been a more fine-grained approach to the loss-based sampling idea, but loss itself was deemed not to be a suitable measure, because it tends to favor complexity over information content, even with the $0.2\times$ lower-complexity proportion.
- **Density-based sampling with $\omega = 0.25$:** This approach was discarded to save time. In effect, the cluster size-proportioned sampling approach provided a reference for an $\omega = 0.00$ setting already. More interest was therefore put into lower-density-based sampling, resulting in the tests with $\omega=0.75$.
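
The per-cluster selection step described for Lossi 3 could look roughly like this. It is a sketch only: the proxy-model losses are replaced by random stand-in values, and `select_by_loss` is a hypothetical helper, not part of the `cancelled_ideas` code.

```python
import numpy as np

def select_by_loss(candidate_ids, losses, proportion):
    """Pick `proportion` examples from oversampled candidates of one cluster."""
    candidate_ids = np.asarray(candidate_ids)
    if len(candidate_ids) < proportion:
        raise ValueError("need at least `proportion` candidates")
    order = np.argsort(losses)[::-1]            # candidate indices, highest loss first
    n_high = int(round(0.8 * proportion))       # share taken from the high-loss end
    n_low = proportion - n_high                 # share taken from the low-loss end
    high = order[:n_high]
    low = order[len(order) - n_low:]            # empty slice if n_low == 0
    return candidate_ids[np.concatenate([high, low])]

rng = np.random.default_rng(0)
proportion = 10
n_candidates = int(1.5 * proportion)            # oversample the cluster by 1.5x
ids = np.arange(n_candidates)
losses = rng.random(n_candidates)               # stand-in for proxy-model losses
picked = select_by_loss(ids, losses, proportion)
print(picked)
```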

### Theoretical Ideas

While the above ideas were cancelled during the implementation phase, others were put aside during brainstorming for similar reasons (feasibility within constraints).

#### Sparse Sampling of Embeddings with Similarity Hashing

This idea arose from the Lossi 3 cancellation and aims to improve the dataset creation process specifically without the need for a proxy model. It applies Locality Sensitive Hashing (LSH) to the clusters resulting from the K-Means clustering step. Per cluster, LSH would hash embeddings into buckets based on similarity, potentially capturing more detailed relationships between documents than would be possible with K-Means clusters alone. Note that I would keep K-Means in place and would not replace it with LSH, as K-Means is an overall low-cost grouping option that allows for human-guided filtering. Also, using LSH at cluster level may help with resource usage. (This choice can admittedly be seen as a shortcoming, as we don't leverage LSH's full potential of capturing relationships even across clusters, yet cluster-wise handling would be more likely to be feasible to run.)

Per cluster, LSH hashes embeddings into similarity-based buckets (e.g. $16$ buckets, arbitrarily). We'd attain $16$ sub-clusters per cluster by similarity and it is across these that we would sample from the cluster. Sampling could thus be more informed of inner-cluster topology. Still, sampling itself would not solely rely on LSH, but treat it as one of three configurably weighted factors:

- **Proportional Sampling:** As in Idea 1, we would sample documents proportional to cluster sizes, enabling capture of original dataset distribution.
- **Random Exploration:** Introduces 'total' inner-cluster randomness to capture some potentially unique or median examples.
- **Semantic Diversity Sampling:** Prioritizes semantically diverse examples using LSH-based similarity groups, aiming at capturing more detailed relationships between documents.
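
The bucketing and the diversity-oriented part of the sampling could be sketched as below. This uses random-hyperplane (sign-based) LSH written directly in NumPy, which is one LSH variant among several and an assumption on my part; $4$ hyperplanes yield up to $2^4 = 16$ buckets. The round-robin draw across buckets, the helper names and the toy embeddings are likewise illustrative, and the weighting against the proportional and random factors is left out.

```python
import numpy as np

def lsh_buckets(embeddings, n_bits=4, seed=0):
    """Random-hyperplane LSH: n_bits sign bits give up to 2**n_bits buckets."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(embeddings.shape[1], n_bits))
    bits = (embeddings @ planes > 0).astype(int)     # (n, n_bits) sign pattern
    return bits @ (2 ** np.arange(n_bits))           # bucket id in [0, 2**n_bits)

def sample_across_buckets(bucket_ids, n_samples, seed=0):
    """Draw round-robin over buckets so that small buckets are still represented."""
    rng = np.random.default_rng(seed)
    pools = [rng.permutation(np.flatnonzero(bucket_ids == b)).tolist()
             for b in np.unique(bucket_ids)]
    picked = []
    while len(picked) < n_samples and any(pools):
        for pool in pools:
            if pool and len(picked) < n_samples:
                picked.append(pool.pop())
    return np.array(picked)

# Toy stand-in for the embeddings of one K-Means cluster.
rng = np.random.default_rng(1)
cluster_embeddings = rng.normal(size=(200, 8))
buckets = lsh_buckets(cluster_embeddings)
chosen = sample_across_buckets(buckets, n_samples=50)
print(np.bincount(buckets, minlength=16), len(chosen))
```

Drawing evenly across buckets favours sparsely populated similarity groups, which is the intended contrast to purely random draws within the cluster.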

The rest of the assembly would match the implemented ideas.<br>
I've read that the Python library `faiss` is praised for being efficient at LSH-based similarity hashing. Maybe this would be a good starting point.<br>
These are resources I stumbled across while researching LSH and its applications that might be of interest:
- https://www.pinecone.io/learn/series/faiss/locality-sensitive-hashing/
- https://arxiv.org/pdf/2208.05648
- https://arxiv.org/pdf/1408.2927
- https://mediatum.ub.tum.de/doc/1655492/ptz317jfxiatpjlwxrpcuimu1.2022-11-01_einreichung_mediatum_titelblatt_neu.pdf (although not as directly related I fear)

#### Double-Proxied Cross-Entropy-based Sampling

This idea would be closely aligned with the findings of [Extracting representative subset from extensive text data for training pre-trained language models (Suzuki, et al. 2023)](https://www.sciencedirect.com/science/article/pii/S0306457322003508). The authors claim that "[...] the representative subset obtained using [a] likelihood difference score can achieve the 90% performance level even when the size of the dataset is reduced to approximately two to three orders of magnitude smaller than the original dataset."

The approach scores and ranks data samples based on the likelihood difference between two differently pretrained language models (PreLMs):
1. An in-domain PreLM, trained on domain-specific data, i.e. a conceptually interrelated subset of the larger dataset
2. A non-domain PreLM, trained on a general domain, i.e. the full dataset

For each document $X$, the likelihood difference score $S(X)$ is calculated as:
$$S(X)=\overline{H}^{(I)}(X) - \overline{H}^{(N)}(X)$$
$\overline{H}^{(I)}$ and $\overline{H}^{(N)}$ are the normalized per-word cross-entropy losses from in-domain PreLM and non-domain PreLM, respectively.

Essentially, we would:
1. Calculate $S(X)$ for all documents in the dataset.
2. Rank documents in ascending order of $S(X)$.
3. Deduplicate redundant samples (I'd argue this is not necessary for The Pile Deduplicated).
4. Select the top-$K$ ranked samples as the representative dataset (RepSet).
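
Steps 1, 2 and 4 can be sketched as follows, assuming the per-token log-probabilities of each document under both PreLMs have already been computed. The sketch normalizes per token rather than per word, uses made-up log-probabilities, and all function names are hypothetical; it is not taken from the paper's code.

```python
import numpy as np

def normalized_cross_entropy(token_log_probs):
    """Mean negative log-likelihood per token of one document."""
    return -float(np.mean(token_log_probs))

def likelihood_difference(in_domain_log_probs, non_domain_log_probs):
    """S(X) = H_I(X) - H_N(X) for one document."""
    return (normalized_cross_entropy(in_domain_log_probs)
            - normalized_cross_entropy(non_domain_log_probs))

def select_repset(scores, top_k):
    """Rank ascending by S(X) and keep the indices of the first top_k documents."""
    return np.argsort(scores, kind="stable")[:top_k]

# Made-up log-probabilities: (in-domain PreLM, non-domain PreLM) per document.
docs = [
    ([-1.0, -1.0], [-2.0, -2.0]),
    ([-3.0, -3.0], [-1.0, -1.0]),
    ([-2.0], [-2.0]),
]
scores = [likelihood_difference(i, n) for i, n in docs]
repset = select_repset(scores, top_k=2)
print(scores, repset)
```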

It would be interesting to follow up on the paper's claims, but I fear this is a resource-intensive and ultimately still loss-based approach, overrepresenting complicated rather than representative/informative documents in the final assembly. Further, this approach poses a catch-22: in order to train more cost-effectively, we first have to train the non-domain PreLM on the full dataset. For a constrained academic setting, this might not be possible, even if the proxy model size is downscaled.

#### Semantic Deduplication as Post-Processing for the distilled dataset

An idea for a post-processing step would be to analyze the now smaller subset for semantic duplicates. I'd suggest the works of [SemDeDup: Data-efficient learning at web-scale through semantic deduplication (Abbas, et al. 2023)](https://arxiv.org/pdf/2303.09540) for this semantic deduplication. It is claimed that for the LAION dataset, "[...] SemDeDup can remove 50% of the data with minimal performance loss". This could be a valuable step to further reduce the dataset size without losing performance.

While this idea is interesting, I was hesitant to allocate resources to it, mostly because reducing the distillate solely from the perspective of semantic similarity might remove syntactically informative, yet semantically similar examples. This could lead to a loss of diversity in the dataset, which is not necessarily desirable. However, testing could be done relatively flexibly, as this can be put into action post-assembly.
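
To illustrate what such a post-processing step might look like, here is a simplified, greedy threshold-based sketch on a small set of embeddings. It does not reproduce the SemDeDup procedure in detail; the threshold `epsilon`, the helper name `semantic_dedup` and the toy embeddings are assumptions for illustration.

```python
import numpy as np

def semantic_dedup(embeddings, epsilon=0.1):
    """Greedily keep an item unless its cosine similarity to a kept item exceeds 1 - epsilon."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i in range(len(unit)):
        # Compare against everything kept so far; the first item is always kept.
        if not kept or np.max(unit[kept] @ unit[i]) <= 1.0 - epsilon:
            kept.append(i)
    return np.array(kept)

# Toy embeddings: the second row is a near-duplicate of the first.
embeddings = np.array([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]])
kept = semantic_dedup(embeddings)
print(kept)
```

Running this per K-Means cluster rather than over the whole distillate would keep the pairwise comparisons manageable, and a larger `epsilon` removes more aggressively.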

---

## Evaluate Pythia $1.4\text{B}$ Pile Deduplicated vs. Pythia $1.4\text{B}$ SuperMiniPile