# Applying the MiniPile Pipeline to RefinedWeb

**Objectives:**
- [.] Analyze the RefinedWeb dataset
- [] Adapt the SuperMiniPile pipeline to RefinedWeb, aiming to create an equally performant yet smaller dataset (MiniRefinedWeb)
- [] Train Pythia $160\text{M}$ on RefinedWeb and MiniRefinedWeb, evaluate on MMLU and ARC-Challenge
- [] Train Pythia $1.4\text{B}$ with MiniRefinedWeb, evaluate pipeline performance on the MMLU and ARC benchmarks

---

## Analyze the RefinedWeb dataset

Unlike The Pile, RefinedWeb has no clearly assembled structure.<br>
Where The Pile combines a variety of data subsets, RefinedWeb is a collection of web pages that have been refined by:
- discarding irrelevant pages 
- applying a set of heuristics to remove low-quality content and 
- applying deduplication techniques to further remove near-duplicate pages

RefinedWeb consists of $968,000,015$ rows, amounting to $1.68\text{TB}$ of data.<br>
Each entry consists of the following columns:
- `content`: The textually representable content of the page
- `url`: The URL of the page
- `timestamp`: The timestamp of the page's last update
- `dump`: Which dump the page was found in (referring to CommonCrawl dumps)
- `segment`: The segment of the dump this page was found in
- `image_urls`: URLs of images found on this page

For our purposes, we will focus on the `content` column.<br>
We have to assume content is as varied as the web, making it hard to cluster or categorize.<br>
The key insight from MiniPile is that semantic embeddings can express document relationships at some cluster resolution.<br>
RefinedWeb, with its 968 million web pages, requires a methodical approach to determine an optimal number of clusters. Ideally, that method should also be applicable to other unstructured, maybe even larger datasets.

## The katch

Approaching the issue from first principles, in order to find a reasonable $k$, I propose this staged approach:
- Randomly sample $n$ document indices from RefinedWeb, where $n$ is a **representative** fraction (say $1.5\%$ to $2\%$) of the dataset
- On this subset, perform silhouette analysis for $k$ in a range of $[100, 500]$ (or higher and lower, this is me winging it) in steps of $50$
- Plot the silhouette scores and pick the $k$ with the highest score (or, if the scores plateau, the elbow of the curve)
- Use this $k$ to cluster the entire (then embedded) dataset
- Continue with the MiniPile pipeline from there.
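
The staged approach above could be sketched roughly as follows. This is a minimal sketch under loud assumptions: the embeddings are synthetic stand-ins generated with `make_blobs`, `MiniBatchKMeans` is swapped in for whatever k-Means variant the pipeline ends up using, and the helper names `sample_size` and `sweep_silhouette`, the toy $k$ range and the 50% toy sampling fraction are all hypothetical. It picks the $k$ with the highest mean silhouette score. The `sample_size` argument of `silhouette_score` matters at scale, since an exact silhouette needs pairwise distances over the whole subset.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

TOTAL_ROWS = 968_000_015  # RefinedWeb row count as stated above


def sample_size(total_rows: int, fraction: float) -> int:
    """Number of documents in a random sample covering `fraction` of the data."""
    return int(total_rows * fraction)


def sweep_silhouette(embeddings, k_values, seed=0):
    """Cluster with each k and return a {k: mean silhouette score} mapping."""
    scores = {}
    for k in k_values:
        km = MiniBatchKMeans(n_clusters=k, batch_size=1024, n_init=3, random_state=seed)
        labels = km.fit_predict(embeddings)
        # sample_size caps the quadratic pairwise-distance cost of the score.
        scores[k] = silhouette_score(
            embeddings, labels, sample_size=1000, random_state=seed
        )
    return scores


# Stand-in for embedded documents (assumption: synthetic, NOT RefinedWeb data).
X, _ = make_blobs(n_samples=3000, n_features=32, centers=8, random_state=0)

# Step 1: randomly sample document indices (50% of the toy data here).
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=sample_size(len(X), 0.5), replace=False)

# Steps 2 and 3: sweep k on the subset and pick the best-scoring k.
scores = sweep_silhouette(X[idx], k_values=range(4, 13, 2))
best_k = max(scores, key=scores.get)

# Sample sizes that 1.5% and 2% would imply for the full dataset.
n_low = sample_size(TOTAL_ROWS, 0.015)
n_high = sample_size(TOTAL_ROWS, 0.02)
print(f"1.5% -> {n_low:,} docs, 2% -> {n_high:,} docs")
print(f"silhouette scores: {scores}, chosen k: {best_k}")
```

For the real dataset, $1.5\%$ to $2\%$ corresponds to roughly $14.5$ to $19.4$ million documents, so both the embedding step and the sweep over $k$ would have to run in batches.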

## Ditching k-Means

Taking a step back, we need some form of preprocessing step to find $k$ for MiniPile's k-Means clustering. Why not drop k-Means altogether and use a more flexible clustering algorithm? I have to admit, my earlier attempt at using HDBSCAN on The Pile thoroughly scared me off.
HDBSCAN would lift the need for $k$, but it can't be batched, so it scales horribly.

However, dropping the need for a $k$ would let us generalize MiniPile's pipeline more reasonably across datasets, even ones whose content structure we don't know beforehand.<br>
I consulted the literature, ideally looking for a clustering algorithm that is applicable to large-scale, high-dimensional data.

Specifically, I found the following papers:
- [Using Projection-Based Clustering to Find Distance- and Density-Based Clusters in High-Dimensional Data (Thrun, M. C. & Ultsch, Alfred. 2020)](https://link.springer.com/article/10.1007/s00357-020-09373-2)
- [An Efficient Density-based Clustering Algorithm for Higher-Dimensional Data (Boonchoo, et al. 2018)](https://arxiv.org/pdf/1801.06965)
- [Swarm Intelligence for Self-Organized Clustering (Thrun, M. C. & Ultsch, Alfred. 2021)](https://arxiv.org/abs/2106.05521)
- [DPM: Fast and scalable clustering algorithm for large scale high dimensional datasets (Ghanem, et al. 2014)](https://ieeexplore.ieee.org/document/7050427)
- [K-DBSCAN: Identifying Spatial Clusters with Differing Density Levels](https://ieeexplore.ieee.org/document/7544972/)
- [FINDIT: a fast and intelligent subspace clustering algorithm using dimension voting (Woo, et al. 2004)](https://www.sciencedirect.com/science/article/abs/pii/S0950584903001411?via%3Dihub)
- [TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs (Dhulipala, et al. 2023)](https://arxiv.org/abs/2308.03578)
- [A simple rapid sample-based clustering for large-scale data (Chen, et al. 2024)](https://www.sciencedirect.com/science/article/abs/pii/S0952197624007097)


An algorithm that by design could cluster larger datasets without a predefined $k$ is BIRCH. Papers on BIRCH and related approaches:
- [Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching (McCallum, et al. 2000)](https://pubs.dbs.uni-leipzig.de/dc/files/McCallum2000Efficientclusteringofhighdimensionaldatasetswith.pdf)
- [Improve BIRCH algorithm for big data clustering (Ramadhani, et al. 2020)](https://iopscience.iop.org/article/10.1088/1757-899X/725/1/012090)
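
To make the appeal of BIRCH concrete, here is a minimal sketch using scikit-learn's `Birch` with `partial_fit`, which consumes the data batch by batch. The data is synthetic, and the `threshold` and `branching_factor` values are assumptions chosen for the toy data, not tuned for RefinedWeb embeddings. With `n_clusters=None` the final global clustering step is skipped, so no $k$ is supplied. The number of resulting subclusters is then governed by `threshold`, which means the tuning problem moves to that parameter rather than disappearing.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Stand-in for document embeddings (assumption: synthetic toy data).
X, _ = make_blobs(
    n_samples=2000, n_features=16, centers=5, cluster_std=0.5, random_state=0
)

# n_clusters=None: return the CF-tree subclusters as-is, without a given k.
# threshold bounds the radius of each subcluster (toy value, an assumption).
birch = Birch(threshold=4.0, branching_factor=50, n_clusters=None)

# Feed the data in batches, as one would for a dataset too large for memory.
for batch in np.array_split(X, 10):
    birch.partial_fit(batch)

labels = birch.predict(X)
n_found = len(birch.subcluster_centers_)
print(f"subclusters found: {n_found}")
print(f"cluster sizes: {np.bincount(labels)}")
```

Whether a single `threshold` holds up across the varying densities of real web-text embeddings is an open question that this sketch does not answer.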