# Applying the MiniPile Pipeline to RefinedWeb

**Objectives:**
- [.] Analyze the RefinedWeb dataset
- [] Adapt the SuperMiniPile pipeline to RefinedWeb, aiming for an equally performant yet smaller dataset (MiniRefinedWeb)
- [] Train Pythia $160\text{M}$ on RefinedWeb and MiniRefinedWeb, evaluate on MMLU and ARC-Challenge
- [] Train Pythia $1.4\text{B}$ with MiniRefinedWeb, evaluate pipeline performance on the MMLU and ARC benchmarks

The MiniPile paper doesn't specify whether documents were embedded in full or truncated.<br>
Since E5-Large has a maximum sequence length (512 tokens in the setup below), we have to truncate anyway, but we could truncate earlier to save compute.
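
To judge what an earlier cutoff would cost, one could measure how many documents get cut and what share of tokens survives a given limit. The sketch below is illustrative only: `truncation_stats` is a hypothetical helper, and the token counts are made-up numbers, not RefinedWeb measurements.

```python
def truncation_stats(token_counts, max_length):
    """Return (share of documents truncated, share of tokens kept) for a cutoff."""
    total_tokens = sum(token_counts)
    if not token_counts or total_tokens == 0:
        raise ValueError("token_counts must contain at least one non-empty document")
    truncated_docs = sum(1 for n in token_counts if n > max_length)
    kept_tokens = sum(min(n, max_length) for n in token_counts)
    return truncated_docs / len(token_counts), kept_tokens / total_tokens


# Hypothetical per-document token counts, for illustration only
example_counts = [120, 300, 512, 900, 2048]
for limit in (128, 256, 512):
    docs_cut, tokens_kept = truncation_stats(example_counts, limit)
    print(f"max_length={limit}: {docs_cut:.0%} of docs truncated, {tokens_kept:.0%} of tokens kept")
```

Fed with real token counts from a small sample, the same function would show whether a shorter limit discards a meaningful share of the text.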

I first run the sampling pipeline on 10% of the data (1-2 days) and use this sample to:
- test the clustering pipeline
- validate the entire workflow
- start developing the cluster analysis tools

While this runs, I continue with the objectives.
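
One possible way to draw a reproducible 10% sample from a stream is to hash a stable key per document and keep the documents whose hash falls below a threshold. This is a minimal sketch, not the sampling method actually used here: `in_sample` and `sample_stream` are hypothetical helpers, and using the `url` field as the key is an assumption.

```python
import hashlib


def in_sample(key, fraction=0.10):
    """Deterministically decide whether a document key falls into the sample."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64) and compare with the threshold
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(fraction * 2**64)


def sample_stream(examples, fraction=0.10, key_field="url"):
    """Yield only the examples whose key hashes into the sample."""
    for example in examples:
        if in_sample(example[key_field], fraction):
            yield example


# Toy stream standing in for the real dataset iterator
toy_stream = ({"url": f"https://example.com/page/{i}", "content": "placeholder"} for i in range(10_000))
sample = list(sample_stream(toy_stream))
print(f"kept {len(sample)} of 10000 documents")
```

Because the decision depends only on the key, a restarted run selects the same documents. Note that this still reads the whole stream, so it saves embedding compute but not bandwidth.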

---

# Analyze the RefinedWeb dataset

**By no means** can we download this dataset the way we did with "The Pile Deduplicated". We will have to stream it from the cloud at all times.<br>
This makes network latency and bandwidth bottlenecks for every pass over the data.
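
A common way to hide some of that latency is to read ahead in a background thread while the main thread does the compute. The following is a rough sketch under that assumption; `prefetch` is a hypothetical helper, and whether it helps in practice depends on how much of the time is actually spent waiting on the network.

```python
import queue
import threading


def prefetch(iterable, buffer_size=64):
    """Iterate over `iterable` in a background thread, buffering up to `buffer_size` items."""
    buffer = queue.Queue(maxsize=buffer_size)

    def producer():
        try:
            for item in iterable:
                buffer.put(("item", item))
            buffer.put(("done", None))
        except Exception as exc:
            # Hand the error to the consumer instead of losing it in the thread
            buffer.put(("error", exc))

    # Daemon thread: it will not keep the process alive if the consumer stops early
    threading.Thread(target=producer, daemon=True).start()
    while True:
        kind, payload = buffer.get()
        if kind == "done":
            return
        if kind == "error":
            raise payload
        yield payload


# Toy usage; in the notebook the argument would be the streamed dataset iterator
for doc_id in prefetch(range(5), buffer_size=2):
    print(doc_id)
```

The bounded queue keeps memory use predictable: the producer blocks once `buffer_size` items are waiting.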


In [1]:
# tiiuae/falcon-refinedweb

from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse

import numpy as np
import torch
from datasets import load_dataset
from sklearn.cluster import MiniBatchKMeans
from tqdm import tqdm
from transformers import AutoModel, AutoTokenizer

N_CLUSTERS = 220
KMEANS_BATCH = 16384

# Stream the dataset instead of downloading it, and load the E5-Large encoder
dataset = load_dataset("tiiuae/falcon-refinedweb", streaming=True)
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-large")
model = AutoModel.from_pretrained("intfloat/e5-large")

def get_embedding(text):
    """Embed one document: mean-pool E5-Large's last hidden state over the first 512 tokens."""
    # The E5 model card recommends prefixing inputs with "query: " or "passage: "; not done here.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0).numpy()

# Initialize variables for analysis
total_tokens = 0
total_documents = 0
content_types = Counter()
embedding_sum = np.zeros(1024)  # E5-Large embedding size
kmeans = MiniBatchKMeans(n_clusters=N_CLUSTERS, batch_size=KMEANS_BATCH)
buffer = []            # embeddings waiting for the next partial_fit
clusters_seen = set()  # cluster ids assigned to at least one document

def flush():
    """Update k-means with the buffered embeddings and record their cluster ids."""
    kmeans.partial_fit(np.stack(buffer))
    clusters_seen.update(kmeans.labels_.tolist())  # labels_ covers only the last batch
    buffer.clear()

# Analyze the dataset
for example in tqdm(dataset["train"]):
    # Count tokens (untruncated) and documents
    total_tokens += len(tokenizer.tokenize(example['content']))
    total_documents += 1

    # Use the file extension of the URL path as a rough content-type proxy
    content_type = PurePosixPath(urlparse(example['url']).path).suffix.lower() or "(none)"
    content_types[content_type] += 1

    # Get embedding and buffer it for clustering
    embedding = get_embedding(example['content'])
    embedding_sum += embedding
    buffer.append(embedding)
    # The first partial_fit needs at least N_CLUSTERS samples, so feed k-means full batches
    if len(buffer) >= KMEANS_BATCH:
        flush()

    # Print interim results every 10000 documents
    if total_documents % 10000 == 0:
        print(f"\nInterim results after {total_documents} documents:")
        print(f"Average tokens per document: {total_tokens / total_documents:.2f}")
        print(f"Top 5 content types: {content_types.most_common(5)}")
        print(f"Non-empty clusters so far: {len(clusters_seen)}")

# Fit the leftover embeddings (skipped if k-means never got enough samples for a first fit)
if buffer and (hasattr(kmeans, "cluster_centers_") or len(buffer) >= N_CLUSTERS):
    flush()

# Final results
print("\nFinal results:")
print(f"Total documents processed: {total_documents}")
print(f"Average tokens per document: {total_tokens / total_documents:.2f}")
print(f"Top 10 content types: {content_types.most_common(10)}")
print(f"Non-empty clusters: {len(clusters_seen)}")

# Calculate and print average embedding
avg_embedding = embedding_sum / total_documents
print(f"Average embedding (first 10 dimensions): {avg_embedding[:10]}")