


**URL filtering (llama 3.1)**  
"URLs are first checked against a URL blacklist (for example, adult websites) and then passed through the URL classification and filtering system. To let everyone work more efficiently, URLs from sources like Arxiv and Wikipedia are exempted from these restrictions."
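
A minimal sketch of this kind of URL filter, assuming a blocklist, an allowlist for trusted sources, and an optional classifier callable. The domain lists, the function names, and the order of the checks are illustrative assumptions, not the Llama 3.1 implementation.

```python
from urllib.parse import urlparse

# Hypothetical lists for illustration; real pipelines use far larger ones.
BLOCKED_DOMAINS = {"adult-example.com", "spam-example.net"}
TRUSTED_DOMAINS = {"arxiv.org", "wikipedia.org"}


def _matches(host: str, domains: set[str]) -> bool:
    """True if host equals a listed domain or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def keep_url(url: str, classifier=None) -> bool:
    """Decide whether a URL should stay in the crawl."""
    host = (urlparse(url).hostname or "").lower()
    if not host:
        return False  # malformed URL, nothing to fetch
    if _matches(host, TRUSTED_DOMAINS):
        return True  # trusted sources skip the remaining checks
    if _matches(host, BLOCKED_DOMAINS):
        return False
    # Optional URL classifier: any callable that returns True for bad URLs.
    if classifier is not None and classifier(url):
        return False
    return True
```

Checking the allowlist first means a trusted source is never dropped by an over-eager classifier, which is one way to read the exemption described above.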

**Content extraction**  
"After a URL passes the blacklist, the text content of the page is extracted and the boilerplate from the source page is filtered out. trafilatura is used to extract the main content, and its usage is relatively simple."

**Language Identification**  
1. "Used FastText to train a language identification model, with a score threshold of 0.65 for keeping a text."  
2. "Used langid for language identification."  
3. "Used data_juicer/ops/filter/language_id_score_filter.py for language identification."

---

### Translation into English

#### Quality Verification
- **Data Quality Verification**: First, remove data with obvious quality problems and duplicate content, such as spam (e.g., promotional ads). Then, if necessary, filter further by length and remove long texts that show machine-generated characteristics such as repetitive text.
- **Rule-Based Verification**: Subsets of the data can be filtered with rules or left unfiltered, depending on the need. For academic or multilingual data, for example, the following rules are mainly used in practice:
  a. Match the text against a dictionary of academic vocabulary.
  b. Match the text against a dictionary built from good-quality user-generated text.
  c. Normalize the text with simple preprocessing such as tokenization, stemming, and stop-word removal.
  d. Remove texts whose word count is less than 10.
- **Using Mainstream Pretrained Models for Verification**: This step relies on good-quality pretrained models for filtering.
  **Specific Steps**:
  a. Use LLaMA to judge whether the text is natural language, and filter out data that is not.
  b. Data comes in more than one length and type (http links, for example), and each type can be filtered separately. Languages are compared as well, for example to keep English-language academic data.
  c. Use perplexity to filter academic text.
  d. Convert the text into the format the model expects so that the model can be used for inference.
- **Final Verification**: After verification with pretrained models, the data is marked and used for inference and filtering.

---


This text is about preparing and cleaning data to fine-tune a language model like LLaMA (a type of AI model used for understanding and generating text). Fine-tuning means improving the model by training it on specific, high-quality data. Here's a breakdown of what they're doing:

#### 1. **Data Quality Verification (Cleaning the Data)**
- **What They're Doing**: They start by removing bad or useless data, like spam (e.g., ads) or duplicate content.
- **Why**: Bad data can confuse the model and make it perform poorly.
- **How**:
  - They filter out text that’s too long or repetitive (e.g., machine-generated text that keeps repeating the same thing).
  - They check the length of the text and remove anything that doesn’t fit their needs.

#### 2. **Rule-Based Verification (Filtering with Rules)**
- **What They're Doing**: They apply specific rules to filter the data further, especially if the data is for academic use or involves multiple languages.
- **Why**: They want to make sure the data is relevant and useful for the model.
- **How**:
  a. They use a dictionary of academic words to filter data (e.g., keeping only academic-related text).
  b. They use a dictionary of user-generated text that’s known to be good quality.
  c. They process the text by breaking it into smaller pieces (tokenization), simplifying words (stemming), and removing unimportant words (stop words like "the" or "and").
  d. They remove very short texts (less than 10 words) because they might not be useful.
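
The length and repetition rules can be sketched as below. The 10-word minimum comes from the text above; the duplicate-line threshold of 0.3 and all function names are assumptions chosen for illustration.

```python
def too_short(text: str, min_words: int = 10) -> bool:
    """True if the text has fewer than min_words whitespace-separated words."""
    return len(text.split()) < min_words


def duplicate_line_fraction(text: str) -> float:
    """Fraction of non-empty lines that repeat an earlier line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return 1.0 - len(set(lines)) / len(lines)


def passes_rules(text: str, min_words: int = 10,
                 max_dup_fraction: float = 0.3) -> bool:
    """Keep a text only if it is long enough and not dominated by repeats."""
    if too_short(text, min_words):
        return False
    if duplicate_line_fraction(text) > max_dup_fraction:
        return False
    return True
```

For example, a page made of the same sentence repeated five times has a duplicate-line fraction of 0.8 and is dropped, while a single ordinary sentence of 13 words passes.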

#### 3. **Using Pretrained Models for Verification (Advanced Filtering)**
- **What They're Doing**: They use a pretrained model (like LLaMA) to check the quality of the data.
- **Why**: Pretrained models can better understand language and help filter out bad data.
- **How**:
  a. They use LLaMA to identify natural language patterns and filter out data that doesn’t fit.
  b. They check the type of data (e.g., removing links like "http") and compare languages (e.g., keeping only English academic data).
  c. They use a concept called "perplexity" (a measure of how predictable the text is to a language model) to filter academic data. Lower perplexity means the model finds the text more predictable, which is treated as a sign of fluent, better-quality text.
  d. They convert the text into a format the model can understand and use it for further filtering.
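
Perplexity is the exponential of the average negative log-probability per token, $\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(t_i)\right)$. The sketch below computes it with a tiny add-one-smoothed unigram model as a stand-in for LLaMA. The unigram model and the function names are assumptions made so the example runs without any model weights.

```python
import math
from collections import Counter


def train_unigram(corpus_tokens):
    """Return a log-probability function with add-one smoothing."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra bucket for unseen tokens

    def log_prob(token):
        return math.log((counts.get(token, 0) + 1) / (total + vocab))

    return log_prob


def perplexity(tokens, log_prob):
    """exp of the mean negative log-probability; inf for empty input."""
    if not tokens:
        return float("inf")
    avg_nll = -sum(log_prob(t) for t in tokens) / len(tokens)
    return math.exp(avg_nll)


reference = "the model reads the text and the model predicts the next word".split()
lp = train_unigram(reference)
fluent = perplexity("the model predicts the text".split(), lp)
noisy = perplexity("zxq qqq vvv http www".split(), lp)
```

Here `fluent` is lower than `noisy` because every token in the second text is unseen. A filter would keep documents whose perplexity falls below a threshold tuned on held-out data.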

#### 4. **Final Verification**
- **What They're Doing**: After all the filtering, they use the cleaned data to fine-tune the model.
- **Why**: They want to make sure only the best data is used to improve the model.
- **How**: They double-check the data with the pretrained model to ensure it’s high quality before using it.

---

### Key Concepts to Understand
1. **Data Cleaning**: Removing bad or irrelevant data (like spam or duplicates) to make sure the model learns from good data.
2. **Filtering with Rules**: Using specific rules (like word length or academic dictionaries) to keep only the data you want.
3. **Pretrained Models**: Models like LLaMA that are already trained on a lot of data and can help filter or process new data.
4. **Perplexity**: A way to measure how "confusing" a piece of text is to the model. Lower perplexity means the text is easier for the model to understand.
5. **Fine-Tuning**: Training a model on specific data to make it better at a particular task (e.g., understanding academic texts).

---

## 🔍 Model Scoring (模型打分)

### Summary:

Using models to score the **quality of pretraining data** has become a standard method for data cleaning. The technical reports for models such as **Llama 3** and **Qwen2** both mention it. In this context:

* **BERT-based models** are especially suitable because they are **encoder-based (transformer) models** that are good at representing text.
* These BERT-type models can be used to **score the quality of data**, or we can even use **stronger instruction-tuned models** like GPT-4o to do it.

### How it works:

The idea is to train a classifier that distinguishes between **high-quality** and **low-quality** training data. Here’s how:

* You **manually select a set of high-quality samples** and **low-quality samples** as labeled data.
* These labeled samples are used to train a model (like BERT) to **predict the quality** of other unlabelled data.

Training a scoring model is not about reaching 100% accuracy — the goal is **practical separation**. Even if the classifier is not perfect, as long as it separates high and low-quality data reasonably well, it’s useful.
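
As a rough illustration of the idea, the sketch below trains a tiny Naive Bayes classifier on hand-labeled examples and scores new text. Naive Bayes is swapped in here only so the example runs with the standard library; it is not BERT or FastText, and the class name and sample sentences are made up for illustration.

```python
import math
from collections import Counter


class NaiveBayesQuality:
    """Bag-of-words Naive Bayes: label 1 = high quality, 0 = low quality."""

    def fit(self, texts, labels):
        self.word_counts = {0: Counter(), 1: Counter()}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = set(self.word_counts[0]) | set(self.word_counts[1])
        return self

    def _log_likelihood(self, tokens, label):
        counts = self.word_counts[label]
        total = sum(counts.values()) + len(self.vocab)  # add-one smoothing
        prior = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        return prior + sum(math.log((counts[t] + 1) / total) for t in tokens)

    def score(self, text):
        """Probability-like score in (0, 1) that the text is high quality."""
        tokens = [t for t in text.lower().split() if t in self.vocab]
        hi = self._log_likelihood(tokens, 1)
        lo = self._log_likelihood(tokens, 0)
        return 1.0 / (1.0 + math.exp(min(lo - hi, 700.0)))


high = ["the theorem follows from the lemma by induction",
        "we prove the bound using a standard argument"]
low = ["click here to win free prizes now",
       "buy cheap pills free shipping click now"]
model = NaiveBayesQuality().fit(high + low, [1, 1, 0, 0])
```

The score can then be used either as a hard filter (drop everything under a threshold) or as a ranking signal, in line with the point that practical separation matters more than accuracy.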

Some notes:

* Even if your scoring model isn’t super strong, it’s better to have one than none.
* For example, if you spend a month training a scoring model, it could be more effective than manually cleaning 32K samples.
* Suppose you have 32K total samples — 4K low quality and 28K high quality. You can use the model to filter or re-rank the data.

### Recommended tools:

* **Bert-base** or **FastText** are commonly used for this task.
* You just need to fine-tune them slightly.
* You can refer to Deepseek-Math’s "8.1 Technical Report" for more details.

---

## ⛓ Flowchart Explanation (Bottom Image):

This shows the **pipeline to collect and score math-related data** from the Common Crawl:

1. **Train a FastText Model**

   * Use seed data (math-related samples) to train a FastText classifier.

2. **Recall Math-Related Webpages From Common Crawl**

   * Use the trained model to find math-related webpages from a massive dataset (Common Crawl: 40B HTML pages).

3. **Discover Math-Related Domains**

   * From those pages, identify domains that are especially math-heavy.

4. **Annotate Math-Related URL Path From Labelers**

   * Human labelers review and annotate the relevant URLs to confirm quality.

All of this helps build a high-quality **Math Corpus** for training models.
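
Step 3 (discovering math-heavy domains) can be sketched as a per-domain hit rate over the recalled pages. The `is_math` callable stands in for the trained FastText classifier, and both thresholds are assumptions for illustration rather than values taken from the flowchart.

```python
from collections import defaultdict
from urllib.parse import urlparse


def math_heavy_domains(pages, is_math, min_fraction=0.1, min_pages=2):
    """Return {domain: fraction of its pages flagged as math}.

    pages: iterable of (url, text) pairs.
    is_math: callable(text) -> bool, e.g. a wrapper around a classifier.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for url, text in pages:
        domain = (urlparse(url).hostname or "").lower()
        if not domain:
            continue
        totals[domain] += 1
        if is_math(text):
            hits[domain] += 1
    return {
        d: hits[d] / totals[d]
        for d in totals
        if totals[d] >= min_pages and hits[d] / totals[d] >= min_fraction
    }


pages = [
    ("https://mathsite.example/a", "integral of x squared"),
    ("https://mathsite.example/b", "prove the theorem"),
    ("https://news.example/1", "election results"),
    ("https://news.example/2", "weather today"),
]
keyword_stub = lambda t: any(w in t for w in ("integral", "theorem"))
result = math_heavy_domains(pages, keyword_stub, min_fraction=0.5)
```

Domains returned here would then go to human labelers, who annotate which URL paths within each domain are actually math-related.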

---

## 🔁 Data Deduplication (数据去重)

In the **pretraining process**, data deduplication is critical and is generally categorized into three main types:

### Three Types of Deduplication:

1. **Within the Training Dataset**

   * This involves deduplication across **multiple types** and **sources** within the dataset.
   * It includes:

     * **Within-document deduplication** (like removing repeated lines, paragraphs, n-grams).
     * **Cross-document deduplication** (e.g., same content appearing in different files).
     * **Cross-dataset deduplication**, where overlap between different source datasets is removed (e.g., LLaMA trains on both CommonCrawl and C4, and C4 is itself derived from Common Crawl, so the two can overlap).

2. **Deduplication Based on Training Frequency**

   * If a sample appears multiple times during training (multiple **epochs**), deduplication should be handled depending on data type and source.
   * Some datasets are trained with different epochs, and deduplication strategies should account for that.

3. **Deduplication for Evaluation Datasets**

   * Ensure **no overlap** between training data and evaluation data (e.g., for measuring language modeling loss or performance on benchmarks).
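
One common way to check for train/evaluation overlap is shared word n-grams. The sketch below is a minimal version of that idea; the n-gram length of 8, the whitespace tokenization, and the function names are assumptions.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Set of lowercase word n-grams; empty if the text is shorter than n."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_eval_index(eval_texts, n: int = 8) -> set:
    """Union of all n-grams that appear in the evaluation set."""
    index = set()
    for text in eval_texts:
        index |= ngrams(text, n)
    return index


def is_contaminated(train_text: str, eval_index: set, n: int = 8) -> bool:
    """True if the training text shares at least one n-gram with the eval set."""
    return not ngrams(train_text, n).isdisjoint(eval_index)


eval_index = build_eval_index(["what is the capital of france and why"], n=4)
leaked = is_contaminated("quiz: what is the capital of france today", eval_index, n=4)
clean = is_contaminated("the capital city has many museums", eval_index, n=4)
```

Longer n-grams reduce false positives from common phrases, at the cost of missing paraphrased leaks.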

---

## 🔄 Deduplication Workflow (Process Overview):

| Step       | Focus                                  | Description                                                                                                                                                                                                                                                                              |
| ---------- | -------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Step 1** | Define the "Unit"                      | Based on the data’s source and structure, determine the basic deduplication unit (e.g., file, line, paragraph). Examples from LLaMA: CommonCrawl is deduplicated at the line level; Books3 at the book level (books with more than 90% overlap are removed); GitHub code at the file level. |
| **Step 2** | Deduplication **within the Unit**      | Remove duplicate lines, paragraphs, or n-grams **within** a single Unit. For example, detect and remove repeated lines in one document. This step can be seen as a form of "low-quality" filtering (e.g., Gopher).                                                                       |
| **Step 3** | Deduplication **between Units**        | Compare different Units to remove near-duplicate or exact duplicate content across different files/documents. Keep only unique Units (e.g., Gopher).                                                                                                                                     |
| **Step 4** | Deduplicate test set from training set | Filter out any test data that overlaps with the training data to avoid data leakage during evaluation.                                                                                                                                                                                   |
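
Steps 2 and 3 can be sketched with exact matching, assuming the unit is a document and the sub-unit is a line. The normalization choices (stripping whitespace, lowercasing before hashing) and the function names are assumptions; near-duplicate detection needs the fuzzier techniques discussed later.

```python
import hashlib


def dedup_lines_within(doc: str) -> str:
    """Step 2: drop repeated non-empty lines inside one document."""
    seen = set()
    kept = []
    for line in doc.splitlines():
        key = line.strip()
        if key and key in seen:
            continue  # an identical line appeared earlier in this document
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)


def dedup_between(docs):
    """Step 3: keep the first copy of each document, compared by content hash."""
    seen = set()
    unique = []
    for doc in docs:
        normalized = " ".join(doc.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Hashing the normalized text keeps memory bounded by the number of unique documents rather than their total size.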

---

## 🧠 Common Techniques

Techniques like **SimHash** and **MinHash** are used to deduplicate text at scale. These help estimate **how many unique training samples** are needed.
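
A minimal MinHash sketch, assuming word 3-gram shingles and 64 salted hash functions built from `hashlib.blake2b`. These parameters and the function names are illustrative; production systems add locality-sensitive hashing on top so that not every pair has to be compared.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Word n-grams; a text of at most n words becomes a single shingle."""
    tokens = text.lower().split()
    if len(tokens) <= n:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash_signature(items: set[str], num_perm: int = 64) -> list[int]:
    """One minimum per salted hash function; items must be non-empty."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big")
            for s in items))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


a = "the quick brown fox jumps over the lazy dog near the river bank"
b = "the quick brown fox jumps over the lazy dog near the river shore"
c = "stock markets closed higher after the central bank held interest rates"
sig_a, sig_b, sig_c = (minhash_signature(shingles(t)) for t in (a, b, c))
```

Documents whose estimated similarity exceeds a chosen threshold are treated as near-duplicates, and only one of them is kept.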

📝 **Guideline**:
You typically need about **10 trillion tokens** of data to train a model, and with **80% deduplication precision**, the results are generally reliable.

---

## 🧬 Exploring Data Diversity (数据多样性探索)

Data diversity has many dimensions:

* **Task diversity**
* **Linguistic diversity**
* **Domain/topic diversity**
* **Data source diversity**

In large-scale model training, **data quality and diversity** are critical. While different companies use data from various sources, **only about 0.5%** of data may be high quality. So, it's essential to carefully select and **filter for diversity**.

### 👉 A Common Strategy:

**Generalization → Clustering → Core Sample Selection (a three-step method)**

📘 **Reference:**
**LIMA**: *Less Is More for Alignment; maybe only 0.5% data is needed*

---

### 🧠 Illustration (Right Diagram Summary)

The image shows the process of selecting diverse and high-quality samples:

1. **Encoding**: Sentences are converted into embeddings.
2. **Mean Pooling + Normalization**: Data is preprocessed into a unified format.
3. **Clustering**: Samples are grouped in latent space.
4. **Sampling**: Samples are drawn from clusters to ensure diversity.
5. **LLM Fine-Tuning & Evaluation**: The selected samples are used to tune LLMs and evaluate them using various tasks.

---

## ⭐ Techniques to Enhance Data Diversity

Use **SimBERT** for generating embeddings. Quality scores can come from a **dedicated quality scoring model**.

### **Three Strategies for Selecting Diverse Samples:**

#### **Option 1: K-means clustering-based diversity sampling**

* Cluster data using k-means and pick representative samples.
* **Limitation**: This may prefer poor-quality samples if they dominate the clusters.

#### **Option 2: Class-weighted sampling + diversity scoring**

* Use both class-based diversity scores and quality scores to select data.
* **Limitation**: Poor clustering can lead to semantically similar samples being scattered across clusters.

#### **Option 3 (Recommended): KNN clustering + similarity-based diversity sampling**

* Use KNN to group similar samples, compute **average similarity per class**, and **reweight based on quality**.
* Finally, **resample based on both diversity and quality**.
* **Result**: The final data distribution closely approximates a normal distribution.
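
A rough sketch in the spirit of Option 3, simplified to work per sample rather than per class: each sample's weight is its quality score multiplied by one minus its average cosine similarity to its k nearest neighbours. The weighting formula, `k`, and the function names are assumptions for illustration and are not taken from the source or from D4.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def knn_avg_similarity(embeddings, k=2):
    """Average cosine similarity of each sample to its k nearest neighbours."""
    unit = [normalize(v) for v in embeddings]
    scores = []
    for i, a in enumerate(unit):
        sims = sorted(
            (sum(x * y for x, y in zip(a, b)) for j, b in enumerate(unit) if j != i),
            reverse=True)
        top = sims[:k]
        scores.append(sum(top) / len(top) if top else 0.0)
    return scores


def sampling_weights(embeddings, quality, k=2):
    """Higher weight for samples in sparse regions and with higher quality."""
    avg_sim = knn_avg_similarity(embeddings, k)
    return [max(1.0 - s, 1e-6) * q for s, q in zip(avg_sim, quality)]


def resample(items, weights, n, seed=0):
    """Draw n items with replacement, proportionally to the weights."""
    return random.Random(seed).choices(items, weights=weights, k=n)


embeddings = [[1.0, 0.0], [0.99, 0.1], [0.98, 0.05], [0.0, 1.0]]
weights = sampling_weights(embeddings, quality=[1.0, 1.0, 1.0, 1.0])
```

In this toy input the three near-identical vectors get small weights and the outlier gets a large one, so resampling pulls the selection toward under-represented regions while the quality factor keeps poor samples down.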

📘 **Reference:**
**D4**: *Improving LLM Pretraining via Document De-Duplication and Diversification*

---

📌 **Tip:**
Sentences that are semantically similar can still convey different knowledge, so managing this overlap helps maintain useful variation in training.

---

## **Practical Vertical Domain Experience**

### **Overview Diagram (Top Section)**

This flowchart describes the process of building a domain-specific (vertical) dataset from general data:

1. **Keyword Mining**:

   * Extract keywords from initial seed data (basic data you start with).
2. **Semantic Similarity-Based Recall for Candidate Seed Corpus**:

   * Use these keywords to retrieve related documents from a larger, general-purpose corpus based on semantic similarity.
   * Filter by humans to create a cleaned seed corpus.
3. **Train Vertical Domain Classifier**:

   * Use this seed corpus to train a classifier that distinguishes between relevant and irrelevant data in the domain.
   * Use the classifier to filter the current large dataset.

---

### **Key Steps (Bottom Section)**

The method mainly references how **DeepseekMath** handles vertical domain data:

---

### **1. Crawl as Much General Data as Possible**

* Used as the starting raw material (initial data library).

---

### **2. From the Chinese corpus, extract domain-related data through clustering and classification.**

* The goal: Identify domain-specific data, like documents from open-source encyclopedias, technical books, or web articles that are related to the domain.
* Expand the seed data through a structured pipeline:

---

### **Pipeline Overview**

**Keyword Initialization from General Data → Retrieve Top 10 Most Similar Documents to Seed Data → Manual Filtering**

---

### **a. Keyword Initialization from General Data**

i. **Preprocessing**:

* Clean the seed data and split text into smaller segments to retain useful phrase-level information.
* This helps classify documents with different topics or sections.

ii. **Keyword Extraction**:

* Use Jieba (a Chinese word segmentation tool) to extract keywords from the segments.
* Reference tool: \[138 Chinese NLP Tools Handbook - New Word Discovery].

iii. **Keyword Filtering**:

* Use TF-IDF and frequency analysis to do a first round of filtering, followed by human review to finalize the keyword list.

iv. **Search Top-k Documents Using Keywords**:

* Use the keywords to retrieve the most related documents (e.g., top 10 per keyword).
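
The TF-IDF filtering step can be sketched as follows, assuming term frequency is counted on the seed segments and document frequency on a general background corpus. Whitespace splitting stands in for Jieba segmentation here, and the function name and sample texts are made up for illustration.

```python
import math
from collections import Counter


def tfidf_keywords(seed_segments, background_docs, top_k=5):
    """Rank seed terms that are frequent in the seed but rare in the background."""
    background = [set(doc.lower().split()) for doc in background_docs]
    n_docs = len(background)
    doc_freq = Counter()
    for terms in background:
        doc_freq.update(terms)

    term_freq = Counter()
    for segment in seed_segments:
        term_freq.update(segment.lower().split())
    total = sum(term_freq.values())

    scores = {
        term: (count / total) * math.log((1 + n_docs) / (1 + doc_freq[term]))
        for term, count in term_freq.items()
    }
    # Sort by score, breaking ties alphabetically so the output is stable.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


seed = ["the integral of the function", "the derivative of the function"]
background = ["the weather of the day", "the news of the week", "the price of the house"]
keywords = tfidf_keywords(seed, background, top_k=3)
# keywords == ["function", "derivative", "integral"]
```

Words such as "the" and "of" appear in every background document, so their score is zero and they never reach the human review stage.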

---

### **b. Retrieve Top 10 Documents Most Similar to Seed Data**

i. **Embedding-Based Similarity**:

* Convert documents into embeddings using 512-token segments, and calculate semantic similarity.

---

### **c. Human Filtering**

i. **Final Filtering**:

* Manually review the retrieved data and select the most relevant documents as the final seed corpus.

---

### **3. Train a Classifier to Identify Domain-Specific Data**

a. **Training Data**:

* Positive examples: Clean seed documents (manually verified).
* Negative examples: Non-relevant documents selected via random sampling (same document length).

b. **Model Selection**:

* Use a BERT model with a 1024-token context window.

c. **Model Application**:

* Use the trained model to classify large-scale Chinese datasets.
* For each document, take the first 1k tokens and run classification on them.
* Keep documents with a predicted confidence above 0.5 as domain-relevant.
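
The application step can be sketched as a truncate-then-threshold loop. `score_fn` is a hypothetical stand-in for the trained BERT classifier, and whitespace tokens stand in for the model's own tokenizer. The 1k-token cut and the 0.5 threshold follow the text above.

```python
def select_domain_docs(docs, score_fn, max_tokens=1000, threshold=0.5):
    """Keep documents whose truncated head scores above the threshold.

    score_fn: callable(text) -> float confidence in [0, 1].
    """
    kept = []
    for doc in docs:
        head = " ".join(doc.split()[:max_tokens])  # classify only the head
        if score_fn(head) > threshold:
            kept.append(doc)
    return kept


# Toy scorer for illustration only: confident when the word "contract" appears.
toy_scorer = lambda text: 0.9 if "contract" in text else 0.1
docs = [
    "this contract is binding",
    "weather report",
    "filler " * 1500 + "contract",
]
selected = select_domain_docs(docs, toy_scorer)
```

The third document is dropped because its only domain signal sits beyond the first 1,000 tokens, which is the trade-off of classifying a truncated head.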

