# Pretraining Data Curation in NeMo Curator

## Table of Contents

1. [Introduction](#introduction)
2. [Getting Started](#get-start)
3. [RedPajama-Data-v2](#rpv2)
4. [Data Preprocessing](#preprocess)
5. [Deduplication](#dedup)
6. [Quality filtering](#filter)

# 1. Introduction
<a id="introduction"></a>

In this tutorial, we will show how to curate large-scale data for LLM pretraining in a distributed environment using NeMo Curator. Specifically, we will focus on the following modules in NeMo Curator:

- Language identification and separation
- Text reformatting and cleaning
- Quality filtering
- Document-level deduplication

For demonstration, we will use the [RedPajama-Data-v2](#rpv2) dataset, an open dataset for LLM pretraining.

## 1.1 System Information
Here is the information on the system this notebook was run on:

- **GPU**: 2 A100 nodes (each with 8 A100-SXM4-80GB)

- **CUDA & NVIDIA Drivers**: CUDA 12.4 with Driver 535.104.12

- **OS**: Ubuntu 22.04.4 LTS

## 1.2 Running NeMo Curator

NeMo Curator comes pre-installed in the NeMo Framework container. This notebook uses the 24.07 release of the NeMo Framework container. The user can pull the container by following the steps below:

- Get access to the NeMo Framework container on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo)

- Set your Docker credentials:


    `docker login nvcr.io`

    Username: `$oauthtoken`
    
    Password: `<NGC API Key>`
    
- Pull the NeMo Framework Container image
    
    `docker pull nvcr.io/nvidia/nemo:24.07`

Alternatively, NeMo Curator is available on [PyPI](https://pypi.org/project/nemo-curator/) and [GitHub](https://github.com/NVIDIA/NeMo-Curator).

# 2. Getting started
<a id="get-start"></a>

NeMo Curator uses Dask for parallelization. Before we start using NeMo Curator, we need to start a Dask cluster. To start a multi-node Dask cluster in Slurm, we can use the `start-distributed-notebook.sh` script in this directory. The user will need to change the following variables:

- Slurm job directives
- Device type (`cpu` or `gpu`). NeMo Curator has both CPU-based and GPU-based modules. Check [here](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/cpuvsgpu.html) to see which modules are CPU-based and/or GPU-based
- CPU-related parameters which are used for CPU-based modules: configure the number of workers and the memory limit to efficiently use available computational resources and prevent out of memory errors
- Path to the NeMo Framework container image
- Path to the `container-entrypoint.sh` script, which is responsible for launching the Dask scheduler and workers

Running the script will also launch a JupyterLab session on the rank 0 node and pass the Dask scheduler address as an environment variable to be used later for connecting to the Dask client.

The preprocessing modules such as AddId and text cleaning are CPU-based, so we will start a CPU-based Dask cluster first.

In [None]:
import os
import time
import warnings
import dask.dataframe as dd
import dask_cudf
import cudf
from dask.distributed import wait
import numpy as np

from nemo_curator import get_client
from nemo_curator.utils.distributed_utils import (
    get_num_workers,
    read_data,
    write_to_disk,
)

warnings.filterwarnings('ignore')
base_dir = "/path/to/data"


In [2]:
scheduler_address = os.getenv('SCHEDULER_ADDRESS')
cpu_client = get_client(scheduler_address=scheduler_address)
print(f"Num Workers = {get_num_workers(cpu_client)}", flush=True)

Num Workers = 256


# 3. RedPajama-Data-v2
<a id="rpv2"></a>

RedPajama-V2 (rpv2) is an advanced open-source initiative designed to support the development of large language models (LLMs). This dataset, sourced from 84 CommonCrawl snapshots, spans five major languages—English, French, Spanish, German, and Italian—making it one of the largest and most comprehensive public datasets available for LLM training.

The RedPajama-V2 dataset is available on [Hugging Face](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2).

For this tutorial, we will start with a single snapshot from rpv2 and then scale to multiple snapshots to demonstrate the pre-training data curation workflow.

The raw rpv2 data is stored as compressed JSON. We will first decompress the `json.gz` files and write them out as JSONL files. For this, we will use the helper function `convert_json_gz_to_jsonl` in `helper.py`.


In [4]:
from helper import convert_json_gz_to_jsonl

input_data_dir = os.path.join(base_dir,"rpv2-2023-06-raw")
output_data_dir = os.path.join(base_dir,"rpv2-2023-06")

t0 = time.time()
convert_json_gz_to_jsonl(input_data_dir, output_data_dir)
print(f"Uncompressing data took {time.time()-t0} s")

Uncompressing data took 890.2869493961334 s
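
The contents of `helper.py` are not shown in this notebook. Purely as an illustration, a decompression helper of this kind could look like the sketch below. The function name `decompress_json_gz` and the flat directory layout are assumptions made for the example; they are not the actual helper.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def decompress_json_gz(input_dir, output_dir):
    """Decompress every *.json.gz file in input_dir into a .jsonl file in output_dir.

    Hypothetical stand-in for the tutorial's helper; returns the number of files written.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for src in sorted(Path(input_dir).glob("*.json.gz")):
        # "shard0.json.gz" becomes "shard0.jsonl"
        dst = out / (src.name[: -len(".json.gz")] + ".jsonl")
        with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)  # stream, so large files never sit in memory
        count += 1
    return count


# Tiny demo on a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "raw"
    raw_dir.mkdir()
    with gzip.open(raw_dir / "shard0.json.gz", "wt", encoding="utf-8") as f:
        f.write('{"raw_content": "hello"}\n{"raw_content": "world"}\n')
    num_files = decompress_json_gz(raw_dir, Path(tmp) / "jsonl")
    demo_text = (Path(tmp) / "jsonl" / "shard0.jsonl").read_text(encoding="utf-8")
```

Streaming with `shutil.copyfileobj` keeps memory use flat regardless of file size, which matters when a snapshot holds thousands of shards.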


To get started, we can read the JSONL files into a `DocumentDataset`, which is the standard format for text datasets in Curator.

In [8]:
from nemo_curator.datasets import DocumentDataset

input_dataset = DocumentDataset.read_json(output_data_dir, add_filename=True)

Reading 15025 files


`DocumentDataset` is essentially a wrapper around a Dask DataFrame, which we can access via `input_dataset.df`:

In [None]:
input_dataset.df.head()

There are a total of 1,088,468,779 documents in this single snapshot.

In [10]:
len(input_dataset.df)

1088468779

# 4. Data Preprocessing
<a id="preprocess"></a>

## 4.1 Data resharding

The input text files have varying sizes, which leads to imbalanced partitions that could result in out-of-memory issues. Ideally, we want balanced text files of similar sizes. Curator offers a utility to reshard the text files into files of similar size.

In [11]:
from nemo_curator.utils.file_utils import reshard_jsonl
from nemo_curator.utils.file_utils import expand_outdir_and_mkdir

output_resharded_dir = expand_outdir_and_mkdir(os.path.join(base_dir,"rpv2-2023-06-resharded"))

t0 = time.time()
reshard_jsonl(
    output_data_dir,
    output_resharded_dir,
    output_file_size="100M",
    start_index=0,
    file_prefix="rpv2-2023-06",
)
print(f"Data sharding took:{time.time()-t0}")

Data sharding took:552.2274513244629
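
To make the idea of resharding concrete, here is a minimal pure-Python sketch that packs JSONL lines into shards under a byte budget. The function `reshard_lines` is hypothetical and only illustrates the principle; it is not how `reshard_jsonl` is implemented.

```python
def reshard_lines(lines, max_bytes):
    """Group JSONL lines into shards whose UTF-8 size stays at or under max_bytes.

    A single line larger than max_bytes ends up in a shard of its own.
    """
    shards, current, current_size = [], [], 0
    for line in lines:
        size = len(line.encode("utf-8"))
        if current and current_size + size > max_bytes:
            shards.append(current)  # budget exceeded: close the current shard
            current, current_size = [], 0
        current.append(line)
        current_size += size
    if current:
        shards.append(current)
    return shards


# Six documents of very different sizes, packed into shards of at most 600 bytes
docs = ['{"text": "' + "x" * n + '"}\n' for n in (10, 500, 20, 30, 400, 5)]
shards = reshard_lines(docs, max_bytes=600)
shard_sizes = [sum(len(line.encode("utf-8")) for line in shard) for shard in shards]
```

The original order of documents is preserved, and every shard stays within the budget, so downstream partitions end up with comparable memory footprints.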


[Optional] Remove the decompressed, un-resharded JSONL files to save disk space:

In [15]:
!rm -rf {base_dir}/rpv2-2023-06

## 4.2 Add ID

We will assign a unique ID to each document in the dataset so that we can reference it later.

In [6]:
from nemo_curator import AddId
from nemo_curator.datasets import DocumentDataset

We will create an instance of Curator's `AddId` class and use it to add an ID to every document in the dataset.

In [10]:
input_data_dir = os.path.join(base_dir,"rpv2-2023-06-resharded")
input_dataset = DocumentDataset.read_json(input_data_dir, add_filename=True)
id_data_dir = expand_outdir_and_mkdir(os.path.join(base_dir,"rpv2-2023-06-id"))

t0 = time.time()
# specify add_id function
add_id = AddId(
    id_field="id",
    id_prefix="rpv2-2023-06",
)
id_dataset = add_id(input_dataset)
id_dataset.to_json(id_data_dir, write_to_filename=True)
print(f"Adding ID took :{time.time()-t0}")

Reading 37848 files
Writing to disk complete for 37848 partitions
Adding ID took :1472.3535017967224
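
The IDs that appear in the outputs of this notebook follow a "prefix plus ten-digit number" pattern. The sketch below reproduces that pattern on a plain list of dictionaries. The function `add_ids` is hypothetical and numbers documents consecutively, whereas a distributed implementation may assign numbers per partition, so real IDs are unique but not necessarily consecutive.

```python
def add_ids(docs, id_prefix, start_index=0, id_field="id"):
    """Return copies of docs with a prefixed, zero-padded sequential ID attached."""
    return [
        {**doc, id_field: f"{id_prefix}-{start_index + i:010d}"}
        for i, doc in enumerate(docs)
    ]


docs = [{"raw_content": "first"}, {"raw_content": "second"}]
with_ids = add_ids(docs, id_prefix="rpv2-2023-06")
new_ids = [doc["id"] for doc in with_ids]
```

Embedding the snapshot name in the prefix keeps IDs unique when several snapshots are later combined.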


We can validate the added IDs below:

In [None]:
id_dataset.df.head(3)

[Optional] Remove the resharded dataset to save disk space:

In [12]:
!rm -rf {base_dir}/rpv2-2023-06-resharded

## 4.3 Language ID and Separation

Data curation usually includes steps that are language specific (e.g. using language-tuned heuristics for quality filtering). NeMo Curator provides utilities to identify languages. The language identification is performed using fastText.

Although rpv2 already ships with a preliminary language identification and we started from the English-only subset, fastText is more accurate, so it is worth running as a second pass.

In [None]:
from nemo_curator import ScoreFilter
from nemo_curator.filters import FastTextLangId
from nemo_curator.utils.file_utils import get_all_files_paths_under, separate_by_metadata

# Language ID path
language_output_path = expand_outdir_and_mkdir(os.path.join(base_dir, "rpv2-2023-06-language"))
language_data_output_path = expand_outdir_and_mkdir(os.path.join(language_output_path, "data"))

# Fasttext model path
model_path = language_output_path

# Define key in output .jsonl files to store the language information
language_field = "language"

Download the fastText model for language detection.

In [None]:
!wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -P {model_path}

We will create an instance of Curator's `ScoreFilter` and use a helper function `separate_by_metadata` to separate the dataset into subfolders based on language.

In [8]:
t0 = time.time()

# Load dataset
id_data_dir = os.path.join(base_dir,"rpv2-2023-06-id")
input_dataset = DocumentDataset.read_json(id_data_dir, add_filename=True)

# Define Language separation pipeline
lang_filter = FastTextLangId(os.path.join(model_path,'lid.176.bin'))
language_id_pipeline = ScoreFilter(
    lang_filter, 
    score_field=language_field,
    text_field="raw_content",
    score_type='object'
)
filtered_dataset = language_id_pipeline(input_dataset)

# Keep only the language label (index 1) and drop the detailed classifier score
filtered_dataset.df[language_field] = filtered_dataset.df[language_field].apply(
    lambda score: score[1], meta=(language_field, 'object')
)

# Split the dataset to corresponding language sub-folders
language_stats = separate_by_metadata(
    filtered_dataset.df, 
    language_data_output_path, 
    metadata_field=language_field
).compute()

print(f"Time taken for splitting language:{time.time()-t0}")

Reading 37848 files
Time taken for splitting language:4645.465864896774
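
Conceptually, the separation step reduces each score to its language label and then groups documents by that label. A minimal in-memory sketch is shown below. The function `separate_by_language` is hypothetical, and the `[confidence, label]` layout of the score is inferred from the `score[1]` indexing in the cell above.

```python
from collections import defaultdict


def separate_by_language(docs, language_field="language"):
    """Group documents by language label and return {label: [docs]}."""
    groups = defaultdict(list)
    for doc in docs:
        _confidence, label = doc[language_field]  # assumed layout: [confidence, label]
        groups[label].append({**doc, language_field: label})
    return dict(groups)


docs = [
    {"id": "doc-0", "language": [0.98, "EN"]},
    {"id": "doc-1", "language": [0.91, "JA"]},
    {"id": "doc-2", "language": [0.87, "EN"]},
]
by_language = separate_by_language(docs)
language_stats = {label: len(group) for label, group in by_language.items()}
```

Here `language_stats` plays the role of the per-language document counts, while each group corresponds to one language subfolder on disk.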


The English dataset has 1,088,311,520 documents, compared to 1,088,468,779 documents in the raw dataset. The difference is small because the raw dataset had already been language-identified and filtered to English.

In [10]:
en_dataset_path = os.path.join(base_dir,"rpv2-2023-06-language/data/EN")
en_dataset = DocumentDataset.read_json(en_dataset_path, add_filename=True)

len(en_dataset)

Reading 37848 files


1088311520

[Optional] Removing the ID'ed data to save disk space:

In [None]:
!rm -rf {base_dir}/rpv2-2023-06-id

In [None]:
ja_dataset_path = os.path.join(base_dir,"rpv2-2023-06-language/data/JA")
ja_dataset = DocumentDataset.read_json(ja_dataset_path, add_filename=True)

ja_dataset.df.head(1)

## 4.4 Text cleaning

Datasets may have improperly decoded unicode characters. Curator provides utilities to fix improperly decoded unicode characters based on the heuristics defined within the `ftfy` package.
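
To illustrate what "improperly decoded" means, the standard-library sketch below produces one common kind of damage (UTF-8 bytes read as Latin-1) and then reverses it. The function `undo_latin1_mojibake` is hypothetical and handles only this single case; `ftfy` applies a much broader set of heuristics.

```python
def undo_latin1_mojibake(text):
    """Reverse one specific mistake: UTF-8 bytes that were decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of damage, so leave the text alone


original = "café naïve"
# Simulate the decoding mistake: each accented character turns into two characters
garbled = original.encode("utf-8").decode("latin-1")
repaired = undo_latin1_mojibake(garbled)
untouched = undo_latin1_mojibake(original)
```

Correctly decoded text passes through unchanged, which is the property that makes this kind of repair safe to apply to every document.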

In [4]:
import nemo_curator
from nemo_curator.modifiers import UnicodeReformatter

en_dataset_path = os.path.join(base_dir,"rpv2-2023-06-language/data/EN")
en_dataset = DocumentDataset.read_json(en_dataset_path, add_filename=True)

Reading 37848 files


Curator uses the `Modify` module with `UnicodeReformatter` for text cleaning. We first set up the output directory and the name of the text field:

In [5]:
# make directory for cleaned dataset
output_clean_dir = expand_outdir_and_mkdir(os.path.join(base_dir,"rpv2-2023-06-en-cleaned"))
# specify text field name and file type
input_text_field = "raw_content"
input_file_type = "jsonl"

In [6]:
t0 = time.time()
# specify cleaner
cleaner = nemo_curator.Modify(
    UnicodeReformatter(), 
    text_field=input_text_field
)

# clean dataset and write to disk
cleaned_dataset = cleaner(en_dataset)
cleaned_dataset.to_json(output_clean_dir, write_to_filename=True)
print(f"Text cleaning took {time.time()-t0} s")

Writing to disk complete for 37848 partitions
Text cleaning took 6349.983360290527 s


[Optional] Removing intermediate data to save disk space:

In [9]:
!rm -rf {base_dir}/rpv2-2023-06-language/data/EN

# 5. Deduplication
<a id="dedup"></a>



## 5.1 Exact Deduplication

Exact deduplication computes a hash of the raw text of each document. Documents with the same hash value are treated as exact duplicates, and all but one copy are removed. Curator provides GPU-accelerated exact deduplication using RAPIDS.
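
The principle can be sketched on the CPU with the standard library. The function `find_exact_duplicates` below is a hypothetical illustration rather than Curator's implementation; the field names match the ones used in this notebook.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(docs, id_field="id", text_field="raw_content"):
    """Map each MD5 hash shared by two or more documents to the list of their IDs."""
    by_hash = defaultdict(list)
    for doc in docs:
        digest = hashlib.md5(doc[text_field].encode("utf-8")).hexdigest()
        by_hash[digest].append(doc[id_field])
    return {h: ids for h, ids in by_hash.items() if len(ids) > 1}


docs = [
    {"id": "doc-0", "raw_content": "hello world"},
    {"id": "doc-1", "raw_content": "something else"},
    {"id": "doc-2", "raw_content": "hello world"},
    {"id": "doc-3", "raw_content": "hello world"},
]
duplicate_groups = find_exact_duplicates(docs)
num_duplicated = sum(len(ids) for ids in duplicate_groups.values())
# Keep the first document of every group and drop the rest
ids_to_remove = {doc_id for ids in duplicate_groups.values() for doc_id in ids[1:]}
```

Note that the number of documents listed as duplicated (3 here) is larger than the number actually removed (2), because one copy per group is kept.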

In [None]:
from nemo_curator.modules import ExactDuplicates

def pre_imports():
    import cudf  # noqa: F401

In [3]:
scheduler_address = os.getenv('SCHEDULER_ADDRESS')
gpu_client = get_client(scheduler_address=scheduler_address)
print(f"Num Workers = {get_num_workers(gpu_client)}", flush=True)

gpu_client.run(pre_imports)
print("Pre imports complete")

Num Workers = 16
Pre imports complete


In [6]:
cleaned_dataset_path = os.path.join(base_dir,"rpv2-2023-06-en-cleaned")
log_dir = expand_outdir_and_mkdir(os.path.join(base_dir, "logs"))
input_id_field = 'id'
input_text_field = 'raw_content'
hash_method = 'md5'
output_dir = expand_outdir_and_mkdir(os.path.join(base_dir,"rpv2-2023-06-exact-dedup"))

In [7]:
t0 = time.time()
# Read the input dataset from the cleaned dataset dir
input_dataset = DocumentDataset.read_json(cleaned_dataset_path, backend='cudf')

# Perform exact dedup
exact_dups = ExactDuplicates(
    logger=log_dir,
    id_field=input_id_field,
    text_field=input_text_field,
    hash_method=hash_method,
    cache_dir=output_dir,
)
duplicates = exact_dups(dataset=input_dataset)
print(f"Exact dedup took:{time.time()-t0}")


Reading 37848 files
Exact dedup took:1275.6094808578491


Exact deduplication found 97,327,867 duplicated documents.

In [9]:
print(f"Number of exact duplicated file:{len(duplicates)}")

Number of exact duplicated file:97327867


Let's see the results of exact dedup:

In [10]:
duplicates_df = duplicates.df
duplicates_df.head()

| | id | `_hashes` |
|---|---|---|
| 0 | rpv2-2023-06-0543500671 | 5bb014b8aca49d2d2a46925b63c09f7f |
| 1 | rpv2-2023-06-1721200315 | 0dba141f62e01ffedde20dd6bf28df50 |
| 2 | rpv2-2023-06-1989800099 | 1e33a4ffce3154c8275ed09ff8049e1a |
| 3 | rpv2-2023-06-2578700629 | 11608d5ffe62efb623abdcb813f0827a |
| 4 | rpv2-2023-06-3538600607 | cb72ac618d7a6e60cf7d012c6be82672 |


We can sort the duplicate cluster by size and see that the largest cluster has 1,819 exact duplicates.

In [17]:
duplicates_df.groupby('_hashes') \
             .agg({'id': 'count'}) \
             .rename(columns={'id': 'count'}) \
             .sort_values('count', ascending=False) \
             .head()

| `_hashes` | count |
|---|---|
| b7ba44a047ca570585d182d28d1e6bf8 | 1819 |
| 0469bde3868757d92af369c59992b9d9 | 1785 |
| bdc1e82cba718a4717c683bf6a5541bd | 1784 |
| f14149344e6519beaac2590b0535d267 | 1771 |
| f88eb7064d8e73c081af0731ba73c451 | 1765 |


In [13]:
dup_group = duplicates_df[duplicates_df['_hashes'] == 'b7ba44a047ca570585d182d28d1e6bf8'].compute()
dup_group.head()

| | id | `_hashes` |
|---|---|---|
| 1 | rpv2-2023-06-0962900660 | b7ba44a047ca570585d182d28d1e6bf8 |
| 5 | rpv2-2023-06-2417100276 | b7ba44a047ca570585d182d28d1e6bf8 |
| 8 | rpv2-2023-06-2936200328 | b7ba44a047ca570585d182d28d1e6bf8 |
| 9 | rpv2-2023-06-1423100927 | b7ba44a047ca570585d182d28d1e6bf8 |
| 16 | rpv2-2023-06-2499600613 | b7ba44a047ca570585d182d28d1e6bf8 |


[Optional] Verify if the documents with the same hash are exactly the same. We can use the ids from the cell output above (ids may change so revise the `dup_ids` as needed):

In [16]:
t0 = time.time()
dup_ids = ['rpv2-2023-06-0962900660', 'rpv2-2023-06-2417100276', 'rpv2-2023-06-2936200328'] 
dup_examples = input_dataset.df[input_dataset.df['id'].isin(dup_ids)].compute()
print(f"Searching for example duplicates with specific IDs took {time.time()-t0} seconds")

Searching for example duplicates with specific IDs took 631.4109137058258 seconds


In [None]:
dup_examples

In [None]:
print('Example duplicate 1\n' + dup_examples.raw_content.iloc[0])
print('\n\nExample duplicate 2\n' + dup_examples.raw_content.iloc[1])
print('\n\nExample duplicate 3\n' + dup_examples.raw_content.iloc[2])

Now, we will remove the exact duplicates and write the remaining dataset to disk.

In [5]:
input_dataset = DocumentDataset.read_json(cleaned_dataset_path, add_filename=True, backend='cudf')
duplicates = DocumentDataset.read_parquet(os.path.join(output_dir,"_exact_duplicates.parquet"), backend='cudf')
duplicates_df = duplicates.df

Reading 37848 files
Reading 1 files


In [19]:
output_dir = expand_outdir_and_mkdir(os.path.join(base_dir,"rpv2-2023-06-exact-dup-removed"))

t0 = time.time()
docs_to_remove = duplicates_df.map_partitions(
    lambda x: x[x._hashes.duplicated(keep="first")]
)

# When there are few duplicates we can compute the results to a list and use `isin`.
result = input_dataset.df[
    ~input_dataset.df[input_id_field].isin(
        docs_to_remove[input_id_field].compute()
    )
]

write_to_disk(
    result,
    output_dir,
    write_to_filename=True,
    output_type='jsonl',
)

print(f"Removing exact duplicates took:{time.time()-t0}")

Writing to disk complete for 37848 partitions
Removing exact duplicates took:1563.168622970581


We can see that exact dedup removed 70,675,782 documents and we now have 1,017,635,738 documents left in the dataset.

In [20]:
len(docs_to_remove)

70675782

In [21]:
len(result)

1017635738

## 5.2 Fuzzy Deduplication

Fuzzy deduplication aims to find near-duplicated documents in our dataset. Near-duplicated documents are common in web crawl data due to plagiarism and mirror sites. Removing them can help improve the quality of trained models. In many cases, we can skip exact dedup and just perform fuzzy dedup as it will also find the exact duplicates. Thus, we will start with the cleaned dataset for fuzzy dedup.

NeMo Curator implements GPU-accelerated Fuzzy Deduplication based on a minhash + LSH algorithm for finding similar documents across the dataset. Specifically, Fuzzy Deduplication includes 4 steps:

- Compute minhashes
- Locality-Sensitive Hashing (LSH)
- Buckets to Edges
- Connected components


In [2]:
def pre_imports():
    import cudf  # noqa: F401

scheduler_address = os.getenv('SCHEDULER_ADDRESS')
gpu_client = get_client(scheduler_address=scheduler_address)
print(f"Num Workers = {get_num_workers(gpu_client)}", flush=True)

gpu_client.run(pre_imports)
print("Pre imports complete")

Num Workers = 16
Pre imports complete


### 5.2.1 Compute minhashes

First, we will compute the minhash signature for each document. For this purpose, each document is represented by its set of character n-grams. We apply a number of random hash functions to each element of the set; the minimum hash value produced by each hash function is recorded and becomes one component of the minhash signature. Thus, the length of the minhash signature equals the number of hash functions.
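
The description above can be sketched in pure Python. This is an illustration only: the salted BLAKE2 hash family, the helper names, and the small parameter values are assumptions of the sketch, and Curator's GPU implementation uses its own hash functions.

```python
import hashlib


def char_ngrams(text, n):
    """Set of all character n-grams (the whole text if it is not longer than n)."""
    if len(text) <= n:
        return {text}
    return {text[i : i + n] for i in range(len(text) - n + 1)}


def _hash32(data, salt):
    """32-bit hash of data; a different salt acts as a different hash function."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=4, salt=salt).digest(), "little")


def minhash_signature(text, num_hashes, ngram_size, seed=42):
    """One minimum per hash function, so the signature has num_hashes entries."""
    grams = [g.encode("utf-8") for g in char_ngrams(text, ngram_size)]
    signature = []
    for k in range(num_hashes):
        salt = (seed + k).to_bytes(8, "little")
        signature.append(min(_hash32(g, salt) for g in grams))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which two documents agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


a = "The quick brown fox jumps over the lazy dog near the river bank."
b = "The quick brown fox jumps over the lazy dog near the river bend."
c = "Completely unrelated text about GPU clusters and Dask workers."
sig_a, sig_b, sig_c = (minhash_signature(t, num_hashes=64, ngram_size=5) for t in (a, b, c))
near_dup_score = estimated_jaccard(sig_a, sig_b)
unrelated_score = estimated_jaccard(sig_a, sig_c)
```

The fraction of matching positions estimates the Jaccard similarity of the two n-gram sets, which is why near-identical texts score far higher than unrelated ones.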

In [9]:
from nemo_curator import MinHash

input_data_dir = os.path.join(base_dir,"rpv2-2023-06-en-cleaned")
seed = 42
minhash_length = 260
char_ngram = 24
log_dir = expand_outdir_and_mkdir(os.path.join(base_dir, "logs"))
id_field = 'id'
text_field = 'raw_content'
minshah_output_dir = expand_outdir_and_mkdir(os.path.join(base_dir,"rpv2-2023-06-minhash"))

In [10]:
files = get_all_files_paths_under(
    root=input_data_dir, recurse_subdirectories=False, keep_extensions="jsonl"
)
df = read_data(
    files,
    file_type="jsonl",
    backend="cudf",
    files_per_partition=1,
    add_filename=False,
)[[id_field, text_field]]

Reading 37848 files


In [33]:
t0 = time.time()

# Run MinHash() on input data
minhasher = MinHash(
    seed=seed,
    num_hashes=minhash_length,
    char_ngrams=char_ngram,
    use_64bit_hash=False,
    logger=log_dir,
    id_field=id_field,
    text_field=text_field,
    cache_dir=minshah_output_dir
)

result = minhasher(DocumentDataset(df)).df

print(f"Computing minhashes took:{time.time()-t0}")

Computing minhashes took:5161.864866495132


We can see some example outputs from the minhash computation.

In [12]:
result.head()

| | id | `_minhash_signature` |
|---|---|---|
| 0 | rpv2-2023-06-0000000000 | `[56978, 157261, 839276, 103231, 51779, 396833,...` |
| 1 | rpv2-2023-06-0000100000 | `[4644772, 2991701, 2571423, 12369524, 50603761...` |
| 2 | rpv2-2023-06-0000200000 | `[1312196, 17635, 1520869, 3337920, 2052016, 10...` |
| 3 | rpv2-2023-06-0000300000 | `[5374828, 2268627, 4903126, 2134671, 1828983, ...` |
| 4 | rpv2-2023-06-0000400000 | `[4999022, 2320370, 2068984, 3469276, 621627, 5...` |


### 5.2.2 Minhash LSH

`LSH` implements the locality-sensitive hashing step, which works as follows:

- Divide the minhash signature array into X equal portions (bands).

- For each portion, hash the minhash values into a bucket. Each document is therefore assigned to X buckets.

- Documents that land in the same bucket are deemed similar. Because every document is assigned to X buckets and two documents only need to share one bucket to be deemed similar, LSH produces more false positives than false negatives. The false positives can be filtered out by a later Jaccard similarity computation; this tutorial skips that check (see Buckets to Edges below).
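
The banding scheme can be sketched as follows. With the settings used in this notebook (260 hashes and 20 bands), each band covers 13 hash values. The helper names and the use of MD5 as the bucket hash are assumptions of the sketch; the probability formula is the standard one for banded minhash LSH.

```python
import hashlib


def band_buckets(signature, num_bands):
    """Split a signature into num_bands equal slices and hash each slice to a bucket key."""
    rows = len(signature) // num_bands
    keys = []
    for band in range(num_bands):
        chunk = signature[band * rows : (band + 1) * rows]
        # Include the band index so equal slices in different bands do not collide
        keys.append(hashlib.md5(repr((band, chunk)).encode("utf-8")).hexdigest())
    return keys


def candidate_probability(jaccard, num_bands, rows_per_band):
    """Chance that two documents with the given Jaccard similarity share at least one bucket."""
    return 1.0 - (1.0 - jaccard ** rows_per_band) ** num_bands


sig_a = list(range(260))
sig_b = list(range(260))
sig_b[0] = -1  # the two signatures disagree only inside the first band
keys_a = band_buckets(sig_a, num_bands=20)
keys_b = band_buckets(sig_b, num_bands=20)
shared_bands = sum(x == y for x, y in zip(keys_a, keys_b))

p_similar = candidate_probability(0.8, num_bands=20, rows_per_band=13)
p_dissimilar = candidate_probability(0.5, num_bands=20, rows_per_band=13)
```

Under these settings, a pair with Jaccard similarity 0.8 becomes a candidate with probability of roughly 0.68, while a pair at 0.5 does so well under 1% of the time. More bands with fewer rows per band would lower the similarity at which pairs start to collide.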

In [14]:
from nemo_curator import LSH

lsh_input_dir = os.path.join(base_dir,"rpv2-2023-06-minhash")
id_field = 'id'
output_bucket_dir = expand_outdir_and_mkdir(os.path.join(base_dir,"fuzzy-dedup-output-2023-06"))
num_bands = 20
buckets_per_shuffle = 1
minhash_field = '_minhash_signature'
minhash_length = 260
log_dir = os.path.join(base_dir, "logs")

In [3]:
t0 = time.time()

#Load MinHash output
df = dask_cudf.read_parquet(lsh_input_dir, blocksize="2GB", aggregate_files=True)

lsh = LSH(
    cache_dir=output_bucket_dir,
    num_hashes=minhash_length,
    num_buckets=num_bands,
    buckets_per_shuffle=buckets_per_shuffle,
    id_fields=id_field,
    minhash_field=minhash_field,
    logger=log_dir,
)

lsh_result = lsh(DocumentDataset(df))
print(f"LSH took {time.time()-t0} s")

LSH took 6116.864866495132 s


In [None]:
lsh_result.df.head()

### 5.2.3 Buckets to Edges

`BucketsToEdges` takes the bucket information from the output of LSH and creates an edge list dataset in which documents with the same `_bucket_id` are connected by an edge. This edge list can then be passed on to the connected components step to identify groups of similar documents across buckets. Since the false positive check is skipped, all documents within a bucket are considered duplicates of each other and are assigned a Jaccard similarity of 1.0 to avoid edge removal during the next step.
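
A small sketch of the idea is given below. Linking consecutive documents of a bucket (rather than emitting every pair) is an assumption of this sketch: it is enough to keep all members of a bucket in one connected component while producing only n - 1 edges for n documents. The function name shadows nothing in Curator and is purely illustrative.

```python
def buckets_to_edge_list(buckets):
    """Link consecutive documents of each bucket into (left, right, similarity) edges."""
    edges = set()
    for doc_ids in buckets.values():
        ordered = sorted(set(doc_ids))
        for left, right in zip(ordered, ordered[1:]):
            edges.add((left, right, 1.0))  # similarity fixed at 1.0: no false positive check
    return sorted(edges)


buckets = {
    "bucket-0": ["doc-1", "doc-2", "doc-3"],
    "bucket-1": ["doc-3", "doc-4"],
    "bucket-2": ["doc-5"],  # a bucket with a single document yields no edge
}
edges = buckets_to_edge_list(buckets)
```

Because `doc-3` appears in two buckets, the resulting edges chain `doc-1` through `doc-4` together, which is how similarity propagates across buckets.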

In [None]:
from nemo_curator import BucketsToEdges

id_field = 'id'

cache_dir = os.path.join(base_dir, "fuzzy-dedup-output-2023-06")
input_bucket_path = os.path.join(cache_dir,"_buckets.parquet")
input_bucket_field = '_bucket_id'
log_dir = os.path.join(base_dir, "logs")

In [None]:
t0 = time.time()

# Read "_buckets.parquet"
ddf_bk = DocumentDataset.read_parquet(
    input_bucket_path, 
    backend="cudf"
)

# Run BucketsToEdges()
buckets_to_edges = BucketsToEdges(
    cache_dir=cache_dir,
    id_fields=id_field,
    bucket_field=input_bucket_field, 
    logger=log_dir,
)

edgelist_df = buckets_to_edges(ddf_bk)


print(f"Buckets to Edgelist took {time.time()-t0} s")

In [None]:
edgelist_df.head()

### 5.2.4 Connected Component

After all buckets have been processed and the approximate duplicates (at the chosen threshold) discovered, we construct a sparse document graph from the edge list and find its connected components (using Curator's `ConnectedComponents` module). Each connected component represents a set of documents that we consider similar enough to be duplicates, and from which we keep a single representative.
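
For intuition, connected components can be computed on a small edge list with a plain union-find structure, as sketched below. This is a swapped-in CPU technique chosen for readability; it is not how Curator computes components at scale.

```python
def connected_components(edges):
    """Label every document that appears in an edge with its component's representative."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for left, right in edges:
        root_l, root_r = find(left), find(right)
        if root_l != root_r:
            # Attach the larger root under the smaller one for a deterministic label
            parent[max(root_l, root_r)] = min(root_l, root_r)
    return {doc: find(doc) for doc in parent}


edges = [("doc-1", "doc-2"), ("doc-2", "doc-3"), ("doc-4", "doc-5")]
groups = connected_components(edges)
```

Documents `doc-1`, `doc-2`, and `doc-3` end up in one group even though `doc-1` and `doc-3` never share an edge directly.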

In [None]:
from nemo_curator import ConnectedComponents

cache_dir = expand_outdir_and_mkdir(
    os.path.join(base_dir, "fuzzy-dedup-output-2023-06/cc-cache")
)
edgelist_path = os.path.join(base_dir, "fuzzy-dedup-output-2023-06/jaccard_similarity_results.parquet")
id_field = 'id'

output_path = expand_outdir_and_mkdir(
    os.path.join(base_dir, "fuzzy-dedup-output-2023-06/connected_components.parquet")
)

In [None]:
t0 = time.time()
components_stage = ConnectedComponents(
    cache_dir=cache_dir,
    jaccard_pairs_path=edgelist_path,
    id_column=id_field,
)
components_stage(output_path=output_path)
print(f"Connected Component took {time.time()-t0} seconds")

Let's check the results of the connected components step. We can see that 239,037,733 documents are identified as duplicates to be removed.

In [5]:
output_path = os.path.join(base_dir, "fuzzy-dedup-output-2023-06/connected_components.parquet")
cc_result = dask_cudf.read_parquet(output_path, split_row_groups=False).repartition(npartitions=1)

# Set 'group' as the index and shuffle to ensure all same 'group' values are in the same partition
cc_result = cc_result.set_index('group', shuffle='tasks')

# Define a function to assign cumulative counts and filter duplicates
def assign_cumcount(df):
    df['cumcount'] = df.groupby(level=0).cumcount()
    df = df[df['cumcount'] >= 1]
    df = df.drop(columns=['cumcount'])
    return df

# Find duplicates by applying the function to each partition
docs_to_remove = cc_result.map_partitions(assign_cumcount, meta=cc_result)

# Reset the index
docs_to_remove = docs_to_remove.reset_index()

docs_to_remove = docs_to_remove[["id"]]
docs_to_remove = docs_to_remove.rename(columns={"id":"to_remove_doc_id"})
docs_to_remove = docs_to_remove.reset_index(drop=True).persist()
_ = wait(docs_to_remove)
del _ 

print("num of docs to remove =", len(docs_to_remove))

num of docs to remove = 239037733


We can examine the size of the duplicate clusters. The largest cluster has 775,379 near duplicates.

In [7]:
cc_grouped = cc_result.groupby('group').agg({'id': 'count'}).rename(columns={'id': 'count'}).sort_values('count', ascending=False).compute()
cc_grouped.head()

| group | count |
|---|---|
| 350652173 | 775379 |
| 93521324 | 493227 |
| 24 | 112861 |
| 319292355 | 96224 |
| 70141069 | 67474 |


[Optional] Verify if fuzzy duplicates are similar. For example, we can look into the largest group "350652173".

In [None]:
dup_group = cc_result.loc[350652173].compute()
dup_group.head()

We will examine the first five documents in this cluster:

In [4]:
# read input dataset
input_data_dir = os.path.join(base_dir, "rpv2-2023-06-en-cleaned")
input_dataset = DocumentDataset.read_json(input_data_dir, add_filename=True)

Reading 37848 files


Let's visualize the content of these documents and see if they are similar (ids may change so revise the `dup_ids` as needed).

In [10]:
t0 = time.time()
dup_ids = [
    'rpv2-2023-06-1285625132',
    'rpv2-2023-06-2033200488',
    'rpv2-2023-06-0428016172',
    'rpv2-2023-06-1268721963',
    'rpv2-2023-06-1285428574'
] 
dup_examples = input_dataset.df[input_dataset.df['id'].isin(dup_ids)].compute()
print(f"Searching for near duplicate examples with specific IDs took {time.time()-t0} seconds")

Searching for near duplicate examples with specific IDs took 610.5046670436859 seconds


In [None]:
dup_examples

In [None]:
print('Example duplicate 1\n' + dup_examples.raw_content.iloc[0])
print('\n\nExample duplicate 2\n' + dup_examples.raw_content.iloc[1])
print('\n\nExample duplicate 3\n' + dup_examples.raw_content.iloc[2])
print('\n\nExample duplicate 4\n' + dup_examples.raw_content.iloc[3])
print('\n\nExample duplicate 5\n' + dup_examples.raw_content.iloc[4])

### 5.2.5 Duplicates Removal

Next, we will proceed to remove the duplicates identified from the dataset.

In [3]:

input_dataset = DocumentDataset.read_json(os.path.join(base_dir, "rpv2-2023-06-en-cleaned"), backend="cudf")
input_df = input_dataset.df[['raw_content','id']]

Reading 37848 files


Then, we will perform a merge between the `input_df` and the `docs_to_remove` on the IDs and drop the fuzzy duplicates.

In [6]:
dedup_output_dir = expand_outdir_and_mkdir(os.path.join(base_dir, "rpv2-2023-06-deduped"))
deduped_df = input_df.merge(docs_to_remove,
                             left_on=['id'],
                             right_on=["to_remove_doc_id"],
                             how='left')

deduped_df = deduped_df[deduped_df['to_remove_doc_id'].isna()].drop(columns=['to_remove_doc_id']).reset_index(drop=True)

t0 = time.time()
deduped_df.to_parquet(dedup_output_dir)
print(f"Removing duplicates and writing deduped dataset took {time.time()-t0} seconds")

Removing duplicates and writing deduped dataset took 1241.3191509246826 seconds
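
The removal logic boils down to two steps: keep one document per duplicate group, then drop every other member from the dataset. A minimal in-memory sketch with hypothetical helper names:

```python
def select_docs_to_remove(group_by_id):
    """Keep the first ID seen in every group and return all the other IDs."""
    seen_groups, remove = set(), set()
    for doc_id, group in group_by_id.items():
        if group in seen_groups:
            remove.add(doc_id)
        else:
            seen_groups.add(group)
    return remove


def drop_duplicates(docs, remove_ids, id_field="id"):
    """Anti-join: keep only the documents whose ID is not marked for removal."""
    return [doc for doc in docs if doc[id_field] not in remove_ids]


# Output of a connected components step: document ID mapped to group label
group_by_id = {"doc-1": 7, "doc-2": 7, "doc-3": 7, "doc-4": 9}
docs = [{"id": f"doc-{i}", "raw_content": f"text {i}"} for i in range(1, 6)]

remove_ids = select_docs_to_remove(group_by_id)
deduped = drop_duplicates(docs, remove_ids)
kept_ids = [doc["id"] for doc in deduped]
```

Documents that never appeared in any group (`doc-5` here) are kept untouched, and the number of remaining documents equals the input size minus the number of removed IDs, which is the same sanity check applied to the real dataset.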


To verify the results, we can confirm that we have 849,273,787 documents left compared to 1,088,311,520 in the input dataset, essentially removing 239,037,733 duplicates.

In [6]:
len(deduped_df)

849273787

In [7]:
len(input_df)

1088311520

## 5.3 Inter-snapshot Deduplication

So far we have deduplicated a single snapshot from rpv2. A pre-training dataset can include multiple snapshots so we will often need to perform inter-snapshot deduplication. For this tutorial, we will demonstrate deduplication across two snapshots as an example.

We first performed all the above steps for another snapshot `2023-14` and then combined the two deduped datasets into one and stored them in `rpv2-2023-06-and-14-deduped`.

Next, we will perform the fuzzy deduplication on the combined dataset.

### 5.3.1 Compute Minhash

In [None]:
from nemo_curator import MinHash, LSH, BucketsToEdges, ConnectedComponents

from nemo_curator.utils.file_utils import reshard_jsonl

In [23]:
seed = 42
minhash_length = 260
char_ngram = 24
log_dir = expand_outdir_and_mkdir(os.path.join(base_dir, "logs"))
id_field = 'id'
text_field = 'raw_content'
minshah_output_dir = expand_outdir_and_mkdir(os.path.join(base_dir,"rpv2-2023-06-and-14-minhash"))

In [9]:
input_data_dir = os.path.join(base_dir,"rpv2-2023-06-and-14-deduped")

files = []
for file in os.listdir(input_data_dir):
    file_path = os.path.join(input_data_dir, file)
    if file.endswith('.part'):
        # Rename ".part" outputs to ".jsonl" so they are picked up as JSONL files
        new_file_path = os.path.join(input_data_dir, file.replace('.part', '.jsonl'))
        os.rename(file_path, new_file_path)
        file_path = new_file_path
    files.append(file_path)


In [19]:
files = [f for f in files if f.endswith(".jsonl")]
df = read_data(
    files,
    file_type="jsonl",
    backend="cudf",
    files_per_partition=2,
    add_filename=False,
)[[id_field, text_field]]

Reading 72797 files


In [24]:
t0 = time.time()

# Run MinHash() on input data
minhasher = MinHash(
    seed=seed,
    num_hashes=minhash_length,
    char_ngrams=char_ngram,
    use_64bit_hash=False,
    logger=log_dir,
    id_field=id_field,
    text_field=text_field,
    cache_dir=minshah_output_dir
)

result = minhasher(DocumentDataset(df)).df

print(f"Computing minhashes took:{time.time()-t0}")

Computing minhashes took:6115.702769517899


In [25]:
result.head()

| | id | `_minhash_signature` |
|---|---|---|
| 0 | rpv2-2023-06-0678400000 | `[36422228, 15993596, 3538361, 16103012, 194100...` |
| 1 | rpv2-2023-06-0678500000 | `[34662, 17635, 1112347, 293654, 313382, 160184...` |
| 2 | rpv2-2023-06-0678600000 | `[15076006, 1801689, 3181854, 2949398, 5699436,...` |
| 3 | rpv2-2023-06-0678700000 | `[13528976, 2438382, 26260517, 26187347, 249748...` |
| 4 | rpv2-2023-06-0678800000 | `[2550974, 157261, 1536526, 1169030, 576861, 10...` |


### 5.3.2 Minhash LSH

In [7]:
lsh_input_dir = os.path.join(base_dir,"rpv2-2023-06-and-14-minhash")
id_field = 'id'
output_bucket_dir = expand_outdir_and_mkdir(os.path.join(base_dir,"fuzzy-dedup-output-2023-06-and-14"))
num_bands = 20
buckets_per_shuffle = 1
minhash_field = '_minhash_signature'
minhash_length = 260
log_dir = os.path.join(base_dir, "logs")

In [8]:
t0 = time.time()

#Load MinHash output
df = dask_cudf.read_parquet(lsh_input_dir, blocksize="2GB", aggregate_files=True)

lsh = LSH(
    cache_dir=output_bucket_dir,
    num_hashes=minhash_length,
    num_buckets=num_bands,
    buckets_per_shuffle=buckets_per_shuffle,
    id_fields=id_field,
    minhash_field=minhash_field,
    logger=log_dir,
)

lsh_result = lsh(DocumentDataset(df))
print(f"LSH took {time.time()-t0} s")

LSH took 10536.635195493698 s


In [None]:
lsh_result.df.head()

### 5.3.3 Buckets to Edges

In [6]:
input_data_paths = [os.path.join(base_dir,"rpv2-2023-06-and-14-deduped")]

id_field = 'id'
cache_dir = os.path.join(base_dir, "fuzzy-dedup-output-2023-06-and-14")
input_bucket_path = os.path.join(cache_dir,"_buckets.parquet")
input_bucket_field = '_bucket_id'
log_dir = os.path.join(base_dir, "logs")

In [None]:
t0 = time.time()

# Read "_buckets.parquet"
ddf_bk = DocumentDataset.read_parquet(
    input_bucket_path, 
    backend="cudf"
)

# Run BucketsToEdges()
buckets_to_edges = BucketsToEdges(
    cache_dir=cache_dir,
    id_fields=id_field,
    bucket_field=input_bucket_field, 
    logger=log_dir,
)

edgelist_df = buckets_to_edges(ddf_bk)


print(f"Buckets to Edgelist took {time.time()-t0} s")

In [None]:
edgelist_df.head()
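
As an illustration of what turning buckets into edges means, the sketch below links the documents of each bucket into a chain, so a bucket with n documents yields n - 1 edges rather than n(n - 1)/2 pairs. Whether `BucketsToEdges` uses exactly this scheme is an assumption; the function name and toy data are hypothetical.

```python
def buckets_to_edges(buckets):
    """Link consecutive (sorted) document ids within each bucket."""
    edges = set()
    for doc_ids in buckets:
        ordered = sorted(set(doc_ids))
        for left, right in zip(ordered, ordered[1:]):
            edges.add((left, right))
    return sorted(edges)


toy_edges = buckets_to_edges([["d3", "d1", "d2"], ["d2", "d4"], ["d5"]])
print(toy_edges)  # [('d1', 'd2'), ('d2', 'd3'), ('d2', 'd4')]
```

A chain is enough here because the next step only needs the documents of a bucket to be connected, not pairwise linked.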

### 5.3.4 Connected Components

In [5]:
cache_dir = expand_outdir_and_mkdir(
    os.path.join(base_dir, "fuzzy-dedup-output-2023-06-and-14/cc-cache")
)
edgelist_path = os.path.join(base_dir, "fuzzy-dedup-output-2023-06-and-14/jaccard_similarity_results.parquet")
id_field = 'id'
output_path = expand_outdir_and_mkdir(
    os.path.join(base_dir, "fuzzy-dedup-output-2023-06-and-14/connected_components.parquet")
)

In [None]:
t0 = time.time()
components_stage = ConnectedComponents(
    cache_dir=cache_dir,
    jaccard_pairs_path=edgelist_path,
    id_column=id_field,
)
components_stage(output_path=output_path)
print(f"Connected Component took {time.time()-t0} seconds")
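
Conceptually, this step assigns every document to a group such that any two documents joined by a path of edges share the same group. A minimal CPU sketch using union-find is shown below. It is an assumption-laden illustration with a hypothetical function name, not the GPU algorithm that `ConnectedComponents` runs.

```python
def connected_components(edges):
    """Return sorted groups of nodes that are linked by the given edges."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left, right in edges:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Attach the larger root to the smaller one so the result is deterministic.
            parent[max(root_left, root_right)] = min(root_left, root_right)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in groups.values())


toy_groups = connected_components([("d1", "d2"), ("d2", "d3"), ("d4", "d5")])
print(toy_groups)  # [['d1', 'd2', 'd3'], ['d4', 'd5']]
```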

### 5.3.5 Duplicates Removal

From the output of the Connected Components step, we can identify the documents to remove. Inter-snapshot deduplication found 81,764,804 duplicates.

In [4]:
output_path = os.path.join(base_dir, "fuzzy-dedup-output-2023-06-and-14/connected_components.parquet")
cc_result = dask_cudf.read_parquet(output_path, split_row_groups=False).repartition(npartitions=1)

# Set 'group' as the index and shuffle to ensure all same 'group' values are in the same partition
cc_result = cc_result.set_index('group', shuffle='tasks')

# Define a function to assign cumulative counts and filter duplicates
def assign_cumcount(df):
    df['cumcount'] = df.groupby(level=0).cumcount()
    df = df[df['cumcount'] >= 1]
    df = df.drop(columns=['cumcount'])
    return df

# Find duplicates by applying the function to each partition
docs_to_remove = cc_result.map_partitions(assign_cumcount, meta=cc_result)

# Reset the index
docs_to_remove = docs_to_remove.reset_index()

# Keep "id" as a one-column DataFrame so the column can be renamed and merged on
docs_to_remove = docs_to_remove[["id"]]
docs_to_remove = docs_to_remove.rename(columns={"id":"to_remove_doc_id"})
docs_to_remove = docs_to_remove.reset_index(drop=True).persist()
_ = wait(docs_to_remove)
del _ 

print("docs_to_remove", len(docs_to_remove))

docs_to_remove 81764804
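
The cumulative-count logic above keeps the first document of every group and marks the rest for removal, so a group of n documents contributes n - 1 ids. The same rule in plain Python, on made-up `(group, id)` pairs and with a hypothetical function name:

```python
def docs_to_drop(rows):
    """Keep the first id seen for each group and return every later id."""
    seen_groups = set()
    to_remove = []
    for group, doc_id in rows:
        if group in seen_groups:
            to_remove.append(doc_id)
        else:
            seen_groups.add(group)
    return to_remove


toy_rows = [(0, "a"), (0, "b"), (1, "c"), (0, "d"), (2, "e"), (2, "f")]
toy_removed = docs_to_drop(toy_rows)
print(toy_removed)  # ['b', 'd', 'f']
```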


Before removing the duplicates, we suggest resharding the data, because duplicate removal within the single snapshots may have left some partitions empty.

In [10]:
output_resharded_dir = expand_outdir_and_mkdir(os.path.join(base_dir, "rpv2-2023-06-and-14-deduped-resharded"))

t0 = time.time()
reshard_jsonl(
    os.path.join(base_dir, "rpv2-2023-06-and-14-deduped"),
    output_resharded_dir,
    output_file_size="100M",
    start_index=0,
    file_prefix="rpv2-2023-06-and-14-deduped",
)
print(f"Data sharding took:{time.time()-t0}")

Data sharding took:904.7163739204407
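
The idea of size-based resharding can be sketched as follows: consecutive records are packed into a shard until the next record would push it past the size limit. This is an illustration under stated assumptions, with a hypothetical function name and made-up record sizes. The internals of `reshard_jsonl` are not shown in this notebook and may differ.

```python
def plan_shards(line_sizes, max_shard_bytes):
    """Pack consecutive records into shards of at most max_shard_bytes.

    A single record larger than the limit still gets a shard of its own.
    """
    shards, current, current_bytes = [], [], 0
    for index, size in enumerate(line_sizes):
        if current and current_bytes + size > max_shard_bytes:
            shards.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size
    if current:
        shards.append(current)
    return shards


toy_shards = plan_shards([40, 40, 40, 90, 10], max_shard_bytes=100)
print(toy_shards)  # [[0, 1], [2], [3, 4]]
```

Evenly sized shards give evenly sized partitions, which keeps the workers balanced in the merge that follows.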


In [5]:
input_dataset = DocumentDataset.read_json(os.path.join(base_dir, "rpv2-2023-06-and-14-deduped-resharded"), backend="cudf")
input_df = input_dataset.df[['raw_content','id']]

Reading 72780 files


In [7]:
# No leading slash here: os.path.join would otherwise discard base_dir
dedup_output_dir = expand_outdir_and_mkdir(os.path.join(base_dir, "rpv2-2023-06-and-14-inter-deduped"))
deduped_df = input_df.merge(docs_to_remove,
                             left_on=['id'],
                             right_on=["to_remove_doc_id"],
                             how='left')

deduped_df = deduped_df[deduped_df['to_remove_doc_id'].isna()].drop(columns=['to_remove_doc_id']).reset_index(drop=True)

t0 = time.time()
deduped_df.to_parquet(dedup_output_dir)
print(f"Removing duplicates and writing deduped dataset took {time.time()-t0} seconds")

Removing duplicates and writing deduped dataset took 2084.46063041687 seconds
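
The left merge followed by the `isna()` filter is a left anti-join: it keeps only the rows of `input_df` whose `id` has no match in `docs_to_remove`. A small plain-Python equivalent on made-up records, with a hypothetical function name:

```python
def anti_join(documents, ids_to_remove):
    """Return the documents whose "id" is not in ids_to_remove."""
    remove = set(ids_to_remove)
    return [doc for doc in documents if doc["id"] not in remove]


toy_docs = [
    {"id": "d1", "raw_content": "first"},
    {"id": "d2", "raw_content": "copy of first"},
    {"id": "d3", "raw_content": "third"},
]
toy_kept = anti_join(toy_docs, ["d2"])
print([doc["id"] for doc in toy_kept])  # ['d1', 'd3']
```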


We can verify that the deduped dataset has 1,585,546,179 documents, compared to 1,667,310,983 documents before dedup.

In [8]:
len(deduped_df)

1585546179

In [9]:
len(input_df)

1667310983

# 6. Quality Filtering
<a id="filter"></a>

Web-crawled datasets often contain low-quality documents that we do not want the model to learn from. We can remove them with quality filtering. NeMo Curator offers modules for both classifier-based and heuristic-based filtering. In this tutorial, we apply a list of heuristic filters to improve data quality.

Curator provides a generic list of heuristic filters, but for this tutorial we select only 10 filters for demonstration purposes. The selected filters are listed in `config/heuristic_filter_en.yaml`.
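
To give a feel for what a heuristic filter does, here is a minimal sketch of two rule-based checks chained into a pipeline. The filter names, the symbols, and the thresholds are hypothetical and chosen only for the example. They are not the contents of `config/heuristic_filter_en.yaml`, and the code does not use Curator's filter classes.

```python
def word_count_filter(min_words, max_words):
    """Keep documents whose whitespace-separated word count is within bounds."""
    def keep(text):
        return min_words <= len(text.split()) <= max_words
    return keep


def symbol_ratio_filter(max_ratio, symbols=("#", "...")):
    """Keep documents with few symbol occurrences relative to the word count."""
    def keep(text):
        words = text.split()
        if not words:
            return False
        hits = sum(text.count(symbol) for symbol in symbols)
        return hits / len(words) <= max_ratio
    return keep


def run_pipeline(texts, filters):
    """A document survives only if every filter keeps it."""
    return [text for text in texts if all(keep(text) for keep in filters)]


toy_texts = [
    "buy now #deal #sale #cheap",
    "The quick brown fox jumps over the lazy dog near the river bank",
    "too short",
]
toy_filters = [word_count_filter(5, 1000), symbol_ratio_filter(0.1)]
toy_result = run_pipeline(toy_texts, toy_filters)
print(toy_result)
```

Each filter is cheap and independent of the others, which is why this kind of filtering parallelizes well across many CPU workers.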

Heuristic filtering in Curator is a CPU module, so we need to use the CPU cluster.

In [2]:
scheduler_address = os.getenv('SCHEDULER_ADDRESS')
cpu_client = get_client(scheduler_address=scheduler_address)
print(f"Num Workers = {get_num_workers(cpu_client)}", flush=True)

Num Workers = 256


In [None]:
from nemo_curator.utils.config_utils import build_filter_pipeline

filter_config_file = os.path.join(base_dir, "config/heuristic_filter_en.yaml")
hf_input_data_dir = os.path.join(base_dir, "rpv2-2023-06-and-14-inter-deduped")
kept_document_dir = expand_outdir_and_mkdir(os.path.join(base_dir, "rpv2-2023-06-and-14-heuristic-filtering", "hf.parquet"))

In [4]:
t0 = time.time()

# Load dataset
dataset = DocumentDataset.read_parquet(hf_input_data_dir)

# construct pipeline from config
filter_pipeline = build_filter_pipeline(filter_config_file)

# filter data and write to disk
filtered_dataset = filter_pipeline(dataset)
filtered_dataset.to_parquet(kept_document_dir)

print(f"Time taken for Heuristic filtering: {time.time()-t0} s")

Reading 72780 files
Writing to disk complete for 72780 partitions
Time taken for Heuristic filtering: 5647.508106470108 s


After filtering, 1,229,679,047 documents remain; 355,867,132 documents were removed from the deduped dataset.

In [5]:
len(filtered_dataset)

1229679047
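
The document counts reported in this notebook can be cross-checked with a few lines of arithmetic. The sketch below uses only the three counts printed above. The stage labels and the helper name `stage_drops` are ours.

```python
stages = [
    ("before inter-snapshot dedup", 1_667_310_983),
    ("after inter-snapshot dedup", 1_585_546_179),
    ("after heuristic filtering", 1_229_679_047),
]


def stage_drops(stages):
    """Return (stage name, documents removed, fraction of the previous stage)."""
    drops = []
    for (_, prev_count), (name, count) in zip(stages, stages[1:]):
        removed = prev_count - count
        drops.append((name, removed, removed / prev_count))
    return drops


drops = stage_drops(stages)
for name, removed, fraction in drops:
    print(f"{name}: {removed:,} documents removed ({fraction:.2%} of the previous stage)")
```

The first difference equals the 81,764,804 duplicates found by inter-snapshot dedup, and the second equals the 355,867,132 documents removed by heuristic filtering.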

[Optional] Examine examples of the low-quality documents that were removed:

In [5]:
from helper import get_dataframe_complement

original_df = dd.read_parquet(hf_input_data_dir)
filtered_df = dd.read_parquet(kept_document_dir)
removed_df = get_dataframe_complement(original_df, filtered_df)
removed_df_example = removed_df.head()

In [None]:
print(removed_df_example.raw_content.iloc[0])

In [None]:
print(removed_df_example.raw_content.iloc[1])