# Pre-training Data Curation in NeMo Curator

NeMo Curator is a Python library of scalable data-mining modules for curating natural language processing (NLP) data used to train large language models (LLMs). Its modules enable NLP researchers to mine high-quality text at scale from massive uncurated web corpora.

NeMo Curator includes the following modules to perform data curation:

- Data download and Extraction
- Language identification and separation
- Text reformatting and cleaning
- Quality filtering
- Document-level deduplication
- Multilingual downstream-task decontamination
- Distributed Data Classification
- Personally identifiable information (PII) redaction

# Table of Contents

1. [Introduction](#introduction)
2. [Getting Started](#get-start)
3. [RedPajama-Data-v2](#rpv2)
4. [Data Preprocessing](#preprocess)
5. [Deduplication](#dedup)
6. [Quality filtering](#filter)

# Introduction
<a id="introduction"></a>

In this tutorial, we will demonstrate how to curate an LLM pre-training dataset using NeMo Curator.

## System Information
Here is the information on the system this notebook was run on:

- **GPU**: 2 A100 nodes (each with 8 A100-SXM4-80GB)

- **CUDA & Nvidia Drivers**: CUDA 12.2 with Driver 535.104.12

- **OS**: Ubuntu 20.04.5 LTS

## Running NeMo-Curator

NeMo Curator comes pre-installed in the NeMo Framework container. This notebook uses the 24.07 release of the NeMo Framework container. You can pull the container by following the steps below:

- Get access to the NeMo Framework container on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo)

- Set your docker credentials


    `docker login nvcr.io`

    Username: `$oauthtoken`
    
    Password: `<NGC_API_KEY Key>`
    
- Pull the NeMo Framework Container image
    
    `docker pull nvcr.io/nvidia/nemo:24.07`

Alternatively, NeMo Curator is also available on [PyPI](https://pypi.org/project/nemo-curator/) and [GitHub](https://github.com/NVIDIA/NeMo-Curator).

# Getting started
<a id="get-start"></a>

NeMo Curator uses Dask for parallelization, so we need to start a Dask cluster before using Curator. To start a multi-node Dask cluster on Slurm, we can use the `start-distributed-notebook.sh` script in this directory. You will want to change the following:

- Slurm job directives
- Device type (`cpu` or `gpu`). Curator has both cpu and gpu modules. Check [here](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/cpuvsgpu.html) to see which modules are cpu/gpu
- Path to the NeMo Framework container image
- Path to the `container-entry-point.sh` script, which is responsible for launching the Dask scheduler and workers

Running the script will also launch a JupyterLab session on the rank 0 node and pass the Dask scheduler address as an environment variable, which is used later to connect the Dask client.

The preprocessing modules such as Add ID and Text cleaning are cpu-based so we will start a cpu dask cluster first.

In [1]:
import os
import time
from dask.distributed import Client
import warnings
import dask.dataframe as dd
import dask_cudf
import cudf
import gzip
import json
import dask.bag as db
import glob
from dask.distributed import wait
import numpy as np

from nemo_curator import get_client
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.distributed_utils import (
    get_num_workers,
    read_data,
    write_to_disk,
)
from nemo_curator.utils.file_utils import (
    expand_outdir_and_mkdir, 
    get_all_files_paths_under, 
    separate_by_metadata,
    get_batched_files,
)

warnings.filterwarnings('ignore')
base_dir = "/path/to/data"



In [2]:
scheduler_address = os.getenv('SCHEDULER_ADDRESS')
cpu_client = get_client(scheduler_address=scheduler_address)
print(f"Num Workers = {get_num_workers(cpu_client)}", flush=True)

Num Workers = 256


# RedPajama-Data-v2
<a id="rpv2"></a>

RedPajama-V2 (rpv2) is an advanced open-source initiative designed to support the development of large language models (LLMs). This dataset, sourced from 84 CommonCrawl snapshots, spans five major languages—English, French, Spanish, German, and Italian—making it one of the largest and most comprehensive public datasets available for LLM training.

The RedPajama-V2 dataset is available on [Hugging Face](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2).

For this tutorial, we will start with a single snapshot from rpv2 and then scale to multiple snapshots to demonstrate the pre-training data curation workflow.

The raw rpv2 data is stored as compressed JSON. We will first decompress the `json.gz` files and write them out as jsonl files. For this, we will use the helper function `convert_json_gz_to_jsonl` from `helper.py`.


In [None]:
from helper import convert_json_gz_to_jsonl

In [4]:
input_data_dir = os.path.join(base_dir,"rpv2-2023-06-raw")
output_data_dir = os.path.join(base_dir,"rpv2-2023-06")

t0 = time.time()
convert_json_gz_to_jsonl(input_data_dir, output_data_dir)
print(f"Uncompressing data took {time.time()-t0} s")

Uncompressing data took 890.2869493961334 s
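
The helper itself lives in `helper.py` and is not reproduced in this notebook. As a rough idea of what such a conversion involves, here is a minimal single-process sketch. The function name `decompress_json_gz_dir` is hypothetical, and the sketch assumes each `.json.gz` archive already holds one JSON document per line, so decompression alone yields valid jsonl.

```python
import gzip
import shutil
from pathlib import Path


def decompress_json_gz_dir(input_dir: str, output_dir: str) -> int:
    """Decompress every *.json.gz file in input_dir into a .jsonl file in output_dir.

    Hypothetical stand-in for the tutorial's helper; assumes each archive
    already stores one JSON document per line. Returns the number of files written.
    """
    out_root = Path(output_dir)
    out_root.mkdir(parents=True, exist_ok=True)
    converted = 0
    for gz_path in sorted(Path(input_dir).glob("*.json.gz")):
        # "doc.json.gz" -> "doc.jsonl"
        out_path = out_root / (gz_path.name[: -len(".json.gz")] + ".jsonl")
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            # Stream the bytes so large files never sit fully in memory.
            shutil.copyfileobj(src, dst)
        converted += 1
    return converted
```

At the scale of a full snapshot, this loop would presumably be distributed across workers rather than run in a single process.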


To get started, we can read the jsonl files into a `DocumentDataset`, which is the standard format for text datasets in Curator.

In [8]:
from nemo_curator.datasets import DocumentDataset

input_dataset = DocumentDataset.read_json(output_data_dir, add_filename=True)

Reading 15025 files


`DocumentDataset` is essentially a wrapper around a Dask DataFrame, and we can get the underlying DataFrame by calling `input_dataset.df`:

In [None]:
input_dataset.df.head()

There are a total of 1,088,468,779 documents in this single snapshot.

In [10]:
len(input_dataset.df)

1088468779

# 4. Data Preprocessing
<a id="preprocess"></a>

## 4.1 Data resharding

The input text files have varying sizes, which leads to imbalanced partitions that can cause out-of-memory issues. Ideally, we want balanced text files of similar sizes. Curator offers a utility to reshard the text files to similar sizes.

In [11]:
from nemo_curator.utils.file_utils import reshard_jsonl
from nemo_curator.utils.file_utils import expand_outdir_and_mkdir

output_resharded_dir = expand_outdir_and_mkdir(os.path.join(base_dir,"rpv2-2023-06-resharded"))

t0 = time.time()
reshard_jsonl(
    output_data_dir,
    output_resharded_dir,
    output_file_size="100M",
    start_index=0,
    file_prefix="rpv2-2023-06",
)
print(f"Data sharding took:{time.time()-t0}")

Data sharding took:552.2274513244629
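
To illustrate the idea behind resharding, the sketch below greedily packs jsonl lines into shards that stay under a byte budget. It is a simplified, in-memory illustration and not the implementation of `reshard_jsonl`; the function name `reshard_lines` and the 50-byte budget are made up for the example.

```python
from typing import Iterable, List


def reshard_lines(lines: Iterable[str], max_bytes: int) -> List[List[str]]:
    """Greedily pack JSONL lines into shards of at most max_bytes each.

    A single line larger than max_bytes still gets a shard of its own.
    """
    shards: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for line in lines:
        size = len(line.encode("utf-8"))
        # Start a new shard when adding this line would exceed the budget.
        if current and current_size + size > max_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.append(line)
        current_size += size
    if current:
        shards.append(current)
    return shards


docs = ['{"raw_content": "a"}\n'] * 5  # 21 bytes per line
shard_sizes = [len(shard) for shard in reshard_lines(docs, max_bytes=50)]
print(shard_sizes)
```

Because every shard ends up close to the same size, each Dask partition carries a comparable amount of text, which is what keeps memory usage predictable.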


Removing the raw dataset to save disk space:

In [15]:
!rm -rf /lustre/fsw/portfolios/coreai/users/yayu/data.fs5/rpv2-2023-06

## 4.2 Add ID

We will assign a unique ID to each document in the dataset so that we can reference them.

In [6]:
from nemo_curator import AddId
from nemo_curator.datasets import DocumentDataset

We will create an instance of Curator's `AddId` class and use it to add an ID to every document in the dataset.

In [10]:
input_data_dir = os.path.join(base_dir,"rpv2-2023-06-resharded")
input_dataset = DocumentDataset.read_json(input_data_dir, add_filename=True)
id_data_dir = expand_outdir_and_mkdir(os.path.join(base_dir,"rpv2-2023-06-id"))

t0 = time.time()
# specify add_id function
add_id = AddId(
    id_field="id",
    id_prefix="rpv2-2023-06",
)
id_dataset = add_id(input_dataset)
id_dataset.to_json(id_data_dir, write_to_filename=True)
print(f"Adding ID took :{time.time()-t0}")

Reading 37848 files
Writing to disk complete for 37848 partitions
Adding ID took :1472.3535017967224


We can validate the added IDs below:

In [None]:
id_dataset.df.head(3)
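
The IDs that appear in later outputs of this notebook (for example `rpv2-2023-06-0000000000`) consist of the prefix followed by a zero-padded 10-digit number. The sketch below reproduces only that format. How `AddId` distributes counters across partitions is not shown in this notebook, so the consecutive numbering and the helper name `make_ids` are assumptions.

```python
def make_ids(prefix: str, count: int, start: int = 0, width: int = 10) -> list:
    """Build IDs of the form '<prefix>-<zero-padded counter>'."""
    return [f"{prefix}-{i:0{width}d}" for i in range(start, start + count)]


example_ids = make_ids("rpv2-2023-06", 3)
print(example_ids)
```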

[Optional] Remove the sharded dataset to save disk space:

In [12]:
!rm -rf /lustre/fsw/portfolios/coreai/users/yayu/data.fs5/rpv2-2023-06-resharded

## 4.3 Language ID and Separation

Data curation usually includes steps that are language specific (e.g. using language-tuned heuristics for quality filtering). NeMo Curator provides utilities to identify languages. The language identification is performed using fastText.

It is worth mentioning that even though a preliminary language identification has already been performed on rpv2 and we started with an English-only dataset, fastText is more accurate, so it can be used for a second pass.

In [3]:
from nemo_curator import ScoreFilter, Modify
from nemo_curator.filters import FastTextLangId
from nemo_curator.modifiers import UnicodeReformatter
from nemo_curator.utils.file_utils import get_all_files_paths_under, separate_by_metadata

# Language ID path
language_output_path = expand_outdir_and_mkdir(os.path.join(base_dir,"rpv2-2023-06-language"))
language_data_output_path = expand_outdir_and_mkdir(os.path.join(language_output_path,"data"))

# Fasttext model path
model_path = language_output_path

# Define key in output .jsonl files to store the language information
language_field = "language"

Download the fastText model for language detection.

In [4]:
!wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -P {model_path}

--2024-08-23 16:03:29--  https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 13.227.74.12, 13.227.74.45, 13.227.74.118, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|13.227.74.12|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 131266198 (125M) [application/octet-stream]
Saving to: ‘/lustre/fsw/portfolios/coreai/users/yayu/data.fs5/rpv2-2023-06-language/lid.176.bin’


2024-08-23 16:03:35 (25.5 MB/s) - ‘/lustre/fsw/portfolios/coreai/users/yayu/data.fs5/rpv2-2023-06-language/lid.176.bin’ saved [131266198/131266198]



We will create an instance of Curator's `ScoreFilter` and use a helper function `separate_by_metadata` to separate the dataset into subfolders based on language.

In [8]:
t0 = time.time()

# Load dataset
id_data_dir = os.path.join(base_dir,"rpv2-2023-06-id")
input_dataset = DocumentDataset.read_json(id_data_dir, add_filename=True)

# Define Language separation pipeline
lang_filter = FastTextLangId(os.path.join(model_path,'lid.176.bin'))
language_id_pipeline = ScoreFilter(
    lang_filter, 
    score_field=language_field,
    text_field="raw_content",
    score_type='object'
)
filtered_dataset = language_id_pipeline(input_dataset)

# drop the detailed classifier score
filtered_dataset.df[language_field] = filtered_dataset.df[language_field].apply(
    lambda score: score[1],meta = (language_field, 'object')
    )

# Split the dataset to corresponding language sub-folders
language_stats = separate_by_metadata(
    filtered_dataset.df, 
    language_data_output_path, 
    metadata_field=language_field
).compute()

print(f"Time taken for splitting language:{time.time()-t0}")

Reading 37848 files
Time taken for splitting language:4645.465864896774
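
The two post-processing steps in the cell above (keeping only the language label, then splitting by that label) can be pictured with plain Python. This is a toy sketch on hypothetical records, not the Dask-based `separate_by_metadata`; it assumes the score field holds a `[score, label]` pair, which is what indexing with `score[1]` in the cell above suggests.

```python
from collections import defaultdict


def separate_by_field(records, field):
    """Group records by the value of `field`, mimicking one sub-folder per value."""
    groups = defaultdict(list)
    for record in records:
        groups[record[field]].append(record)
    return dict(groups)


# Hypothetical records: `language` holds a [score, label] pair.
records = [
    {"id": "doc-0", "language": [0.98, "EN"]},
    {"id": "doc-1", "language": [0.91, "JA"]},
    {"id": "doc-2", "language": [0.87, "EN"]},
]
for record in records:
    record["language"] = record["language"][1]  # keep the label, drop the score

groups = separate_by_field(records, "language")
language_counts = {lang: len(docs) for lang, docs in groups.items()}
print(language_counts)
```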


The English dataset has 1,088,311,520 documents compared to 1,088,468,779 documents in the raw dataset. The difference is small (157,259 documents) because the raw dataset had already been language-identified and filtered to English.

In [10]:
en_dataset_path = os.path.join(base_dir,"rpv2-2023-06-language/data/EN")
en_dataset = DocumentDataset.read_json(en_dataset_path, add_filename=True)

len(en_dataset)

Reading 37848 files


1088311520

[Optional] Removing the ID'ed data to save disk space:

In [None]:
!rm -rf "/lustre/fsw/portfolios/coreai/users/yayu/data.fs5/rpv2-2023-06-id"

In [None]:
ja_dataset_path = os.path.join(base_dir,"rpv2-2023-06-language/data/JA")
ja_dataset = DocumentDataset.read_json(ja_dataset_path, add_filename=True)

ja_dataset.df.head(1)

## 4.4 Text cleaning

Datasets may have improperly decoded unicode characters. Curator provides utilities to fix improperly decoded unicode characters based on the heuristics defined within the `ftfy` package.

In [4]:
import nemo_curator
from nemo_curator.modifiers import UnicodeReformatter

en_dataset_path = os.path.join(base_dir,"rpv2-2023-06-language/data/EN")
en_dataset = DocumentDataset.read_json(en_dataset_path, add_filename=True)

Reading 37848 files


Curator uses the `Modify` module with `UnicodeReformatter` for text cleaning. We first define the output directory and the name of the text field:

In [5]:
# make directory for cleaned dataset
output_clean_dir = expand_outdir_and_mkdir(os.path.join(base_dir,"rpv2-2023-06-en-cleaned"))
# specify text field name and file type
input_text_field = "raw_content"
input_file_type = "jsonl"

In [6]:
t0 = time.time()
# specify cleaner
cleaner = nemo_curator.Modify(
    UnicodeReformatter(), 
    text_field=input_text_field
)

# clean dataset and write to disk
cleaned_dataset = cleaner(en_dataset)
cleaned_dataset.to_json(output_clean_dir, write_to_filename=True)
print(f"Text cleaning took {time.time()-t0} s")

Writing to disk complete for 37848 partitions
Text cleaning took 6349.983360290527 s
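
To make "improperly decoded unicode" concrete, the sketch below reproduces and reverses one common case: UTF-8 bytes that were mistakenly decoded as Latin-1. It handles only this single pattern with the standard library. `ftfy` relies on much broader heuristics, and the function name `fix_utf8_read_as_latin1` is hypothetical.

```python
def fix_utf8_read_as_latin1(text: str) -> str:
    """Undo one specific decoding mistake: UTF-8 bytes that were read as Latin-1.

    Returns the text unchanged when it does not fit that pattern.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Simulate the mistake: "café" becomes "cafÃ©" when its UTF-8 bytes are read as Latin-1.
broken = "café".encode("utf-8").decode("latin-1")
repaired = fix_utf8_read_as_latin1(broken)
assert broken == "cafÃ©"
assert repaired == "café"
```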


[Optional] Removing intermediate data to save disk space:

In [9]:
!rm -rf "/lustre/fsw/portfolios/coreai/users/yayu/data.fs5/rpv2-2023-06-language/data/EN"

# 5. Deduplication
<a id="dedup"></a>



## 5.1 Exact Deduplication

Exact dedup computes a hash of the raw text of each document. Documents with the same hash value are treated as exact duplicates, and all but one copy will be removed. Curator provides GPU-accelerated exact deduplication using RAPIDS.
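
The idea can be sketched on the CPU with the standard library: hash each document's text with MD5 and group document IDs by digest. This is only an illustration on toy data and not the GPU implementation in `ExactDuplicates`; the helper name `find_exact_duplicates` is hypothetical.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(docs):
    """Return {md5_hex: [doc ids]} for every hash shared by two or more documents."""
    by_hash = defaultdict(list)
    for doc_id, text in docs:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        by_hash[digest].append(doc_id)
    return {h: ids for h, ids in by_hash.items() if len(ids) > 1}


docs = [
    ("doc-0", "the quick brown fox"),
    ("doc-1", "the quick brown fox"),   # byte-for-byte copy of doc-0
    ("doc-2", "the quick brown fox."),  # differs by one character
]
duplicate_groups = find_exact_duplicates(docs)
print(duplicate_groups)
```

Note that `doc-2` is not reported: a single extra character changes the hash, which is why near-duplicates need the fuzzy approach.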

In [2]:
from nemo_curator.log import create_logger
from nemo_curator.modules import ExactDuplicates

def pre_imports():
    import cudf  # noqa: F401

In [4]:
scheduler_address = os.getenv('SCHEDULER_ADDRESS')
gpu_client = get_client(scheduler_address=scheduler_address)
print(f"Num Workers = {get_num_workers(gpu_client)}", flush=True)

gpu_client.run(pre_imports)
print("Pre imports complete")

Num Workers = 16
Pre imports complete


In [6]:
cleaned_dataset_path = os.path.join(base_dir,"rpv2-2023-06-en-cleaned")
log_dir = expand_outdir_and_mkdir(os.path.join(base_dir, "logs"))
input_id_field = 'id'
input_text_field = 'raw_content'
hash_method = 'md5'
output_dir = expand_outdir_and_mkdir(os.path.join(base_dir,"rpv2-2023-06-exact-dedup"))

In [7]:
t0 = time.time()
# Read the input dataset from the cleaned dataset dir
input_dataset = DocumentDataset.read_json(cleaned_dataset_path, backend='cudf')

# Perform exact dedup
exact_dups = ExactDuplicates(
    logger=log_dir,
    id_field=input_id_field,
    text_field=input_text_field,
    hash_method=hash_method,
    cache_dir=output_dir,
)
duplicates = exact_dups(dataset=input_dataset)
print(f"Exact dedup took:{time.time()-t0}")


Reading 37848 files
Exact dedup took:1275.6094808578491


Exact deduplication found 97,327,867 duplicated documents.

In [9]:
print(f"Number of exact duplicated file:{len(duplicates)}")

Number of exact duplicated file:97327867


Let's see the results of exact dedup:

In [10]:
duplicates_df = duplicates.df
duplicates_df.head()

Unnamed: 0,id,_hashes
0,rpv2-2023-06-1594900690,b31e61ba8cb85680f7acea426b9848fe
1,rpv2-2023-06-1004500292,72bb25bef9420164ac8bc86a2ae340ef
2,rpv2-2023-06-2727300658,0c5834608662294d3dfa64de71850448
3,rpv2-2023-06-1642700934,1a247f38a86b32e0a6162f892c80a198
4,rpv2-2023-06-0206000016,61a74bf725e1ba23c530a1e8fc71d554


In [11]:
duplicates_df.groupby('_hashes').agg({'id': 'count'}).head()

Unnamed: 0_level_0,id
_hashes,Unnamed: 1_level_1
7a724e20912f26144d90dbf74c6fe0ae,2
fab1be64dd1d1a20ec5e3a77b962a3e8,2
0d039804d82f3a375e19ca9cbb3d830a,2
e05b1c37967e7f4eec2392bd6e65b668,27
bb3f77234cb015a2c24710d22c0bfc57,2


In [12]:
duplicates_df[duplicates_df['_hashes'] == 'e05b1c37967e7f4eec2392bd6e65b668'].compute()

Unnamed: 0,id,_hashes
1036,rpv2-2023-06-2771406540,e05b1c37967e7f4eec2392bd6e65b668
1052,rpv2-2023-06-2443106203,e05b1c37967e7f4eec2392bd6e65b668
1063,rpv2-2023-06-0509306409,e05b1c37967e7f4eec2392bd6e65b668
1364,rpv2-2023-06-3063906432,e05b1c37967e7f4eec2392bd6e65b668
1490,rpv2-2023-06-2753207260,e05b1c37967e7f4eec2392bd6e65b668
1507,rpv2-2023-06-3001307073,e05b1c37967e7f4eec2392bd6e65b668
1538,rpv2-2023-06-0719006978,e05b1c37967e7f4eec2392bd6e65b668
1551,rpv2-2023-06-0700107191,e05b1c37967e7f4eec2392bd6e65b668
4793,rpv2-2023-06-0106826500,e05b1c37967e7f4eec2392bd6e65b668
4794,rpv2-2023-06-0117726560,e05b1c37967e7f4eec2392bd6e65b668


Let's verify if the documents with the same hash are exactly the same:

In [13]:
t0 = time.time()
dup_ex1 = input_dataset.df[input_dataset.df['id'] == 'rpv2-2023-06-2771406540'].compute()
print(f"Searching one duplicate took:{time.time()-t0}")

Searching one duplicate took:661.8512754440308


In [None]:
dup_ex1

In [None]:
print(dup_ex1.raw_content.iloc[0])

In [None]:
dup_ex2 = input_dataset.df[input_dataset.df['id'] == 'rpv2-2023-06-2443106203'].compute()
print(dup_ex2.raw_content.iloc[0])

In [None]:
dup_ex2

Now, we will remove the exact duplicates and write the remaining dataset to disk.

In [5]:
input_dataset = DocumentDataset.read_json(cleaned_dataset_path, add_filename=True, backend='cudf')
duplicates = DocumentDataset.read_parquet(os.path.join(output_dir,"_exact_duplicates.parquet"), backend='cudf')
duplicates_df = duplicates.df

Reading 37848 files
Reading 1 files


In [19]:
output_dir = expand_outdir_and_mkdir(os.path.join(base_dir,"rpv2-2023-06-exact-dup-removed"))

t0 = time.time()
docs_to_remove = duplicates_df.map_partitions(
    lambda x: x[x._hashes.duplicated(keep="first")]
)

# When there are few duplicates we can compute the results to a list and use `isin`.
result = input_dataset.df[
    ~input_dataset.df[input_id_field].isin(
        docs_to_remove[input_id_field].compute()
    )
]

write_to_disk(
    result,
    output_dir,
    write_to_filename=True,
    output_type='jsonl',
)

print(f"Removing exact duplicates took:{time.time()-t0}")

Writing to disk complete for 37848 partitions
Removing exact duplicates took:1563.168622970581


We can see that exact dedup removed 70,675,782 documents and we now have 1,017,635,738 documents left in the dataset.

In [20]:
len(docs_to_remove)

70675782

In [21]:
len(result)

1017635738
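
The duplicates table lists every member of a duplicate group (97,327,867 rows), whereas `duplicated(keep="first")` flags only the copies after the first one, which is why fewer documents (70,675,782) are removed. The sketch below mirrors that rule on toy `(id, hash)` rows. It works on a single list, while the notebook applies the rule per partition, and the helper name `ids_to_remove` is hypothetical.

```python
def ids_to_remove(rows):
    """Given (doc_id, hash) rows, return ids of every copy after the first per hash."""
    seen = set()
    remove = []
    for doc_id, digest in rows:
        if digest in seen:
            remove.append(doc_id)  # a later copy: drop it
        else:
            seen.add(digest)       # first copy: keep it
    return remove


rows = [("a", "h1"), ("b", "h1"), ("c", "h2"), ("d", "h1"), ("e", "h2")]
removed = ids_to_remove(rows)
print(removed)
```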

## 5.2 Fuzzy Deduplication

Fuzzy deduplication aims to find near-duplicated documents in our dataset. Near-duplicated documents are common in web crawl data due to plagiarism and mirror sites. Removing them can help improve the quality of trained models. In many cases, we can skip exact dedup and just perform fuzzy dedup as it will also find the exact duplicates. Thus, we will start with the cleaned dataset for fuzzy dedup.

Curator implements GPU-accelerated fuzzy deduplication based on the MinHash + LSH algorithm for finding similar documents across the dataset. Specifically, fuzzy deduplication includes six steps:

- Compute minhashes
- Locality-Sensitive Hashing (LSH)
- Map buckets
- Jaccard shuffle
- Jaccard compute
- Connected components


In [3]:
def pre_imports():
    import cudf  # noqa: F401

scheduler_address = os.getenv('SCHEDULER_ADDRESS')
gpu_client = get_client(scheduler_address=scheduler_address)
print(f"Num Workers = {get_num_workers(gpu_client)}", flush=True)

gpu_client.run(pre_imports)
print("Pre imports complete")

Num Workers = 16
Pre imports complete


### 5.2.1 Compute minhashes

First, we will compute the MinHash signature for each document. For this purpose, each document is represented by a set of n-grams. We apply a number of random hash functions to each element of the set. The minimum hash value produced by each hash function is recorded and becomes one component of the MinHash signature. Thus, the length of the MinHash signature equals the number of hash functions.
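
The description above can be sketched in pure Python. This is a toy illustration and not Curator's GPU implementation: it uses salted `hashlib.blake2b` as a stand-in for a family of random hash functions, and all function names and the default of 16 hashes are assumptions chosen for the example.

```python
import hashlib


def char_ngrams(text: str, n: int = 5) -> set:
    """Set of overlapping character n-grams; short texts yield the text itself."""
    if len(text) <= n:
        return {text}
    return {text[i : i + n] for i in range(len(text) - n + 1)}


def minhash_signature(text: str, num_hashes: int = 16, n: int = 5) -> list:
    """One minimum per seeded hash function, so len(signature) == num_hashes."""
    grams = char_ngrams(text, n)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                    "little",
                )
                for g in grams
            )
        )
    return signature


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


sig_a = minhash_signature("NeMo Curator curates pre-training data at scale.")
sig_b = minhash_signature("NeMo Curator curates pre-training data at scale!")
print(len(sig_a), estimate_jaccard(sig_a, sig_b))
```

Two documents that share most of their n-grams agree on most signature positions, so comparing short signatures stands in for comparing the full n-gram sets.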

In [9]:
from nemo_curator import MinHash

input_data_dir = os.path.join(base_dir,"rpv2-2023-06-en-cleaned")
seed = 42
minhash_length = 260
char_ngram = 5
log_dir = expand_outdir_and_mkdir(os.path.join(base_dir, "logs"))
id_field = 'id'
text_field = 'raw_content'
minshah_output_dir = expand_outdir_and_mkdir(os.path.join(base_dir,"rpv2-2023-06-minhash"))

In [10]:
files = get_all_files_paths_under(root=input_data_dir, recurse_subdirectories=False)
files = [f for f in files if f.endswith(".jsonl")]
df = read_data(
    files,
    file_type="jsonl",
    backend="cudf",
    files_per_partition=1,
    add_filename=False,
)[[id_field, text_field]]

Reading 37848 files


In [33]:
t0 = time.time()

# Run MinHash() on input data
minhasher = MinHash(
    seed=seed,
    num_hashes=minhash_length,
    char_ngrams=char_ngram,
    use_64bit_hash=False,
    logger=log_dir,
    id_field=id_field,
    text_field=text_field,
    cache_dir=minshah_output_dir
)

result = minhasher(DocumentDataset(df)).df

print(f"Computing minhashes took:{time.time()-t0}")

Computing minhashes took:5161.864866495132


We can see some example outputs from the minhash computation.

In [12]:
result.head()

Unnamed: 0,id,_minhash_signature
0,rpv2-2023-06-0000000000,"[56978, 157261, 839276, 103231, 51779, 396833,..."
1,rpv2-2023-06-0000100000,"[4644772, 2991701, 2571423, 12369524, 50603761..."
2,rpv2-2023-06-0000200000,"[1312196, 17635, 1520869, 3337920, 2052016, 10..."
3,rpv2-2023-06-0000300000,"[5374828, 2268627, 4903126, 2134671, 1828983, ..."
4,rpv2-2023-06-0000400000,"[4999022, 2320370, 2068984, 3469276, 621627, 5..."


### 5.2.2 Minhash LSH

`LSH()` implements the LSH algorithm, which includes the following steps:

- Divide the minhash signature array into X different portions (bands).

- For each portion, hash the minhash values into buckets. Each document will be assigned to X buckets.

- Documents within the same bucket are deemed similar. Since every document is assigned to X buckets, and two documents are deemed similar as long as they share one or more buckets, the result of LSH will contain more false positives than false negatives. The false positives will be filtered out in the following modules, namely Jaccard compute.
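
The banding steps above can be sketched as follows. This is a toy CPU illustration, not Curator's `LSH` class: the function names are hypothetical, MD5 over the band's values is an arbitrary choice of bucket hash, and the probability formula is the standard banding estimate, which assumes the signature positions behave independently. With the settings used in this notebook (260 hashes, 20 bands), each band covers 13 values.

```python
import hashlib


def lsh_buckets(signature, num_bands):
    """Split a MinHash signature into bands and hash each band to a bucket key."""
    if len(signature) % num_bands != 0:
        raise ValueError("signature length must be divisible by num_bands")
    rows = len(signature) // num_bands
    buckets = []
    for band in range(num_bands):
        chunk = signature[band * rows : (band + 1) * rows]
        digest = hashlib.md5(repr(chunk).encode("utf-8")).hexdigest()
        # Including the band index keeps different bands from sharing buckets.
        buckets.append((band, digest))
    return buckets


def candidate_probability(similarity, num_bands, rows_per_band):
    """Chance that two documents with the given Jaccard similarity share a bucket."""
    return 1.0 - (1.0 - similarity ** rows_per_band) ** num_bands


sig_a = list(range(260))
sig_b = list(range(13)) + [-1] * 247  # agrees with sig_a on the first band only
shared = set(lsh_buckets(sig_a, 20)) & set(lsh_buckets(sig_b, 20))
print(len(shared))
print(candidate_probability(0.8, num_bands=20, rows_per_band=13))
```

A single shared band is enough to make two documents candidates, which is why the later Jaccard step is needed to weed out false positives.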

In [14]:
from nemo_curator import LSH
from nemo_curator.utils.fuzzy_dedup_utils.id_mapping import convert_str_id_to_int

lsh_input_dir = os.path.join(base_dir,"rpv2-2023-06-minhash")
id_field = 'id'
output_bucket_dir = expand_outdir_and_mkdir(os.path.join(base_dir,"fuzzy-dedup-output-2023-06"))
num_bands = 20
buckets_per_shuffle = 1
minhash_field = '_minhash_signature'
minhash_length = 260
log_dir = os.path.join(base_dir, "logs")

In [3]:
t0 = time.time()

#Load MinHash output
df = dask_cudf.read_parquet(lsh_input_dir, blocksize="2GB", aggregate_files=True)
df = df.map_partitions(
    convert_str_id_to_int,
    id_column=id_field,
    meta=cudf.DataFrame(
        {minhash_field: [[1, 2, 3]], "doc_id": [1], "dataset_id": np.uint32(1)}
    ),
)

lsh = LSH(
    cache_dir=output_bucket_dir,
    num_hashes=minhash_length,
    num_buckets=num_bands,
    buckets_per_shuffle=buckets_per_shuffle,
    id_fields=["dataset_id", "doc_id"],
    minhash_field=minhash_field,
    logger=log_dir,
)

lsh_result = lsh(DocumentDataset(df))
print(f"LSH took {time.time()-t0} s")

LSH took 6116.864866495132 s


In [9]:
lsh_result.df.head()

Unnamed: 0,dataset_id,doc_id,_bucket_id
0,256213913,404307346,39666
1,256213913,993005579,456
2,256213913,25501850,11694
3,256213913,2092102624,18675
4,256213913,1677210450,53186


### 5.2.3 Map Buckets

After performing LSH, we process each bucket and calculate an approximation of the all-pairs Jaccard similarity in order to remove the false positive duplicates introduced by LSH. For this purpose, we randomly sample n "anchor" documents within each bucket and calculate their Jaccard similarity with every other document in the bucket.
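
The anchor idea can be pictured with a small sketch: rather than comparing every pair inside a bucket, each document is paired only with a few sampled anchors. This is a toy illustration on document ID strings, not the `_MapBuckets` implementation; the helper name `anchor_pairs`, the fixed seed, and the use of `random.sample` are assumptions.

```python
import random


def anchor_pairs(bucket_docs, num_anchors=2, seed=0):
    """Pair every document in a bucket with a few sampled anchor documents.

    Returns unordered pairs, so a bucket of size m yields on the order of
    num_anchors * m comparisons instead of m * (m - 1) / 2.
    """
    rng = random.Random(seed)
    anchors = rng.sample(bucket_docs, min(num_anchors, len(bucket_docs)))
    pairs = set()
    for doc in bucket_docs:
        for anchor in anchors:
            if doc != anchor:
                pairs.add(frozenset((doc, anchor)))
    return pairs


bucket = ["doc-0", "doc-1", "doc-2", "doc-3"]
pairs = anchor_pairs(bucket)
print(len(pairs))
```

For this four-document bucket the sketch produces five pairs instead of six. The saving grows quickly with bucket size, since the number of all-pairs comparisons is quadratic.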

In [11]:
from nemo_curator.modules.fuzzy_dedup import _MapBuckets
from nemo_curator.utils.fuzzy_dedup_utils.io_utils import (
    get_bucket_ddf_from_parquet_path,
    get_text_ddf_from_json_path_with_blocksize,
)

input_data_paths = [os.path.join(base_dir,"rpv2-2023-06-en-cleaned")]
num_files = None
text_ddf_blocksize = 256 #The block size for chunking jsonl files for text ddf in mb
id_field = 'id'
text_field = 'raw_content'
input_bucket_path = os.path.join(base_dir,"fuzzy-dedup-output-2023-06/_buckets.parquet")
input_bucket_field = '_bucket_id'
shuffle_type ='tasks'
log_dir = os.path.join(base_dir, "logs")
output_anchor_docs_with_bk_path = expand_outdir_and_mkdir(os.path.join(base_dir,"fuzzy-dedup-output-2023-06/anchor_docs_with_bk.parquet"))

In [12]:
# Read .jsonl input data
ddf_text = get_text_ddf_from_json_path_with_blocksize(
    input_data_paths=input_data_paths,
    num_files=num_files,
    blocksize=text_ddf_blocksize,
    id_column=id_field,
    text_column=text_field,
)

print(f"ddf_text.npartitions  = {ddf_text.npartitions}", flush=True)

Number of files being read for jaccard calculation = 37848
ddf_text.npartitions  = 21501


In [15]:
t0 = time.time()
num_workers = get_num_workers(gpu_client)

# Read "_buckets.parquet"
ddf_bk = get_bucket_ddf_from_parquet_path(
    input_bucket_path=input_bucket_path, 
    num_workers=num_workers
)

#Run _MapBuckets()
map_buckets = _MapBuckets(
    id_fields=["dataset_id", "doc_id"], 
    bucket_field=input_bucket_field, 
    logger=log_dir,
    text_field=text_field,
)

ddf_anchor_docs_with_bk = map_buckets.map_buckets_with_anchors(
    documents_df=ddf_text, 
    buckets_df=ddf_bk, 
    shuffle_type=shuffle_type
)

#Write to disk
ddf_anchor_docs_with_bk.to_parquet(
    output_anchor_docs_with_bk_path, 
    write_index=False
)

print(f"Mapping Bucket took {time.time()-t0} s")

Number of ddf_bk partitions = 102
Mapping Bucket took 711.1930673122406 s


In [16]:
ddf_anchor_docs_with_bk.head()

Unnamed: 0,dataset_id,doc_id,anchor_1_dataset_id,anchor_1_doc_id,anchor_0_dataset_id,anchor_0_doc_id,_output_partition_id
0,256213913,1440621805,256213913,520733492,256213913,2401230703,1461
1,256213913,821232404,256213913,371332453,256213913,821232404,3852
2,256213913,1787805617,256213913,1969113640,256213913,397634875,7811
3,256213913,658706900,256213913,658706900,256213913,675310236,3403
4,256213913,272735412,256213913,272735412,256213913,2250835581,5160


### 5.2.4 Jaccard Shuffle

We shuffle the documents based on their bucket assignments so that documents that need to be compared with each other (a document and its anchors) land in the same output partition. Each partition can then be processed independently, enabling efficient parallel processing and deduplication in subsequent steps.

In [23]:
from nemo_curator.modules.fuzzy_dedup import _Shuffle

log_dir = os.path.join(base_dir, "logs")
input_anchor_docs_with_bk_path = os.path.join(base_dir,"fuzzy-dedup-output-2023-06/anchor_docs_with_bk.parquet")
output_shuffled_docs_path = expand_outdir_and_mkdir(
    os.path.join(base_dir, "fuzzy-dedup-output-2023-06/shuffled_docs.parquet")
)
bucket_mapping_ddf_blocksize = 256
parts_per_worker = 16
bucket_parts_per_worker = 256
id_field = 'id'
text_field = 'raw_content'

In [24]:
t0 = time.time()

shuffle = _Shuffle(
    id_fields=["dataset_id", "doc_id"],
    text_field=text_field,
    int_to_str_id=id_field,
    logger=log_dir,
)

shuffle.shuffle_docs_on_buckets(
    documents_df=ddf_text,
    bucket_w_anchors_path=input_anchor_docs_with_bk_path,
    output_shuffled_docs_path=output_shuffled_docs_path,
    bucket_mapping_df_blocksize=bucket_mapping_ddf_blocksize,
    parts_per_worker=parts_per_worker,
    bucket_parts_per_worker=bucket_parts_per_worker,
    partition_on="_output_partition_id",
)

print(f"Jaccard Shuffle took {time.time()-t0} s")

  0%|          | 0/1 [00:00<?, ?it/s]


Started processing bucket-map partitions 0 through 102 of 102
Using 256 text partitions.
Starting text bytes aware shuffle
Will write 19751632 rows to disk
Text-df partition  256/21501 completed in 39.92343091964722
Using 256 text partitions.
Starting text bytes aware shuffle
Will write 21446926 rows to disk
Text-df partition  512/21501 completed in 42.80106043815613
Using 256 text partitions.
Starting text bytes aware shuffle
Will write 19507857 rows to disk
Text-df partition  768/21501 completed in 37.60182189941406
Using 256 text partitions.
Starting text bytes aware shuffle
Will write 20414245 rows to disk
Text-df partition  1024/21501 completed in 67.66048169136047
Using 256 text partitions.
Starting text bytes aware shuffle
Will write 20284319 rows to disk
Text-df partition  1280/21501 completed in 40.508766412734985
Using 256 text partitions.
Starting text bytes aware shuffle
Will write 21424549 rows to disk
Text-df partition  1536/21501 completed in 40.96135187149048
Using 256

Task exception was never retrieved
future: <Task finished name='Task-53934' coro=<Client._gather.<locals>.wait() done, defined at /usr/local/lib/python3.10/dist-packages/distributed/client.py:2197> exception=AllExit()>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/distributed/client.py", line 2206, in wait
    raise AllExit()
distributed.client.AllExit


Text-df partition  5376/21501 completed in 44.53200340270996
Using 256 text partitions.
Starting text bytes aware shuffle
Will write 21130783 rows to disk
Text-df partition  5632/21501 completed in 54.21355414390564
Using 256 text partitions.
Starting text bytes aware shuffle
Will write 21283077 rows to disk
Text-df partition  5888/21501 completed in 41.81457304954529
Using 256 text partitions.
Starting text bytes aware shuffle
Will write 22384930 rows to disk
Text-df partition  6144/21501 completed in 45.46053504943848
Using 256 text partitions.
Starting text bytes aware shuffle
Will write 20776364 rows to disk
Text-df partition  6400/21501 completed in 40.972795248031616
Using 256 text partitions.
Starting text bytes aware shuffle
Will write 20072714 rows to disk
Text-df partition  6656/21501 completed in 43.9665105342865
Using 256 text partitions.
Starting text bytes aware shuffle
Will write 21287119 rows to disk
Text-df partition  6912/21501 completed in 46.75365734100342
Using 256

100%|██████████| 1/1 [1:07:49<00:00, 4069.43s/it]

Jaccard Shuffle took 4069.6330242156982 s





We can visualize the jaccard shuffle results for a single partition:

In [None]:
jaccard_shuffle_res = dd.read_parquet(os.path.join(output_shuffled_docs_path,"_output_partition_id=0"))
jaccard_shuffle_res.head()

### 5.2.5 Jaccard Compute

Now that the Jaccard pairs have been sampled, we can compute the Jaccard similarity score for all pairs.
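
As a reference for what is being computed, the Jaccard similarity of two documents is the size of the intersection of their n-gram sets divided by the size of their union. The sketch below uses character 5-grams on toy strings; it is a plain-Python illustration, not the GPU code in `JaccardSimilarity`, and the function names are hypothetical.

```python
def char_ngram_set(text: str, n: int = 5) -> set:
    """Set of overlapping character n-grams; short texts yield the text itself."""
    if len(text) <= n:
        return {text}
    return {text[i : i + n] for i in range(len(text) - n + 1)}


def jaccard_similarity(text_a: str, text_b: str, n: int = 5) -> float:
    """Intersection over union of the two documents' character n-gram sets."""
    a, b = char_ngram_set(text_a, n), char_ngram_set(text_b, n)
    return len(a & b) / len(a | b)


# The two strings share 3 of their 5 distinct 5-grams.
score = jaccard_similarity("abcdefgh", "abcdefgx")
print(score)
```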

In [5]:
from nemo_curator.modules.fuzzy_dedup import JaccardSimilarity

id_field = 'id'
text_field = 'raw_content'
ngram_size = 5
shuffled_docs_path = os.path.join(base_dir, "fuzzy-dedup-output-2023-06/shuffled_docs.parquet")
jaccard_results_path = expand_outdir_and_mkdir(
    os.path.join(base_dir, "fuzzy-dedup-output-2023-06/jaccard_similarity_results.parquet")
)

In [6]:
t0 = time.time()
jaccard = JaccardSimilarity(
    id_field=id_field ,
    text_field=text_field,
    anchor_id_fields=[f"anchor_{i}_{id_field}" for i in range(2)],
    ngram_width=ngram_size,
)

# Run actual computation
result_df = jaccard.jaccard_compute(shuffled_docs_path)

result_df.to_parquet(
    jaccard_results_path,
    write_index=False,
    write_metadata_file=False,
)

print(f"Jaccard Computing+Writing took {time.time() - t0} seconds")

Jaccard Computing+Writing took 5886.298990488052 seconds


In [7]:
jaccard_compute_res = dd.read_parquet(jaccard_results_path)
jaccard_compute_res.head()

Unnamed: 0,id_x,id_y,jaccard
0,256213913-1894624904,256213913-2346524957,0.956566
1,256213913-1785625062,256213913-2099725675,0.973642
2,256213913-1350425062,256213913-2930125142,1.0
3,256213913-1324822,256213913-1384203609,0.988306
4,256213913-1775024761,256213913-1540119774,0.906369


### 5.2.6 Connected Component

After all buckets have been processed and duplicates (at the threshold) have been approximately discovered, we construct a sparse document graph and find its connected components (using scipy). Each connected component represents a set of documents that we consider similar enough to be duplicates, and from which we select a single representative.
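
The grouping step can be sketched with a small union-find structure in place of a graph library. This is a toy illustration, not the `ConnectedComponents` implementation: the function name is hypothetical, and treating a score equal to the threshold as a match is an assumption.

```python
def connected_components(pairs, threshold=0.8):
    """Group document ids linked by pairs whose Jaccard score meets the threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for doc_a, doc_b, score in pairs:
        # Register both ids even when the edge itself is dropped.
        root_a, root_b = find(doc_a), find(doc_b)
        if score >= threshold and root_a != root_b:  # assumption: ">=" counts as a match
            parent[root_b] = root_a

    groups = {}
    for doc in parent:
        groups.setdefault(find(doc), []).append(doc)
    return sorted(sorted(g) for g in groups.values())


pairs = [("a", "b", 0.95), ("b", "c", 0.85), ("c", "d", 0.40), ("e", "f", 0.90)]
components = connected_components(pairs)
print(components)

# Keep one representative per component; the rest are duplicates to remove.
to_remove = [doc for group in components for doc in group[1:]]
print(to_remove)
```

Note that `a` and `c` end up in the same group even though they were never compared directly: similarity is propagated through `b`.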

In [4]:
from nemo_curator.modules.fuzzy_dedup import ConnectedComponents

cache_dir = expand_outdir_and_mkdir(
    os.path.join(base_dir, "fuzzy-dedup-output-2023-06/cc-cache")
)
jaccard_pairs_path = os.path.join(base_dir, "fuzzy-dedup-output-2023-06/jaccard_similarity_results.parquet")
id_field = 'id'
jaccard_threshold = 0.8
output_path = expand_outdir_and_mkdir(
    os.path.join(base_dir, "fuzzy-dedup-output-2023-06/connected_components.parquet")
)

In [5]:
t0 = time.time()
components_stage = ConnectedComponents(
    cache_dir=cache_dir,
    jaccard_pairs_path=jaccard_pairs_path,
    id_column=id_field,
    convert_str_ids=True,
    jaccard_threshold=jaccard_threshold,
)
components_stage.cc_workflow(output_path=output_path)
print(f"Connected Component took {time.time()-t0} seconds")

batch_id = 0/33, time = 10.98209285736084
batch_id = 1/33, time = 7.240729331970215
batch_id = 2/33, time = 11.506417274475098
batch_id = 3/33, time = 10.567672729492188
batch_id = 4/33, time = 4.118508815765381
batch_id = 5/33, time = 11.475081443786621
batch_id = 6/33, time = 4.485937118530273
batch_id = 7/33, time = 7.7934770584106445
batch_id = 8/33, time = 12.659213781356812
batch_id = 9/33, time = 10.357794523239136
batch_id = 10/33, time = 15.211389780044556
batch_id = 11/33, time = 11.50840425491333
batch_id = 12/33, time = 6.360927104949951
batch_id = 13/33, time = 6.977228403091431
batch_id = 14/33, time = 14.863914489746094
batch_id = 15/33, time = 8.78640341758728
batch_id = 16/33, time = 17.97274613380432
batch_id = 17/33, time = 15.662312030792236
batch_id = 18/33, time = 12.669589042663574
batch_id = 19/33, time = 11.13182783126831
batch_id = 20/33, time = 4.032534837722778
batch_id = 21/33, time = 10.532259702682495
batch_id = 22/33, time = 11.531543016433716
batch_id =

Let's check the results of the connected components step. We can see that 239,037,733 documents are identified as duplicates to be removed.

In [3]:
output_path = os.path.join(base_dir, "fuzzy-dedup-output-2023-06/connected_components.parquet")
# Repartition to a single partition so that the whole dataset does not need to be shuffled
cc_result = dask_cudf.read_parquet(output_path, split_row_groups=False).repartition(npartitions=1)

# Set 'group' as the index and shuffle to ensure all same 'group' values are in the same partition
cc_result = cc_result.set_index('group', shuffle='tasks')

# Define a function to assign cumulative counts and filter duplicates
def assign_cumcount(df):
    df['cumcount'] = df.groupby(level=0).cumcount()
    df = df[df['cumcount'] >= 1]
    df = df.drop(columns=['cumcount'])
    return df

# Apply the function to each partition
docs_to_remove = cc_result.map_partitions(assign_cumcount, meta=cc_result)

# Reset the index if necessary
docs_to_remove = docs_to_remove.reset_index()

docs_to_remove = docs_to_remove[["dataset_id", "doc_id"]]
docs_to_remove = docs_to_remove.rename(columns={"dataset_id":"to_remove_dataset_id", "doc_id":"to_remove_doc_id"})
docs_to_remove = docs_to_remove.reset_index(drop=True).persist()
_ = wait(docs_to_remove)
del _ 

print("docs_to_remove", len(docs_to_remove))

docs_to_remove 239037733
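
The cumulative-count trick in the cell above keeps the first document of each group and marks every later one for removal. A plain-Python sketch of the same selection rule, with a hypothetical helper name and toy group labels, looks like this:

```python
def select_docs_to_remove(rows):
    """rows: iterable of (group, doc_id) pairs.

    The first document seen in a group is kept as its representative
    (cumcount == 0); every later document in that group is returned
    for removal (cumcount >= 1).
    """
    seen_groups = set()
    to_remove = []
    for group, doc_id in rows:
        if group in seen_groups:
            to_remove.append(doc_id)
        else:
            seen_groups.add(group)
    return to_remove


# Toy data: two groups with three and two members.
rows = [("g1", "a"), ("g1", "b"), ("g2", "c"), ("g1", "d"), ("g2", "e")]
print(select_docs_to_remove(rows))
```

Because exactly one document survives per group, the number of removed documents always equals the number of documents minus the number of groups.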


We can examine some example duplicates.

In [4]:
cc_grouped = cc_result.groupby('group').agg({'id': 'count'})
cc_grouped.head()

Unnamed: 0_level_0,id
group,Unnamed: 1_level_1
123501402,27
83259859,2
266079136,3
119886209,6888
221343674,21


For example, let's look into group "119886209".

In [5]:
dup_group = cc_result[cc_result['group'] == 119886209].compute()
dup_group.head()

Unnamed: 0,dataset_id,doc_id,group,id
3,256213913,469622202,119886209,rpv2-2023-06-0469622202
37437,256213913,501608788,119886209,rpv2-2023-06-0501608788
60404,256213913,2341629062,119886209,rpv2-2023-06-2341629062
81405,256213913,1511229746,119886209,rpv2-2023-06-1511229746
148765,256213913,2369426855,119886209,rpv2-2023-06-2369426855


We can examine the first five documents in this component:

In [4]:
# read input dataset
input_data_dir = os.path.join(base_dir, "rpv2-2023-06-en-cleaned")
input_dataset = DocumentDataset.read_json(input_data_dir, add_filename=True)

Reading 37848 files


Let's visualize the content of these documents and see if they are similar.

In [8]:
t0 = time.time()
dup_ids = ['rpv2-2023-06-0469622202', 'rpv2-2023-06-0501608788', 'rpv2-2023-06-2341629062','rpv2-2023-06-1511229746','rpv2-2023-06-2369426855'] 
dup_examples = input_dataset.df[input_dataset.df['id'].isin(dup_ids)].compute()
print(f"Finding near duplicates with specific IDs took {time.time()-t0} seconds")

Finding near duplicates with specific IDs took 882.7411408424377 seconds


In [None]:
dup_examples

In [None]:
print('Example duplicate 1\n' + dup_examples.raw_content.iloc[0])
print('\n\nExample duplicate 2\n' + dup_examples.raw_content.iloc[1])
print('\n\nExample duplicate 3\n' + dup_examples.raw_content.iloc[2])
print('\n\nExample duplicate 4\n' + dup_examples.raw_content.iloc[3])

### 5.2.7 Duplicates Removal

Next, we remove the identified duplicates from the dataset. We first convert the string ID in the input dataset into integer `doc_id` and `dataset_id` columns.

In [3]:
from helper import convert_str_id_to_int

input_dataset = DocumentDataset.read_json(os.path.join(base_dir, "rpv2-2023-06-en-cleaned"), backend="cudf")
input_df = input_dataset.df[['raw_content','id']]
meta = input_df._meta
meta['doc_id']=np.int64([0])
meta['dataset_id']=np.uint32([0])
input_df = input_df.map_partitions(
    convert_str_id_to_int,
    id_column="id",
    meta=meta,
)

Reading 37848 files
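
The body of `convert_str_id_to_int` is not shown in this section. Judging by the outputs above (`rpv2-2023-06-0469622202` appears with `doc_id=469622202` and a shared `dataset_id`), it appears to split the string ID at the last hyphen, parse the suffix as an integer, and hash the prefix. The sketch below mimics that behavior on a single string; `split_str_id` is a hypothetical name, and `zlib.crc32` is only a stand-in 32-bit hash, so the `dataset_id` values it produces will not match the ones in the tables.

```python
import zlib


def split_str_id(str_id: str):
    """Split 'prefix-NNNN' into an integer (dataset_id, doc_id) pair."""
    prefix, doc_part = str_id.rsplit("-", 1)
    # Stand-in hash (assumption): any stable 32-bit hash of the prefix works here.
    dataset_id = zlib.crc32(prefix.encode("utf-8"))
    return dataset_id, int(doc_part)


dataset_id, doc_id = split_str_id("rpv2-2023-06-0469622202")
print(dataset_id, doc_id)
```

All documents from the same snapshot share a prefix, so they receive the same `dataset_id`, and the pair `(dataset_id, doc_id)` can be used as an integer join key.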


Then, we will perform a merge between the `input_df` and the `docs_to_remove` on the IDs and drop the fuzzy duplicates.

In [6]:
dedup_output_dir = expand_outdir_and_mkdir(os.path.join(base_dir, "rpv2-2023-06-deduped"))
deduped_df = input_df.merge(docs_to_remove,
                             left_on=['doc_id','dataset_id'],
                             right_on=["to_remove_doc_id", "to_remove_dataset_id"],
                             how='left')

deduped_df = deduped_df[deduped_df['to_remove_doc_id'].isna()].drop(columns=['to_remove_doc_id', "to_remove_dataset_id"]).reset_index(drop=True)

t0 = time.time()
deduped_df.to_parquet(dedup_output_dir)
print(f"Removing duplicates and writing deduped dataset took {time.time()-t0} seconds")

Removing duplicates and writing deduped dataset took 1241.3191509246826 seconds
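
The left merge followed by the `isna()` filter is an anti-join: a row from the input survives only if its key finds no match in `docs_to_remove`. A minimal pure-Python sketch of the same logic, with a hypothetical helper name and toy rows, is:

```python
def drop_marked_rows(input_rows, rows_to_remove):
    """Keep input rows whose (doc_id, dataset_id) key is absent from rows_to_remove."""
    remove_keys = {
        (row["to_remove_doc_id"], row["to_remove_dataset_id"]) for row in rows_to_remove
    }
    return [
        row for row in input_rows
        if (row["doc_id"], row["dataset_id"]) not in remove_keys
    ]


# Toy rows; the IDs are made up for illustration.
input_rows = [
    {"doc_id": 1, "dataset_id": 7, "raw_content": "first"},
    {"doc_id": 2, "dataset_id": 7, "raw_content": "near copy of first"},
    {"doc_id": 3, "dataset_id": 7, "raw_content": "unrelated"},
]
rows_to_remove = [{"to_remove_doc_id": 2, "to_remove_dataset_id": 7}]

print(drop_marked_rows(input_rows, rows_to_remove))
```

Matching on both columns matters: the same `doc_id` may occur in two different snapshots, and only the marked `(doc_id, dataset_id)` combination should be dropped.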


To verify the results, we can confirm that 849,273,787 documents remain out of the 1,088,311,520 in the input dataset, which means 239,037,733 duplicates were removed.

In [6]:
len(deduped_df)

849273787

In [7]:
len(input_df)

1088311520

## 5.3 Inter-snapshot Deduplication

So far we have deduplicated a single snapshot from rpv2. Pre-training datasets include multiple snapshots, so we will often need to perform inter-snapshot deduplication. For this tutorial, we will demonstrate deduplication across two snapshots as an example.

We first repeated all of the steps above for a second snapshot, `2023-14`, then combined the two deduplicated datasets and stored the result in `rpv2-2023-06-and-14-deduped`.

Next, we will perform the fuzzy deduplication on the combined dataset.

### 5.3.1 Compute Minhash

In [4]:
from nemo_curator import MinHash
from nemo_curator import LSH
from nemo_curator.modules.fuzzy_dedup import _MapBuckets
from nemo_curator.modules.fuzzy_dedup import _Shuffle
from nemo_curator.modules.fuzzy_dedup import ConnectedComponents
from nemo_curator.modules.fuzzy_dedup import JaccardSimilarity

from nemo_curator.utils.file_utils import reshard_jsonl
from nemo_curator.utils.fuzzy_dedup_utils.id_mapping import convert_str_id_to_int
from nemo_curator.utils.fuzzy_dedup_utils.io_utils import (
    get_bucket_ddf_from_parquet_path,
    get_text_ddf_from_json_path_with_blocksize,
)

In [23]:
seed = 42
minhash_length = 260
char_ngram = 5
log_dir = expand_outdir_and_mkdir(os.path.join(base_dir, "logs"))
id_field = 'id'
text_field = 'raw_content'
minshah_output_dir = expand_outdir_and_mkdir(os.path.join(base_dir,"rpv2-2023-06-and-14-minhash"))

In [9]:
input_data_dir = os.path.join(base_dir,"rpv2-2023-06-and-14-deduped")

files = []
for file in os.listdir(input_data_dir):
    file_path = os.path.join(input_data_dir, file)
    if file.endswith('.part'):
        # Rename ".part" output files to ".jsonl" so they can be read as JSONL
        new_file_path = os.path.join(input_data_dir, file.replace('.part', '.jsonl'))
        os.rename(file_path, new_file_path)
        file_path = new_file_path
    files.append(file_path)


In [19]:
files = [f for f in files if f.endswith(".jsonl")]
df = read_data(
    files,
    file_type="jsonl",
    backend="cudf",
    files_per_partition=2,
    add_filename=False,
)[[id_field, text_field]]

Reading 72797 files


In [24]:
t0 = time.time()

# Run MinHash() on input data
minhasher = MinHash(
    seed=seed,
    num_hashes=minhash_length,
    char_ngrams=char_ngram,
    use_64bit_hash=False,
    logger=log_dir,
    id_field=id_field,
    text_field=text_field,
    cache_dir=minshah_output_dir
)

result = minhasher(DocumentDataset(df)).df

print(f"Computing minhashes took:{time.time()-t0}")

Computing minhashes took:6115.702769517899


In [25]:
result.head()

Unnamed: 0,id,_minhash_signature
0,rpv2-2023-06-0678400000,"[36422228, 15993596, 3538361, 16103012, 194100..."
1,rpv2-2023-06-0678500000,"[34662, 17635, 1112347, 293654, 313382, 160184..."
2,rpv2-2023-06-0678600000,"[15076006, 1801689, 3181854, 2949398, 5699436,..."
3,rpv2-2023-06-0678700000,"[13528976, 2438382, 26260517, 26187347, 249748..."
4,rpv2-2023-06-0678800000,"[2550974, 157261, 1536526, 1169030, 576861, 10..."
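
Each `_minhash_signature` is a fixed-length summary of a document's n-gram set: for every hash function, only the minimum hash value over all n-grams is kept. The sketch below shows the idea on the CPU. It is not the library's implementation: the function names, the use of `zlib.crc32` as the base hash, and the linear permutation family are all assumptions made for illustration.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # modulus for the illustrative hash family


def char_ngrams(text, width=5):
    if len(text) < width:
        return {text}
    return {text[i:i + width] for i in range(len(text) - width + 1)}


def minhash_signature(text, num_hashes=260, width=5, seed=42):
    """Return num_hashes minimum hash values over the text's character n-grams."""
    rng = random.Random(seed)  # the same seed yields the same hash functions
    coeffs = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]
    base = [zlib.crc32(gram.encode("utf-8")) for gram in char_ngrams(text, width)]
    return [min((a * x + b) % PRIME for x in base) for a, b in coeffs]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "NeMo Curator removes near-duplicate documents at scale."
doc_b = "NeMo Curator removes near-duplicate documents at scale!"
sig_a, sig_b = minhash_signature(doc_a), minhash_signature(doc_b)
print(len(sig_a), estimate_jaccard(sig_a, sig_b))
```

The appeal of this representation is that two signatures can be compared position by position without revisiting the original text, which is what the LSH step relies on.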


### 5.3.2 Minhash LSH

In [7]:
lsh_input_dir = os.path.join(base_dir,"rpv2-2023-06-and-14-minhash")
id_field = 'id'
output_bucket_dir = expand_outdir_and_mkdir(os.path.join(base_dir,"fuzzy-dedup-output-2023-06-and-14"))
num_bands = 20
buckets_per_shuffle = 1
minhash_field = '_minhash_signature'
minhash_length = 260
log_dir = os.path.join(base_dir, "logs")

In [8]:
t0 = time.time()

#Load MinHash output
df = dask_cudf.read_parquet(lsh_input_dir, blocksize="2GB", aggregate_files=True)
df = df.map_partitions(
    convert_str_id_to_int,
    id_column=id_field,
    meta=cudf.DataFrame(
        {minhash_field: [[1, 2, 3]], "doc_id": [1], "dataset_id": np.uint32(1)}
    ),
)

lsh = LSH(
    cache_dir=output_bucket_dir,
    num_hashes=minhash_length,
    num_buckets=num_bands,
    buckets_per_shuffle=buckets_per_shuffle,
    id_fields=["dataset_id", "doc_id"],
    minhash_field=minhash_field,
    logger=log_dir,
)

lsh_result = lsh(DocumentDataset(df))
print(f"LSH took {time.time()-t0} s")

LSH took 10536.635195493698 s


In [10]:
lsh_result.df.head()

Unnamed: 0,dataset_id,doc_id,_bucket_id
0,256213913,2480637085,74400
1,256213913,2079208983,88082
2,256213913,1142812586,7198
3,4217914658,3589401712,54808
4,256213913,1827931650,58134
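
With `minhash_length = 260` and `num_bands = 20`, each band covers 13 consecutive hash values, and two documents become candidates when they agree on all 13 values in at least one band. The following sketch illustrates that banding scheme and the resulting candidate probability. The function names are hypothetical, and using the raw band tuple as the bucket key is a simplification; the library assigns integer `_bucket_id` values instead.

```python
from collections import defaultdict


def lsh_buckets(signatures, num_bands=20):
    """signatures: dict of doc_id -> list of ints. Returns buckets holding 2+ documents."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        rows = len(sig) // num_bands  # hash values per band
        for band in range(num_bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc_id)
    return [docs for docs in buckets.values() if len(docs) > 1]


def candidate_probability(similarity, num_bands=20, rows_per_band=13):
    """Probability that two documents share at least one band: 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - similarity ** rows_per_band) ** num_bands


# Toy signatures: "a" and "b" differ in one value (band 0 only); "c" differs everywhere.
sig_a = list(range(260))
sig_b = [-1] + sig_a[1:]
sig_c = list(range(1000, 1260))
shared = lsh_buckets({"a": sig_a, "b": sig_b, "c": sig_c})
print(len(shared), candidate_probability(0.8), candidate_probability(0.95))
```

More bands with fewer rows each make the scheme more permissive, while fewer bands with more rows each make it stricter; the Jaccard computation later verifies the candidates exactly.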


### 5.3.3 Map Buckets

In [6]:
input_data_paths = [os.path.join(base_dir,"rpv2-2023-06-and-14-deduped")]
num_files = None
text_ddf_blocksize = 256 #The block size for chunking jsonl files for text ddf in mb
id_field = 'id'
text_field = 'raw_content'
input_bucket_path = os.path.join(base_dir,"fuzzy-dedup-output-2023-06-and-14/_buckets.parquet")
input_bucket_field = '_bucket_id'
shuffle_type ='tasks'
log_dir = os.path.join(base_dir, "logs")
output_anchor_docs_with_bk_path = expand_outdir_and_mkdir(os.path.join(base_dir,"fuzzy-dedup-output-2023-06-and-14/anchor_docs_with_bk.parquet"))

In [7]:
# Read .jsonl input data
ddf_text = get_text_ddf_from_json_path_with_blocksize(
    input_data_paths=input_data_paths,
    num_files=num_files,
    blocksize=text_ddf_blocksize,
    id_column=id_field,
    text_column=text_field,
)

print(f"ddf_text.npartitions  = {ddf_text.npartitions}", flush=True)

Number of files being read for jaccard calculation = 72797
ddf_text.npartitions  = 23876


In [14]:
t0 = time.time()
num_workers = get_num_workers(gpu_client)

# Read "_buckets.parquet"
ddf_bk = get_bucket_ddf_from_parquet_path(
    input_bucket_path=input_bucket_path, 
    num_workers=num_workers
)

#Run _MapBuckets()
map_buckets = _MapBuckets(
    id_fields=["dataset_id", "doc_id"], 
    bucket_field=input_bucket_field, 
    logger=log_dir,
    text_field=text_field,
)

ddf_anchor_docs_with_bk = map_buckets.map_buckets_with_anchors(
    documents_df=ddf_text, 
    buckets_df=ddf_bk, 
    shuffle_type=shuffle_type
)

#Write to disk
ddf_anchor_docs_with_bk.to_parquet(
    output_anchor_docs_with_bk_path, 
    write_index=False
)

print(f"Mapping Bucket took {time.time()-t0} s")

Number of ddf_bk partitions = 54
Mapping Bucket took 1034.9348919391632 s


In [15]:
ddf_anchor_docs_with_bk.head()

Unnamed: 0,dataset_id,doc_id,anchor_1_dataset_id,anchor_1_doc_id,anchor_0_dataset_id,anchor_0_doc_id,_output_partition_id
0,4217914658,518211850,4217914658,518211850,256213913,491920892,2004
1,4217914658,6364303356,256213913,2308804621,4217914658,6364303356,4246
2,256213913,2103535708,4217914658,1208111155,256213913,2103535708,4003
3,256213913,1359208912,4217914658,6342510538,256213913,1359208912,3738
4,256213913,162316349,256213913,162316349,4217914658,1033014280,4258


### 5.3.4 Jaccard Shuffle

In [4]:
log_dir = os.path.join(base_dir, "logs")
input_anchor_docs_with_bk_path = os.path.join(base_dir,"fuzzy-dedup-output-2023-06-and-14/anchor_docs_with_bk.parquet")
output_shuffled_docs_path = expand_outdir_and_mkdir(
    os.path.join(base_dir, "fuzzy-dedup-output-2023-06-and-14/shuffled_docs.parquet")
)
bucket_mapping_ddf_blocksize = 256
parts_per_worker = 16
bucket_parts_per_worker = 256
id_field = 'id'
text_field = 'raw_content'

In [8]:
t0 = time.time()

shuffle = _Shuffle(
    id_fields=["dataset_id", "doc_id"],
    text_field=text_field,
    int_to_str_id=id_field,
    logger=log_dir,
)

shuffle.shuffle_docs_on_buckets(
    documents_df=ddf_text,
    bucket_w_anchors_path=input_anchor_docs_with_bk_path,
    output_shuffled_docs_path=output_shuffled_docs_path,
    bucket_mapping_df_blocksize=bucket_mapping_ddf_blocksize,
    parts_per_worker=parts_per_worker,
    bucket_parts_per_worker=bucket_parts_per_worker,
    partition_on="_output_partition_id",
)

print(f"Jaccard Shuffle took {time.time()-t0} s")

  0%|          | 0/1 [00:00<?, ?it/s]


Started processing bucket-map partitions 0 through 54 of 54
Using 256 text partitions.
Starting text bytes aware shuffle
Will write 4620819 rows to disk
Text-df partition  256/23876 completed in 105.13463497161865
Using 256 text partitions.
Starting text bytes aware shuffle
Will write 4520986 rows to disk
Text-df partition  512/23876 completed in 100.3475558757782
Using 256 text partitions.
Starting text bytes aware shuffle
Will write 5232824 rows to disk
Text-df partition  768/23876 completed in 56.71416783332825
Using 256 text partitions.
Starting text bytes aware shuffle
Will write 4700161 rows to disk
Text-df partition  1024/23876 completed in 27.45123529434204
Using 256 text partitions.
Starting text bytes aware shuffle
Will write 4638892 rows to disk
Text-df partition  1280/23876 completed in 26.144277334213257
Using 256 text partitions.
Starting text bytes aware shuffle
Will write 4973176 rows to disk
Text-df partition  1536/23876 completed in 28.32722544670105
Using 256 text p

100%|██████████| 1/1 [49:22<00:00, 2962.52s/it]

Jaccard Shuffle took 2963.7552287578583 s





### 5.3.5 Jaccard Compute

In [9]:
id_field = 'id'
text_field = 'raw_content'
ngram_size = 5
shuffled_docs_path = os.path.join(base_dir, "fuzzy-dedup-output-2023-06-and-14/shuffled_docs.parquet")
jaccard_results_path = expand_outdir_and_mkdir(
    os.path.join(base_dir, "fuzzy-dedup-output-2023-06-and-14/jaccard_similarity_results.parquet")
)

In [10]:
t0 = time.time()
jaccard = JaccardSimilarity(
    id_field=id_field,
    text_field=text_field,
    anchor_id_fields=[f"anchor_{i}_{id_field}" for i in range(2)],
    ngram_width=ngram_size,
)

# Run actual computation
result_df = jaccard.jaccard_compute(shuffled_docs_path)

result_df.to_parquet(
    jaccard_results_path,
    write_index=False,
    write_metadata_file=False,
)

print(f"Jaccard Computing+Writing took {time.time() - t0} seconds")

Jaccard Computing+Writing took 1300.0965530872345 seconds


### 5.3.6 Connected Component

In [5]:


cache_dir = expand_outdir_and_mkdir(
    os.path.join(base_dir, "fuzzy-dedup-output-2023-06-and-14/cc-cache")
)
jaccard_pairs_path = os.path.join(base_dir, "fuzzy-dedup-output-2023-06-and-14/jaccard_similarity_results.parquet")
id_field = 'id'
jaccard_threshold = 0.8
output_path = expand_outdir_and_mkdir(
    os.path.join(base_dir, "fuzzy-dedup-output-2023-06-and-14/connected_components.parquet")
)

In [6]:
t0 = time.time()
components_stage = ConnectedComponents(
    cache_dir=cache_dir,
    jaccard_pairs_path=jaccard_pairs_path,
    id_column=id_field,
    convert_str_ids=True,
    jaccard_threshold=jaccard_threshold,
)
components_stage.cc_workflow(output_path=output_path)
print(f"Connected Component took {time.time()-t0} seconds")

batch_id = 0/14, time = 4.411345481872559
batch_id = 1/14, time = 3.727839469909668
batch_id = 2/14, time = 4.708456754684448
batch_id = 3/14, time = 4.044265031814575
batch_id = 4/14, time = 4.739339113235474
batch_id = 5/14, time = 3.8557491302490234
batch_id = 6/14, time = 3.597414016723633
batch_id = 7/14, time = 4.3511903285980225
batch_id = 8/14, time = 3.7585947513580322
batch_id = 9/14, time = 3.653388738632202
batch_id = 10/14, time = 3.41691517829895
batch_id = 11/14, time = 4.114740610122681
batch_id = 12/14, time = 3.8741345405578613
batch_id = 13/14, time = 0.680595874786377
# of groups 222092448
# of docs removed 81764804
assert num_nodes:303857252==labels_df:303857252 passed
Connected Component took 117.0 seconds



### 5.3.7 Duplicates Removal

From the output of the Connected Component step, we can see that inter-snapshot dedup found 81,764,804 duplicates.

In [4]:
output_path = os.path.join(base_dir, "fuzzy-dedup-output-2023-06-and-14/connected_components.parquet")
# Repartition to a single partition so that the whole dataset does not need to be shuffled
cc_result = dask_cudf.read_parquet(output_path, split_row_groups=False).repartition(npartitions=1)

# Set 'group' as the index and shuffle to ensure all same 'group' values are in the same partition
cc_result = cc_result.set_index('group', shuffle='tasks')

# Define a function to assign cumulative counts and filter duplicates
def assign_cumcount(df):
    df['cumcount'] = df.groupby(level=0).cumcount()
    df = df[df['cumcount'] >= 1]
    df = df.drop(columns=['cumcount'])
    return df

# Apply the function to each partition
docs_to_remove = cc_result.map_partitions(assign_cumcount, meta=cc_result)

# Reset the index if necessary
docs_to_remove = docs_to_remove.reset_index()

docs_to_remove = docs_to_remove[["dataset_id", "doc_id"]]
docs_to_remove = docs_to_remove.rename(columns={"dataset_id":"to_remove_dataset_id", "doc_id":"to_remove_doc_id"})
docs_to_remove = docs_to_remove.reset_index(drop=True).persist()
_ = wait(docs_to_remove)
del _ 

print("docs_to_remove", len(docs_to_remove))

docs_to_remove 81764804


Before removing the duplicates, we suggest resharding the data: the single-snapshot duplicate removal above can leave some partitions empty, and resharding fixes this.

In [10]:
output_resharded_dir = expand_outdir_and_mkdir(os.path.join(base_dir, "rpv2-2023-06-and-14-deduped-resharded"))

t0 = time.time()
reshard_jsonl(
    os.path.join(base_dir, "rpv2-2023-06-and-14-deduped"),
    output_resharded_dir,
    output_file_size="100M",
    start_index=0,
    file_prefix="rpv2-2023-06-and-14-deduped",
)
print(f"Data sharding took:{time.time()-t0}")

Data sharding took:904.7163739204407


In [5]:
from helper import convert_str_id_to_int

input_dataset = DocumentDataset.read_json(os.path.join(base_dir, "rpv2-2023-06-and-14-deduped-resharded"), backend="cudf")
input_df = input_dataset.df[['raw_content','id']]
meta = input_df._meta
meta['doc_id']=np.int64([0])
meta['dataset_id']=np.uint32([0])
input_df = input_df.map_partitions(
    convert_str_id_to_int,
    id_column="id",
    meta=meta,
)

Reading 72780 files


In [7]:
# No leading slash here: os.path.join would otherwise discard base_dir
dedup_output_dir = expand_outdir_and_mkdir(os.path.join(base_dir, "rpv2-2023-06-and-14-inter-deduped"))
deduped_df = input_df.merge(docs_to_remove,
                             left_on=['doc_id','dataset_id'],
                             right_on=["to_remove_doc_id", "to_remove_dataset_id"],
                             how='left')

deduped_df = deduped_df[deduped_df['to_remove_doc_id'].isna()].drop(columns=['to_remove_doc_id', "to_remove_dataset_id"]).reset_index(drop=True)

t0 = time.time()
deduped_df.to_parquet(dedup_output_dir)
print(f"Removing duplicates and writing deduped dataset took {time.time()-t0} seconds")

Removing duplicates and writing deduped dataset took 2084.46063041687 seconds


We can verify that the deduped dataset has 1,585,546,179 documents, compared to 1,667,310,983 documents before dedup.

In [8]:
len(deduped_df)

1585546179

In [9]:
len(input_df)

1667310983

# 6. Quality Filtering
<a id="filter"></a>

Web-crawled datasets often contain low-quality documents that we do not want the model to learn from. We can perform quality filtering to remove them. NeMo Curator offers modules for both classifier-based and heuristic-based filtering. In this tutorial, we apply a list of heuristic filters to improve data quality.

Curator provides a generic list of heuristic filters but for this tutorial, we only select 10 filters for demo purposes. The selected filters are given in `config/heuristic_filter_en.yaml`.

Heuristic filtering in Curator is a CPU module, so we need to use the CPU cluster.

In [2]:
scheduler_address = os.getenv('SCHEDULER_ADDRESS')
cpu_client = get_client(scheduler_address=scheduler_address)
print(f"Num Workers = {get_num_workers(cpu_client)}", flush=True)

Num Workers = 256


In [6]:
import nemo_curator
from nemo_curator.utils.config_utils import build_filter_pipeline

filter_config_file = os.path.join(base_dir, "config/heuristic_filter_en.yaml")
hf_input_data_dir = os.path.join(base_dir, "rpv2-2023-06-and-14-inter-deduped")
kept_document_dir =  expand_outdir_and_mkdir(os.path.join(base_dir,'rpv2-2023-06-and-14-heuristic-filtering','hf.parquet'))

In [4]:
t0 = time.time()

# Load dataset
dataset = DocumentDataset.read_parquet(hf_input_data_dir)

# construct pipeline from config
filter_pipeline = build_filter_pipeline(filter_config_file)

# filter data and write to disk
filtered_dataset = filter_pipeline(dataset)
filtered_dataset.to_parquet(kept_document_dir)

print(f"Time taken for Heuristic filtering: {time.time()-t0} s")

Reading 72780 files
Writing to disk complete for 72780 partitions
Time taken for Heuristic filtering: 5647.508106470108 s
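
To give a feel for what a heuristic filter does, here is a small stand-alone sketch that chains three simple checks. The filter functions and thresholds are invented for illustration and are not taken from `config/heuristic_filter_en.yaml`; the real pipeline runs the library's own filter classes on a distributed dataset.

```python
def word_count_ok(text, min_words=5, max_words=100_000):
    """Keep documents whose whitespace-delimited word count is in range."""
    return min_words <= len(text.split()) <= max_words


def symbol_ratio_ok(text, max_ratio=0.25):
    """Keep documents where few characters are neither alphanumeric nor whitespace."""
    if not text:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_ratio


def repeated_lines_ok(text, max_duplicate_fraction=0.3):
    """Keep documents where most non-empty lines are unique."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    return 1 - len(set(lines)) / len(lines) <= max_duplicate_fraction


FILTERS = [word_count_ok, symbol_ratio_ok, repeated_lines_ok]


def apply_filters(docs, filters=FILTERS):
    """A document is kept only if it passes every filter."""
    return [doc for doc in docs if all(check(doc) for check in filters)]


docs = [
    "NeMo Curator filters low quality web documents before training.",
    "$$$ !!! ### ??? *** ((( ))) @@@ %%% ^^^",
    "Click here\nClick here\nClick here\nClick here\nClick here",
    "Too short.",
]
kept = apply_filters(docs)
print(kept)
```

Each filter is cheap and independent, which is why this kind of filtering parallelizes well across CPU workers.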


After filtering, we have 1,229,679,047 documents left, removing 355,867,132 documents from the deduped dataset.

In [5]:
len(filtered_dataset)

1229679047
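
As a sanity check, the document counts reported throughout this tutorial can be reconciled with a few lines of arithmetic. All numbers below are copied from the cell outputs above; only the variable names are new.

```python
# Intra-snapshot dedup of snapshot 2023-06
snapshot_06_input = 1_088_311_520
snapshot_06_removed = 239_037_733
snapshot_06_deduped = snapshot_06_input - snapshot_06_removed

# Inter-snapshot dedup: one representative is kept per connected component,
# so removed documents = graph nodes - groups
cc_nodes, cc_groups = 303_857_252, 222_092_448
inter_removed = cc_nodes - cc_groups

combined_input = 1_667_310_983
combined_deduped = combined_input - inter_removed

# Heuristic filtering
after_filtering = 1_229_679_047
filtered_out = combined_deduped - after_filtering

print(f"snapshot 2023-06 after dedup: {snapshot_06_deduped:,}")
print(f"inter-snapshot duplicates:    {inter_removed:,}")
print(f"combined after dedup:         {combined_deduped:,}")
print(f"removed by filtering:         {filtered_out:,} ({filtered_out / combined_deduped:.1%})")
```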

We can also examine some example low quality documents:

In [5]:
def get_dataframe_complement(original_df, filtered_df):
    def partition_complement(part_original_df, partition_info=None):
        if not partition_info:
            return part_original_df
        part_filtered_df = filtered_df.get_partition(partition_info["number"])
        complement_mask = ~part_original_df.index.isin(part_filtered_df.index.persist())
        complement_df = part_original_df[complement_mask]
        return complement_df

    return original_df.map_partitions(partition_complement)

original_df = dd.read_parquet(hf_input_data_dir)
filtered_df = dd.read_parquet(kept_document_dir)
removed_df = get_dataframe_complement(original_df, filtered_df)
removed_df_example = removed_df.head()

In [None]:
print(removed_df_example.raw_content.iloc[0])

In [None]:
print(removed_df_example.raw_content.iloc[1])