# Building a Nemotron-CC data curation pipeline using NeMo Curator

This tutorial demonstrates how to build the Nemotron-CC data curation pipeline (https://arxiv.org/pdf/2412.02595) using NeMo Curator. NeMo Curator is a Python library consisting of scalable data-mining modules for curating natural language processing (NLP) data for training large language models (LLMs). It provides easy-to-use modules for data download, extraction, language identification, quality filtering, and deduplication, which can be combined to build high-quality data curation pipelines at scale from massive uncurated web corpora.

## Nemotron-CC curation pipeline components

- Common Crawl data download and extraction
- Language Identification and filtering
- Exact and fuzzy deduplication
- Heuristic filtering and Perplexity filtering
- Synthetic data generation
- Postprocessing

## Prerequisites

### System Requirements
The hardware and software configuration used for this notebook:

**GPU**: NVIDIA A100 80GiB.

**CUDA & NVIDIA Drivers**: CUDA 12.2 with Driver 535.154.05

**OS**: Ubuntu 22.04

### Getting NeMo Framework Training Container
- Get access to the container via https://developer.nvidia.com/nemo-framework
- Set your Docker credentials
    ```bash
    docker login nvcr.io

    Username: $oauthtoken
    Password: <Your NGC Key>
    ```
- Get the NeMo Framework Training Container
    ```bash
    docker pull nvcr.io/nvidia/nemo:25.05.rc2
    ```


## 0. Env Setup

In [1]:
!pip install jsonlines


In [2]:
%env CUDA_VISIBLE_DEVICES 0

env: CUDA_VISIBLE_DEVICES=0


In [1]:
import os

from nemo_curator.utils.distributed_utils import get_client, get_num_workers
from nemo_curator.utils.file_utils import get_all_files_paths_under, separate_by_metadata
from nemo_curator.utils.distributed_utils import read_data, write_to_disk
from nemo_curator.datasets import DocumentDataset
from helper import DataSizeTracker

import pandas as pd
import time
import cudf
import dask_cudf
import dask
import numpy as np
from dask.distributed import Client, LocalCluster
import jsonlines



In [6]:
def pre_imports():
    import cudf 

def check_jsonl_file(file_dir):
    for file in os.listdir(file_dir):
        if 'jsonl' not in file:
            continue
        with open(os.path.join(file_dir,file), 'r', encoding='utf-8') as f:
            first_line = f.readline()
            print(first_line)
        break

def extract_lines_with_id(file_path,target_list):
    with jsonlines.open(file_path) as reader:
        for obj in reader:
            if obj.get('id') in target_list:
                yield obj

def get_base_dataset_file_name(download_folder):
    files = os.listdir(download_folder)
    for file in files:
        if file.startswith('thwiki') and file.endswith(''):
            return file

In [3]:
cur_dir = os.getcwd()
print(cur_dir)
data_dir = f"{cur_dir}/data/"

/workspace/nemotron-cc


## 1. Download the Common Crawl dataset and extract text using jusText

We will download only two Common Crawl snapshots; however, the code works for any number of snapshots. We will use jusText and Trafilatura for extraction and `FastTextLangId` for language identification.

In [8]:
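# JusTextExtractor is used as the extraction algorithm in the download cell below
from nemo_curator.download import download_common_crawl, JusTextExtractor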

Start a CPU-based Dask cluster. Please modify `n_workers` and `memory_limit` according to your hardware specification. A `memory_limit` greater than 12GB is advised.

In [9]:
client = get_client(cluster_type="cpu", n_workers=10, processes=True, memory_limit='16GiB')
client

Client (Connection method: Cluster object, Cluster type: distributed.LocalCluster)
Dashboard: http://127.0.0.1:8787/status
Workers: 10, Total threads: 10, Total memory: 160.00 GiB
Status: running, Using processes: True
Each worker: 1 thread, 16.00 GiB memory, local directory under /tmp/dask-scratch-space/


In [4]:
#Output
download_base_directory= os.path.join(data_dir,"cc_crawl")
download_output_directory = os.path.join(download_base_directory,"data")

#Relevant parameters
start_snapshot = "2024-46"
end_snapshot = "2024-51"
language = 'EN'
url_limit = 10

In [None]:
# Download and sample data
common_crawl = download_common_crawl(
    download_output_directory, 
    start_snapshot, 
    end_snapshot, 
    url_limit=url_limit,
    output_type="jsonl", # Default - "jsonl"
    algorithm=JusTextExtractor(), # Default - JusTextExtractor
).df().compute()

**[Optional]** Verify result

In [16]:
# List all the file in the output directory.
!ls {download_output_directory}

# Please replace your dataset file name accordingly.
# ! wc -l  {download_output_directory}/{YOUR DATASET FILE NAME}.jsonl

CC-MAIN-20241201162023-20241201192023-00000.warc.gz.jsonl
CC-MAIN-20241201162023-20241201192023-00001.warc.gz.jsonl
CC-MAIN-20241201162023-20241201192023-00002.warc.gz.jsonl
CC-MAIN-20241201162023-20241201192023-00003.warc.gz.jsonl
CC-MAIN-20241201162023-20241201192023-00004.warc.gz.jsonl
CC-MAIN-20241201162023-20241201192023-00005.warc.gz.jsonl
CC-MAIN-20241201162023-20241201192023-00006.warc.gz.jsonl
CC-MAIN-20241201162023-20241201192023-00007.warc.gz.jsonl
CC-MAIN-20241201162023-20241201192023-00008.warc.gz.jsonl
CC-MAIN-20241201162023-20241201192023-00009.warc.gz.jsonl


In [17]:
check_jsonl_file(download_output_directory)

{"text":"Создаем узнаваемость\n\nпринимаем заказы по телефону200-19-00ежедневно с 09:00 до 18:00 (без выходных)или по почте zakaz@19.uz (круглосуточно)доставим чехлы бесплатно от 2-х до 4-х днейпо будням с 10:00 до 18:00 в пределах города","language":"RUSSIAN","url":"http:\/\/19.uz\/product.php?cat=00&id=26","warc_id":"327c8605-a7dc-4496-9a74-ac394d20bcbe","source_id":"crawl-data-CC-MAIN-2024-51-segments-1733066035857.0-warc-CC-MAIN-20241201162023-20241201192023-00002.warc.gz"}



In [None]:
!rm -r {download_output_directory}/downloads

**[Optional]** Close the Dask cluster. You might encounter an error such as `Caught signal 11`. This is harmless; just rerun the cell.

In [18]:
client.cluster.close()
client.shutdown()

Now that we have the dataset, it is useful to see how each step of the pipeline affects its size. A small helper class, `DataSizeTracker`, tracks this for us.

In [5]:
input_dataset = DocumentDataset.read_json(download_output_directory, backend='pandas')
print("Length of downloaded and extracted dataset:", len(input_dataset))
tracker = DataSizeTracker(len(input_dataset))

Reading 10 files with blocksize='1gb' / files_per_partition=None




Length of downloaded and extracted dataset: 157106
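The `helper` module that provides `DataSizeTracker` is not shown in this notebook. As a rough idea of what such a tracker does, here is a minimal sketch; the class name `SimpleSizeTracker` and its internals are hypothetical and only mimic the summary format printed later in this tutorial, where each incremental percentage is taken relative to the size after the previous step.

```python
class SimpleSizeTracker:
    """Hypothetical stand-in for helper.DataSizeTracker (not the real helper)."""

    def __init__(self, original_size):
        self.original_size = original_size
        self.stages = []  # list of (stage name, dataset size after that stage)

    def record_size(self, stage_name, size):
        self.stages.append((stage_name, size))

    def summary_lines(self):
        lines = [f"Original Size: {self.original_size}"]
        previous = self.original_size
        for name, size in self.stages:
            removed = previous - size
            # Percentage is relative to the size before this stage
            pct = 100.0 * removed / previous if previous else 0.0
            lines.append(f"{name}: {size}, Incremental Reduction: {removed} ({pct:.2f}%)")
            previous = size
        total_removed = self.original_size - previous
        total_pct = 100.0 * total_removed / self.original_size if self.original_size else 0.0
        lines.append(f"Overall Reduction: {total_removed} ({total_pct:.2f}%)")
        return lines

    def print_summary(self):
        print("\n".join(self.summary_lines()))


demo = SimpleSizeTracker(100)
demo.record_size("Language Identification", 40)
demo.print_summary()
```
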


## 2. Language separation and Unicode fixing

In this section, we will use a fastText language classification model to separate the Common Crawl dataset by each document's major language, and we will also fix the Unicode in the documents. The detailed steps are:

1. Download the fastText model for text language detection.
2. Construct a filter which uses the downloaded fastText model to produce a language label for each document.
3. Separate the documents by language label. This creates a sub-folder for each language under the output path, and the documents of each language are written to .jsonl files in the corresponding sub-folder.
4. Load the .jsonl files in the folder of the desired language. In this example, the `EN` folder will be loaded.
5. Apply `UnicodeReformatter` to the data and output the result in .jsonl format. 
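The separation in step 3 can be pictured with a small standard-library sketch. It is not NeMo Curator's `separate_by_metadata`; the function name `separate_by_language` and the output file name `part.jsonl` are assumptions made for illustration only.

```python
import json
import os
from collections import defaultdict


def separate_by_language(records, output_dir, language_field="language"):
    """Write records to <output_dir>/<LANG>/part.jsonl and return per-language counts."""
    groups = defaultdict(list)
    for record in records:
        groups[record[language_field]].append(record)

    counts = {}
    for lang, docs in groups.items():
        lang_dir = os.path.join(output_dir, lang)
        os.makedirs(lang_dir, exist_ok=True)
        # One JSON object per line, keeping non-ASCII text readable
        with open(os.path.join(lang_dir, "part.jsonl"), "w", encoding="utf-8") as f:
            for doc in docs:
                f.write(json.dumps(doc, ensure_ascii=False) + "\n")
        counts[lang] = len(docs)
    return counts
```

Loading only the sub-folder of the desired language (step 4) then amounts to reading the files under `<output_dir>/EN`.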



In [19]:
from nemo_curator import ScoreFilter, Modify
from nemo_curator.filters import FastTextLangId
from nemo_curator.modifiers import UnicodeReformatter

**[Optional]** Start a CPU-based Dask cluster.

In [None]:
# client = get_client(cluster_type="cpu", n_workers=10, processes=True, memory_limit='16GiB')
# client

Define parameters

In [7]:
# Input path
multilingual_data_path = download_output_directory

# Output path
language_base_output_path = os.path.join(data_dir,"language_sep")
language_data_output_path = os.path.join(language_base_output_path,"data")
language_separated_output_path = os.path.join(language_data_output_path,"language")
lang_sep_cleaned_data_output_path = os.path.join(language_data_output_path,"cleaned")

# Fasttext model path
model_path = language_base_output_path

# Define desired language
target_language = "EN"

# Define key in output .jsonl files to store the language information
language_field = "language"

Download fasttext model

In [None]:
!wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -P {model_path}

Apply fasttext model to separate documents by their languages

In [None]:
t0 = time.time()

# Load dataset 
multilingual_dataset = DocumentDataset.read_json(multilingual_data_path, blocksize="64MiB", add_filename=True)

#Define Language separation pipeline
lang_filter = FastTextLangId(os.path.join(model_path,'lid.176.bin'))
language_id_pipeline = ScoreFilter(lang_filter, score_field=language_field, score_type='object')
filtered_dataset = language_id_pipeline(multilingual_dataset)

# The language separation pipeline produces a result that looks like [0.96873, 'EN']; we only want to keep the 'EN' label (index 1) and drop the classifier score
filtered_dataset.df[language_field] = filtered_dataset.df[language_field].apply(lambda score: score[1],meta = (language_field, 'object'))

# Split the dataset to corresponding language sub-folders
language_stats = separate_by_metadata(filtered_dataset.df, language_separated_output_path, metadata_field=language_field).compute()

print(f"Time taken for splitting language:{time.time()-t0}")

Use `UnicodeReformatter` to fix any broken Unicode that appears in the dataset of the desired language

In [None]:
t0 = time.time()

# Read the language specific data and fix the unicode in it
lang_data_path = os.path.join(language_separated_output_path, target_language)
lang_data = DocumentDataset.read_json(lang_data_path, blocksize="64MiB", add_filename=True)

cleaner = Modify(UnicodeReformatter())
cleaned_data = cleaner(lang_data)

# Write the cleaned_data
cleaned_data.to_json(lang_sep_cleaned_data_output_path, write_to_filename=True)

print(f"Time taken for fixing unicode:{time.time()-t0}")
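To give a feel for what "fixing Unicode" means, the sketch below repairs one common kind of damage: UTF-8 bytes that were wrongly decoded as Latin-1 (mojibake). It is only an illustration with a hypothetical function name, `repair_mojibake`; `UnicodeReformatter` is, to our understanding, built on the `ftfy` library and covers far more cases than this.

```python
def repair_mojibake(text):
    """Undo one common encoding mix-up: UTF-8 bytes that were decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text is not Latin-1-decoded UTF-8, so leave it untouched
        return text


# "caf\u00c3\u00a9" is how "café" looks after the wrong decoding
print(repair_mojibake("caf\u00c3\u00a9"))  # prints café
```

Text that is already correct, or that contains characters outside Latin-1, is returned unchanged.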

**[Optional]** Verify the result. Some documents have been removed from the Common Crawl dataset: the number of lines in the output files is smaller than in the original files.

In [21]:
# List all the file in the output directory.
! ls {lang_sep_cleaned_data_output_path}

# Please replace your dataset file name accordingly.
# ! wc -l  {lang_sep_cleaned_data_output_path}/{YOUR DATASET FILE NAME}.jsonl

CC-MAIN-20241201162023-20241201192023-00000.warc.gz.jsonl
CC-MAIN-20241201162023-20241201192023-00001.warc.gz.jsonl
CC-MAIN-20241201162023-20241201192023-00002.warc.gz.jsonl
CC-MAIN-20241201162023-20241201192023-00003.warc.gz.jsonl
CC-MAIN-20241201162023-20241201192023-00004.warc.gz.jsonl
CC-MAIN-20241201162023-20241201192023-00005.warc.gz.jsonl
CC-MAIN-20241201162023-20241201192023-00006.warc.gz.jsonl
CC-MAIN-20241201162023-20241201192023-00007.warc.gz.jsonl
CC-MAIN-20241201162023-20241201192023-00008.warc.gz.jsonl
CC-MAIN-20241201162023-20241201192023-00009.warc.gz.jsonl


In [22]:
check_jsonl_file(os.path.join(language_separated_output_path,'EN'))

{"language":"EN","source_id":"crawl-data-CC-MAIN-2024-51-segments-1733066035857.0-warc-CC-MAIN-20241201162023-20241201192023-00002.warc.gz","text":"Cook pasta according to package instructions. Saute ham in oil to brown it a bit and then remove from pan and set aside. Cook onion 4 minutes then add mushrooms and continue cooking another 4 minutes. Mix in flour, rosemary and pepper then gradually add milk and bring to a boil and cook 2 minutes to thicken it up. Then reduce heat and add peas and sour cream and cook 2 minutes. Mix with the drained pasta and ham and heat through.\n\nJoin us on Facebook and Get Notified When New Recipes Are Posted\n\nAbout 400 Calories or Less\n\nWhat's a foodie to do when she passes the 40 mark and her metabolism comes to a screeching halt? (hint...denial did not work !) So I've put together a collection of everyday meals that are simple to make, under 400 calories, yet so savory and delicious that you won't miss your old favorites and standbys. I've tested

In [9]:
lang_id_dataset = DocumentDataset.read_json(os.path.join(language_separated_output_path,'EN'), backend='pandas')
tracker.record_size("Language Identification", len(lang_id_dataset))
tracker.print_summary()

Reading 10 files with blocksize='1gb' / files_per_partition=None




Original Size: 157106
Language Identification: 74194, Incremental Reduction: 82912 (52.77%)
Overall Reduction: 82912 (52.77%)


**[Optional]** Close the Dask cluster.

In [None]:
# client.cluster.close()
# client.shutdown()

## 3. Data Deduplication

We will perform both exact and fuzzy deduplication in this part of the tutorial.

In exact deduplication, the text of each document is hashed with a hashing algorithm such as `md5`. Documents with identical hash values have identical text. We will output the IDs of the duplicated documents for later removal. The class used is `ExactDuplicates`. Its arguments include:
- `id_field`: Key in input file for identifying document ID
- `text_field`: Key in input file which contains document text.
- `hash_method`: Hashing algorithm used. Default is `md5`
- `cache_dir`: If specified, the duplicated document IDs will be output to the `cache_dir`. Otherwise, the IDs will not be saved
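The idea behind exact deduplication can be sketched on the CPU with the standard library. This is not the `ExactDuplicates` implementation; `find_exact_duplicates` is a hypothetical helper that hashes each document's text with MD5 and reports the IDs that share a hash.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(docs, id_field="id", text_field="text"):
    """Return {md5 hex digest: [ids]} for every hash shared by more than one document."""
    by_hash = defaultdict(list)
    for doc in docs:
        digest = hashlib.md5(doc[text_field].encode("utf-8")).hexdigest()
        by_hash[digest].append(doc[id_field])
    # Only hashes that occur more than once describe duplicates
    return {h: ids for h, ids in by_hash.items() if len(ids) > 1}


docs = [
    {"id": "a", "text": "same text"},
    {"id": "b", "text": "other text"},
    {"id": "c", "text": "same text"},
]
print(list(find_exact_duplicates(docs).values()))  # prints [['a', 'c']]
```
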

We will also use a GPU Dask cluster to accelerate both exact and fuzzy deduplication.


Before performing deduplication, it's crucial to ensure each document in our dataset has a unique identifier. While some datasets like Common Crawl might have a `source_id` field, it's often insufficient for uniquely identifying individual records. To address this, we'll generate and assign a unique ID to each document, following the format `<prefix>-<id>` (for example, `EN_CC-0000000000`). This unified `id` field is particularly useful when working with multiple datasets, as it allows us to easily track the origin of removed documents during the deduplication process. We'll use the `AddId` module from the NeMo Curator library to achieve this. Its key parameters are:
- `id_field`: The field to be added to the input JSON file. If this key already exists, its value will be replaced with the generated ID.
- `id_prefix`: A prefix string to be added to the beginning of each generated ID (e.g., 'doc_id').
- `start_index`: The starting index for the ID sequence. If set to `None`, an unordered ID scheme is used for faster processing. In this notebook, we set it to 0 for easier reference and tracking.
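Judging from the IDs that appear in the outputs of this notebook (for example `EN_CC-0000066724`), an ordered ID consists of the prefix, a hyphen, and a 10-digit zero-padded counter. The sketch below reproduces that shape; `make_ids` is a hypothetical helper, not part of NeMo Curator, and the exact padding width is inferred from the outputs rather than from the library source.

```python
def make_ids(num_docs, id_prefix="doc_id", start_index=0):
    """Generate ordered IDs of the form <prefix>-<10-digit counter>."""
    return [f"{id_prefix}-{i:010d}" for i in range(start_index, start_index + num_docs)]


print(make_ids(2, id_prefix="EN_CC"))  # prints ['EN_CC-0000000000', 'EN_CC-0000000001']
```
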

In [23]:
from nemo_curator import AddId

**[Optional]** If there is no running Dask cluster, start CPU based Dask cluster.

In [None]:
# cluster = LocalCluster(n_workers=10, processes=True, memory_limit='16GB')
# client = Client(cluster)

In [12]:
#Input
add_id_input_data_dir = lang_sep_cleaned_data_output_path

#Output
added_id_output_path = os.path.join(data_dir,"add_id/cleaned")

#Format of output ID will be <prefix>-<id>. Define the prefix here
add_ID_id_prefix="EN_CC"

In [None]:
t0 = time.time()
# Read input files
dataset = DocumentDataset.read_json(add_id_input_data_dir,add_filename=True)

# Run AddId() on the input dataset
add_id = AddId(id_field='id',id_prefix=add_ID_id_prefix,start_index=0)
id_dataset = add_id(dataset)

#Output files
id_dataset.to_json(added_id_output_path, write_to_filename=True)

print(f"Time taken for add ID:{time.time()-t0}")

In [25]:
check_jsonl_file(added_id_output_path)

{"language":"EN","source_id":"crawl-data-CC-MAIN-2024-51-segments-1733066035857.0-warc-CC-MAIN-20241201162023-20241201192023-00002.warc.gz","text":"Cook pasta according to package instructions. Saute ham in oil to brown it a bit and then remove from pan and set aside. Cook onion 4 minutes then add mushrooms and continue cooking another 4 minutes. Mix in flour, rosemary and pepper then gradually add milk and bring to a boil and cook 2 minutes to thicken it up. Then reduce heat and add peas and sour cream and cook 2 minutes. Mix with the drained pasta and ham and heat through.\n\nJoin us on Facebook and Get Notified When New Recipes Are Posted\n\nAbout 400 Calories or Less\n\nWhat's a foodie to do when she passes the 40 mark and her metabolism comes to a screeching halt? (hint...denial did not work !) So I've put together a collection of everyday meals that are simple to make, under 400 calories, yet so savory and delicious that you won't miss your old favorites and standbys. I've tested

Close the Dask cluster if one is still running. A new GPU Dask cluster is started in the next task, so any CPU cluster must be shut down first.

In [None]:
# client.cluster.close()
# client.shutdown()

Now, let's start with the exact deduplication process.

In [26]:
from nemo_curator.modules import ExactDuplicates

Start a GPU based Dask cluster. Since GPU based Dask cluster involves setting several arguments, we will use the `get_client()` wrapper function to quickly set up. 

In [None]:
client = get_client(cluster_type = 'gpu', set_torch_to_use_rmm=False)
print(f"Number of dask worker:{get_num_workers(client)}")
client.run(pre_imports)
client

If you encounter the error
`get_client() missing 1 required positional argument: 'args'`:

This is probably because the installed `nemo_curator` library is out of date. Run the following line in a terminal, following the instructions in our [GitHub](https://github.com/nicoleeeluo/NeMo-Curator/tree/main) repo, and restart the notebook. The intermediate results of the previous sections have been saved to disk, so you can resume from this section after updating.

In [None]:
#pip install --extra-index-url https://pypi.nvidia.com ".[cuda12x]"

Define parameters

In [None]:
#Input
exact_dedup_input_dataset_dir = added_id_output_path

#Output
exact_dedup_base_output_path = os.path.join(data_dir,"exact_dedup")
exact_dedup_log_dir = os.path.join(exact_dedup_base_output_path,'log')
exact_dedup_output_dir = os.path.join(exact_dedup_base_output_path,'data')

#Parameters for ExactDuplicates()
exact_dedup_dataset_id_field = "id"
exact_dedup_dataset_text_field = "text"


In [None]:
!mkdir -p {exact_dedup_log_dir}
!mkdir -p {exact_dedup_output_dir}

Apply exact deduplication

In [None]:
t0 = time.time()
# Read input dataset
input_dataset = DocumentDataset.read_json(exact_dedup_input_dataset_dir, backend='cudf')

#Run exact deduplication to the input
exact_dup = ExactDuplicates(
    logger=exact_dedup_log_dir,
    id_field=exact_dedup_dataset_id_field,
    text_field=exact_dedup_dataset_text_field,
    hash_method="md5",
    cache_dir=exact_dedup_output_dir #Duplicated document ID list is output to the cache_dir
)
duplicates = exact_dup(dataset=input_dataset)

print(f"Number of exact duplicated file:{len(duplicates)}")

print(f"Time taken for exact duplicate:{time.time()-t0}")

**[Optional]** Verify the output duplicate IDs. We can group by `_hashes` to get the lists of duplicated documents that share the same hash, and use `extract_lines_with_id()` to verify that those documents are indeed exact duplicates. Note that the `id` values might change between runs, so replace the `target_list` when necessary.

In [None]:
exact_dedup_res = pd.read_parquet(os.path.join(exact_dedup_output_dir,"_exact_duplicates.parquet"))
print(f"Number of exact duplicated document:{len(exact_dedup_res)}")
exact_dedup_res.head()

In [None]:
duplicated_list = exact_dedup_res.groupby('_hashes')['id'].agg(list).reset_index().head()
duplicated_list

Using the duplicated IDs shown above, check the content to confirm that the documents are exact duplicates.

In [None]:
# example_duplicates = duplicated_list["id"].to_list()[0][0:4]

# for line in extract_lines_with_id(os.path.join(exact_dedup_input_dataset_dir,'{YOUR DATASET FILE NAME}'),example_duplicates):
#     print(line)

**[Optional]** You might choose to close Dask cluster here

In [None]:
# client.cluster.close()
# client.shutdown()

Now, let's perform fuzzy deduplication.

Fuzzy deduplication involves 3 to 5 intermediate steps to generate duplicates. Refer to https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html for details.

Fuzzy deduplication in this example is a GPU implementation of the MinHash LSH algorithm. This algorithm measures similarity based on text statistics, not on the semantic meaning of the text.

At a high level, the algorithm has the following steps:
1. Compute a minhash signature for each document.
2. Run Locality Sensitive Hashing (LSH) on the minhash signatures, which assigns each document to multiple buckets. Documents within the same bucket are deemed to be similar.
3. **[Optional]**: Compute pair-wise Jaccard similarity between the documents in each bucket to remove false positives within the buckets.
4. Based on the buckets and the Jaccard values between documents (if computed), transform the documents deemed similar across buckets into a graph and run the connected components algorithm. Each connected component in the graph is a final group of similar documents, and the IDs within each group are output for duplicate removal.

For a more detailed explanation, please refer to https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html#fuzzy-deduplication.
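The steps above can be sketched in plain Python on a handful of documents. This is a CPU toy, not the NeMo Curator implementation: the hash function (salted BLAKE2b), the default parameter values, and all function names are assumptions chosen for illustration, and the optional Jaccard check of step 3 is skipped.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def char_ngrams(text, n=5):
    """Set of character n-grams; a text shorter than n yields itself as the only gram."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def hash64(gram, seed):
    """64-bit hash of a gram; the seed selects one of many hash functions."""
    h = hashlib.blake2b(gram.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little"))
    return int.from_bytes(h.digest(), "little")


def minhash_signature(text, num_hashes, n=5):
    """Step 1: keep the minimum hash value per seeded hash function."""
    grams = char_ngrams(text, n)
    return [min(hash64(g, seed) for g in grams) for seed in range(num_hashes)]


def lsh_candidate_pairs(signatures, num_buckets, hashes_per_bucket):
    """Step 2: documents that agree on every hash of at least one band are candidates."""
    pairs = set()
    for band in range(num_buckets):
        lo = band * hashes_per_bucket
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[tuple(sig[lo:lo + hashes_per_bucket])].append(doc_id)
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


def connected_components(doc_ids, pairs):
    """Step 4: union-find over candidate pairs; returns groups with more than one member."""
    parent = {d: d for d in doc_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = defaultdict(list)
    for d in doc_ids:
        groups[find(d)].append(d)
    return [sorted(g) for g in groups.values() if len(g) > 1]


def fuzzy_duplicate_groups(docs, num_buckets=20, hashes_per_bucket=4, n=5):
    """docs maps document ID to text; returns lists of IDs deemed near-duplicates."""
    num_hashes = num_buckets * hashes_per_bucket
    sigs = {doc_id: minhash_signature(text, num_hashes, n) for doc_id, text in docs.items()}
    pairs = lsh_candidate_pairs(sigs, num_buckets, hashes_per_bucket)
    return connected_components(list(docs), pairs)
```

Calling `fuzzy_duplicate_groups({"a": text_a, "b": text_b, ...})` returns the duplicate groups; identical texts always end up in the same group, while near-identical texts are grouped with a probability that depends on their similarity and on the two bucket parameters.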

In this section, we will use the fuzzy deduplication wrapper offered by NeMo curator instead of running each step individually.

**If there is no running Dask cluster, start a GPU Dask cluster here**

In [None]:
# client = get_client(cluster_type = 'gpu', set_torch_to_use_rmm=False)
# print(f"Number of dask worker:{get_num_workers(client)}")
# client.run(pre_imports)

In [27]:
from nemo_curator import FuzzyDuplicates, FuzzyDuplicatesConfig

In [28]:
#Input
fuzzy_dedup_data_path = added_id_output_path
#Output
fuzzy_dedup_base_output_path = os.path.join(data_dir,"fuzzy_wrapper")
fuzzy_dedup_log_dir = os.path.join(fuzzy_dedup_base_output_path,'log')
fuzzy_dedup_no_false_positive_cache_dir = os.path.join(fuzzy_dedup_base_output_path,'cache_nofp')
fuzzy_dedup_false_positive_cache_dir = os.path.join(fuzzy_dedup_base_output_path,'cache_fp')
fuzzy_dedup_output_dir = os.path.join(fuzzy_dedup_base_output_path,'data')
#Specify dataset name
dataset_name = 'EN_CC'

#Relevant parameters
id_field = 'id'
text_field = 'text'
filetype = "parquet"

!mkdir -p {fuzzy_dedup_base_output_path}
!mkdir -p {fuzzy_dedup_log_dir}
!mkdir -p {fuzzy_dedup_no_false_positive_cache_dir}
!mkdir -p {fuzzy_dedup_false_positive_cache_dir}
!mkdir -p {fuzzy_dedup_output_dir}

**[Optional]** If the cache folder is not empty, please CLEAR the folder before proceeding

In [None]:
# !rm -r {fuzzy_dedup_no_false_positive_cache_dir}

In [None]:
t0 = time.time()

input_dataset = DocumentDataset.read_json(fuzzy_dedup_data_path, backend='cudf')

fuzzy_dedup_config = FuzzyDuplicatesConfig(
    cache_dir=fuzzy_dedup_no_false_positive_cache_dir,
    id_field=id_field,
    text_field=text_field,
    seed=10,
    char_ngrams=24,
    num_buckets=20,
    hashes_per_bucket=13,
    use_64_bit_hash=False,
    buckets_per_shuffle=5,
    false_positive_check=False,
)

fuzzy_dup = FuzzyDuplicates(logger=fuzzy_dedup_log_dir, config=fuzzy_dedup_config)
duplicates = fuzzy_dup(dataset=input_dataset)

duplicates.to_parquet(fuzzy_dedup_output_dir, write_to_filename=False)

print(f"Time taken for Fuzzy Deduplication (No False Positive Check): {time.time()-t0} s")
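The two LSH parameters in the configuration above, `num_buckets=20` and `hashes_per_bucket=13`, together decide how similar two documents must be before they are likely to share a bucket. Under the standard MinHash LSH analysis (an approximation, not something computed by NeMo Curator itself), two documents with Jaccard similarity $s$ become candidates with probability $1 - (1 - s^r)^b$, where $b$ is the number of buckets and $r$ the hashes per bucket, and the curve rises most steeply near $(1/b)^{1/r}$. The function names below are hypothetical.

```python
def candidate_probability(similarity, num_buckets, hashes_per_bucket):
    """Probability that two documents with this Jaccard similarity share at least one bucket."""
    return 1.0 - (1.0 - similarity ** hashes_per_bucket) ** num_buckets


def approximate_threshold(num_buckets, hashes_per_bucket):
    """Rule-of-thumb similarity where the S-curve rises most steeply: (1/b)^(1/r)."""
    return (1.0 / num_buckets) ** (1.0 / hashes_per_bucket)


print(round(approximate_threshold(20, 13), 3))  # prints 0.794
```

With these settings, documents need a Jaccard similarity of roughly 0.8 on their character n-grams to be flagged reliably; fewer hashes per bucket lower that threshold and produce more candidates.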


In [29]:
fuzzy_dedup_res = pd.read_parquet(fuzzy_dedup_output_dir)
fuzzy_dedup_res.head()

| | group | id |
|---|---|---|
| 0 | 6872 | EN_CC-0000066724 |
| 1 | 10232 | EN_CC-0000006501 |
| 2 | 6874 | EN_CC-0000019908 |
| 3 | 10234 | EN_CC-0000047960 |
| 4 | 2872 | EN_CC-0000004459 |


This section removes duplicate documents identified by exact and fuzzy deduplication. It loads the deduplication results and the input dataset, then removes the identified duplicates using DataFrame operations.
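The removal rule used in this section is "keep the first document of every duplicate group and drop the rest". A plain-Python sketch of that rule follows; the helper names `ids_to_remove` and `remove_duplicates` are hypothetical, and the real cells operate on DataFrames rather than lists of dictionaries.

```python
def ids_to_remove(duplicate_rows, group_field="group", id_field="id"):
    """Keep the first ID seen in each duplicate group and return all the others."""
    seen_groups = set()
    to_remove = set()
    for row in duplicate_rows:
        if row[group_field] in seen_groups:
            to_remove.add(row[id_field])
        else:
            seen_groups.add(row[group_field])
    return to_remove


def remove_duplicates(docs, to_remove, id_field="id"):
    """Return the documents whose ID is not marked for removal."""
    return [d for d in docs if d[id_field] not in to_remove]


rows = [
    {"group": 1, "id": "doc-0"},
    {"group": 1, "id": "doc-1"},
    {"group": 2, "id": "doc-2"},
]
print(sorted(ids_to_remove(rows)))  # prints ['doc-1']
```
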

Define parameters

In [13]:
#Input
dataset_dir = added_id_output_path

#Output
dudped_output_dir = os.path.join(data_dir,"remove_duplicate/result.parquet")

#Relevant parameters
input_id_field = 'id'
id_prefix = add_ID_id_prefix

!mkdir -p {dudped_output_dir}

We first process the result of exact deduplication. Because it contains the original IDs used in the input dataset, it is the more straightforward of the two to handle.

In [None]:
#Load .jsonl dataset
input_dataset = DocumentDataset.read_json(dataset_dir, backend='cudf')

#Load exact deduplicate result and extract list of duplicated document ID
exact_duplicates = DocumentDataset.read_parquet(os.path.join(exact_dedup_output_dir,"_exact_duplicates.parquet"), backend='cudf')
exact_docs_to_remove = exact_duplicates.df.map_partitions(
    lambda x: x[x._hashes.duplicated(keep="first")]
)

#Remove the duplicated document from input dataset
result = input_dataset.df[
    ~input_dataset.df[input_id_field].isin(exact_docs_to_remove[input_id_field].compute())
]

In [None]:
#Loads result from fuzzy dedup wrapper
fuzzy_duplicates = pd.read_parquet(fuzzy_dedup_output_dir)

#Generate list of near duplicate document ID
fuzzy_docs_to_remove = fuzzy_duplicates[fuzzy_duplicates.duplicated(subset=['group'], keep='first')]

In [None]:
#Remove near duplicates
result = result[~result[input_id_field].isin(fuzzy_docs_to_remove[input_id_field])]

#Save final result to local
result.to_parquet(dudped_output_dir, write_to_filename=True)

Verify the result of duplicate removal. The number of documents in the result is smaller than in the original dataset.

In [15]:
res = pd.read_parquet(dudped_output_dir)
tracker.record_size("Exact and Fuzzy deduplication", len(res))
tracker.print_summary()

Original Size: 157106
Language Identification: 74194, Incremental Reduction: 82912 (52.77%)
Exact and Fuzzy deduplication: 62984, Incremental Reduction: 11210 (15.11%)
Overall Reduction: 94122 (59.91%)


Close the GPU Dask cluster. You might encounter an error such as `Caught signal 11`. This is harmless; just rerun the cell.

In [None]:
client.cluster.close()
client.shutdown()

## 4. Heuristic and Perplexity Filtering

In this section, we apply multiple heuristic filters to the dataset and record, for each filter, the heuristic scores of the documents and the documents it removes. Each heuristic filter calculates a quality score based on user-defined heuristics or algorithms and classifies a document as high quality if its score passes the user-defined threshold, and as low quality otherwise.

For the detailed implementation and description of each heuristic filter, please refer to `./NeMo-Curator/nemo_curator/filters/heuristic_filter.py`. To implement a customized heuristic filter, follow the sample implementations, write your own filter, and update the .yaml files accordingly.
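The score-then-threshold pattern described above can be sketched as follows. The method names `score_document` and `keep_document` follow the convention used by NeMo Curator filters, but the class `SimpleWordCountFilter`, its default thresholds, and the `apply_filter` helper are hypothetical and stand alone; they do not subclass or replace any library filter.

```python
class SimpleWordCountFilter:
    """Hypothetical heuristic: keep documents whose word count lies in [min_words, max_words]."""

    def __init__(self, min_words=5, max_words=100000):
        self.min_words = min_words
        self.max_words = max_words

    def score_document(self, text):
        # The heuristic score is simply the number of whitespace-separated words
        return len(text.split())

    def keep_document(self, score):
        return self.min_words <= score <= self.max_words


def apply_filter(docs, doc_filter, text_field="text", score_field="word_count"):
    """Split docs into (kept, removed) and attach the score to each document."""
    kept, removed = [], []
    for doc in docs:
        score = doc_filter.score_document(doc[text_field])
        target = kept if doc_filter.keep_document(score) else removed
        target.append({**doc, score_field: score})
    return kept, removed
```

Keeping the removed documents alongside their scores, as `apply_filter` does, makes it easier to inspect what each threshold discards before committing to it.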

In [42]:
from nemo_curator.utils.config_utils import build_filter_pipeline
from nemo_curator import Score, ScoreFilter
from nemo_curator.utils.file_utils import expand_outdir_and_mkdir

**[Optional]** The following cell is to remove warning from dask.

In [None]:
import warnings

# Disable the metadata warning
warnings.filterwarnings("ignore", module="dask.dataframe.core")

Create a CPU Dask Cluster.

In [None]:
client = get_client(cluster_type="cpu", n_workers=10, processes=True, memory_limit='16GiB')
client

Define some helper functions

In [None]:
def get_dataframe_complement(original_df, filtered_df):
    def partition_complement(part_original_df, partition_info=None):
        if not partition_info:
            return part_original_df
        part_filtered_df = filtered_df.get_partition(partition_info["number"])
        complement_mask = ~part_original_df.index.isin(part_filtered_df.index.persist())
        complement_df = part_original_df[complement_mask]
        return complement_df

    return original_df.map_partitions(partition_complement)

def write_scores(df, output_dir):
    for column in df.columns:
        output_path = os.path.join(output_dir, f"{column}.txt")
        df[column].to_csv(output_path, single_file=True, encoding="utf-8", header=False, index=False, mode="a")

def get_score_fields(pipeline):
    score_fields = []
    for nc_module in pipeline.modules:
        if isinstance(nc_module, Score) or isinstance(nc_module, ScoreFilter):
            if nc_module.score_field:
                score_fields.append(nc_module.score_field)
    return score_fields

Define parameters

In [17]:
#Input
HF_input_data_dir = dudped_output_dir
input_file_type = 'parquet'
batch_size = 1

#Output
HF_base_output_path = os.path.join(data_dir,'heuristic_filtering')
kept_document_dir =  os.path.join(HF_base_output_path,'data','hq.parquet')
removed_document_dir =  os.path.join(HF_base_output_path,'data','lq.parquet')
output_document_score_dir =  os.path.join(HF_base_output_path,'data','score')
output_file_type = 'parquet'

#Relevant parameters
filter_config_file = './config/heuristic_filter_en.yaml'
input_id_field = 'id'

!mkdir -p {kept_document_dir}
!mkdir -p {removed_document_dir}
!mkdir -p {output_document_score_dir}

Run heuristic filtering

In [None]:
t0 = time.time()

#Load filters from config
filter_pipeline = build_filter_pipeline(filter_config_file)
score_fields = get_score_fields(filter_pipeline)

# Load dataset
dataset = DocumentDataset.read_parquet(HF_input_data_dir, files_per_partition=1, blocksize=None, backend='pandas', add_filename=True)

filtered_dataset = filter_pipeline(dataset)

# Write scores of retained documents to a separate directory
output_df = filtered_dataset.df[[input_id_field, *score_fields]]
write_scores(output_df, output_document_score_dir)

# Remove scores from dataset df
filtered_dataset = DocumentDataset(filtered_dataset.df.drop(columns=score_fields))

# Output filtered dataset
filtered_dataset.to_parquet(kept_document_dir, write_to_filename=True)

print(f"Time taken for Heuristic filtering: {time.time()-t0} s")

**[Optional]** Verify the result.

In [18]:
res = pd.read_parquet(kept_document_dir)
res.head()

| | id | language | source_id | text | url | warc_id |
|---|---|---|---|---|---|---|
| 0 | EN_CC-0000000000 | EN | crawl-data-CC-MAIN-2024-51-segments-1733066035... | Media coverage\n\nParallels Coming to the 2014... | http://028zq.com/news/shownews.php?id=16&lang=en | deee4ce0-c404-4ff3-8700-0b45e8ac6b5e |
| 1 | EN_CC-0000000001 | EN | crawl-data-CC-MAIN-2024-51-segments-1733066035... | techspace-skywatch\n\nA network of Autonomous ... | http://2014.spaceappschallenge.org/project/tec... | 6c17be7f-7253-4703-aead-0aa6e53bcb5d |
| 2 | EN_CC-0000000002 | EN | crawl-data-CC-MAIN-2024-51-segments-1733066035... | lemon rosemary chicken\n\nLeave a Reply\n\nJoi... | http://400caloriesorless.com/?attachment_id=3867 | f70dc315-8ee5-4cd1-b685-975f36f15e90 |
| 3 | EN_CC-0000000004 | EN | crawl-data-CC-MAIN-2024-51-segments-1733066035... | Teen Patti Master – Update APK Download & Get ... | http://789mgmslots.com/2024/08/26/teen-patti-m... | 48640b9c-d821-4c3a-bf31-a16e3eb30f0f |
| 4 | EN_CC-0000000005 | EN | crawl-data-CC-MAIN-2024-51-segments-1733066035... | Lump sum payment specialists since 1992\n\nFor... | http://GRANOFFENTERPRISES.COM/ | da02f4d5-57bb-4dde-9089-cacb273d2362 |


In [19]:
tracker.record_size("Heuristic Filtering", len(res))
tracker.print_summary()

Original Size: 157106
Language Identification: 74194, Incremental Reduction: 82912 (52.77%)
Exact and Fuzzy deduplication: 62984, Incremental Reduction: 11210 (15.11%)
Heuristic Filtering: 47107, Incremental Reduction: 15877 (25.21%)
Overall Reduction: 109999 (70.02%)


Close the CPU Dask Cluster

In [None]:
client.cluster.close()
client.shutdown()