<img src="./images/DLI_Header.png" style="width: 400px;">


# 2. Advanced Data Processing


In this notebook, we will use NeMo Curator to perform several crucial data cleaning steps, such as language detection and filtering, topic classification, and deduplication.

This notebook is structured as follows:
- First, we will explore language detection and filtering to separate our multilingual dataset by language.
- Next, we will dive into topic classification to categorize the datasets into relevant themes.
- Finally, we will explore document deduplication, covering both exact and fuzzy methods.


**[2.1 Language Separation](#2.1-Language-Separation)<br>**
**[2.2 Domain Classification](#2.2-Domain-Classification)<br>**
**[2.3 Deduplication](#2.3-Deduplication)<br>**




***************
### Environment Setup



In [1]:
import warnings

# Ignore any warning
warnings.filterwarnings("ignore")

The next cell starts a local Dask cluster with GPU workers and imports `cudf` on each worker.

In [2]:
from nemo_curator.utils.distributed_utils import get_client, get_num_workers


def pre_imports():
    import cudf


client = get_client(cluster_type="gpu", set_torch_to_use_rmm=False)

print(f"Number of dask worker:{get_num_workers(client)}")
client.run(pre_imports)


stdout:



stderr:

Traceback (most recent call last):
  File "<string>", line 7, in <module>
  File "/home/aibeceles/lession_01_wsl/nemo_env_clean/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/runtime.py", line 111, in get_version
    self.cudaRuntimeGetVersion(ctypes.byref(rtver))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aibeceles/lession_01_wsl/nemo_env_clean/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/runtime.py", line 65, in __getattr__
    self._initialize()
  File "/home/aibeceles/lession_01_wsl/nemo_env_clean/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/runtime.py", line 51, in _initialize
    self.lib = open_cudalib('cudart')
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aibeceles/lession_01_wsl/nemo_env_clean/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/libs.py", line 84, in open_cudalib
    return ctypes.CDLL(path)
           ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/ctypes/__init__.py", line 379, in __in

cuDF Spilling is enabled
Number of dask worker:1


{'tcp://127.0.0.1:40403': None}

2026-01-02 17:16:00,247 - tornado.application - ERROR - Exception in callback <bound method SystemMonitor.update of <SystemMonitor: cpu: 6 memory: 3729 MB fds: 157>>
Traceback (most recent call last):
  File "/home/aibeceles/lession_01_wsl/nemo_env_clean/lib/python3.12/site-packages/tornado/ioloop.py", line 945, in _run
    val = self.callback()
          ^^^^^^^^^^^^^^^
  File "/home/aibeceles/lession_01_wsl/nemo_env_clean/lib/python3.12/site-packages/distributed/system_monitor.py", line 210, in update
    gpu_metrics = nvml.real_time()
                  ^^^^^^^^^^^^^^^^
  File "/home/aibeceles/lession_01_wsl/nemo_env_clean/lib/python3.12/site-packages/distributed/diagnostics/nvml.py", line 370, in real_time
    "utilization": _get_utilization(h),
                   ^^^^^^^^^^^^^^^^^^^
  File "/home/aibeceles/lession_01_wsl/nemo_env_clean/lib/python3.12/site-packages/distributed/diagnostics/nvml.py", line 339, in _get_utilization
    return pynvml.nvmlDeviceGetUtilizationRates(h).gpu


Let's load the multilingual dataset.

In [3]:
from nemo_curator.datasets import DocumentDataset

multilingual_data_path = "./original_data"
multilingual_dataset = DocumentDataset.read_json(
    multilingual_data_path, add_filename=True
)

Reading 1 files with blocksize='1gb' / files_per_partition=None


In [4]:
# check the data
multilingual_dataset.head()

Unnamed: 0,file_name,text,timestamp,url
0,file.json,Dragon Ball: Le 20e film de la sage sortira le...,2019-01-21 03:52:10,https://cultinfos.com/buzz/332814-dragon-ball-...
1,file.json,Cours D'histoire Des États Européens: Depuis L...,2019-01-17 23:25:39,https://www.bookvoed.ru/book?id=1433688
2,file.json,Se realizó una jornada de promoción del buentr...,2018-04-21 07:38:28,http://www.desarrollosocial.gob.ar/noticias/se...
3,file.json,Restaurantes con Web Y Telefono Y Dias Y Horar...,2020-08-11 16:33:05,http://mendoza.guia.clarin.com/restaurantes-co...
4,file.json,Responsable qualité - Intérim : Emploi et recr...,2020-08-07 01:17:37,https://images3.meteojob.com/Emploi-Interim-Re...


## 2.1 Language Separation

In this section, we will use the language identification model from [fastText](https://fasttext.cc/docs/en/language-identification.html) to detect the language of each document.


Let's first create the output folders and download the fasttext model for text language detection:


In [5]:
import os

language_base_output_path = "./curated/04_language_separation"
language_separated_output_path = os.path.join(language_base_output_path, "language")

# Create directories (with parents as needed)
os.makedirs(language_base_output_path, exist_ok=True)
os.makedirs(language_separated_output_path, exist_ok=True)


In [6]:
language_separated_output_path

'./curated/04_language_separation/language'

Let's create the filter which uses the downloaded fasttext model.

In [None]:
# Download the fastText language classification model (this should already be done in the course environment)
# !wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -P {language_separated_output_path}

In [7]:
from nemo_curator import ScoreFilter
from nemo_curator.filters import FastTextLangId

lang_filter = FastTextLangId("lid.176.bin")
language_field = "language"
language_id_pipeline = ScoreFilter(
    lang_filter, score_field=language_field, score_type="object"
)

Now, let's apply the language detection filter on our multilingual dataset. 

In [8]:
# Apply language separation to our multilingual dataset
filtered_dataset = language_id_pipeline(multilingual_dataset)

Let's check the detected language for each sample. 

Notice the new field `language` in the output, which holds the classification score and the language code (`FR`, `EN`, or `ES`).

In [10]:
# check the detected language per item
filtered_dataset.head(3)

Unnamed: 0,file_name,text,timestamp,url,language
0,file.json,Dragon Ball: Le 20e film de la sage sortira le...,2019-01-21 03:52:10,https://cultinfos.com/buzz/332814-dragon-ball-...,"[0.9175292253494263, FR]"
1,file.json,Cours D'histoire Des États Européens: Depuis L...,2019-01-17 23:25:39,https://www.bookvoed.ru/book?id=1433688,"[0.5166642069816589, FR]"
2,file.json,Se realizó una jornada de promoción del buentr...,2018-04-21 07:38:28,http://www.desarrollosocial.gob.ar/noticias/se...,"[0.9740189909934998, ES]"


Let's separate the documents by language label and save each language separately. This will create one sub-folder per language under the output path.


In [11]:
# Save separated languages and get stats
from nemo_curator.utils.file_utils import separate_by_metadata

filtered_dataset.df[language_field] = filtered_dataset.df[language_field].apply(
    lambda score: score[1], meta=(language_field, "object")
)
language_stats = separate_by_metadata(
    filtered_dataset.df, language_separated_output_path, metadata_field=language_field
).compute()

In [12]:
# check the language distribution stats
print(f"Number of document:{len(multilingual_dataset)}")
print(f"Number of filtered document:{len(filtered_dataset)}")

print("Language separation stats and  ", language_stats)

Number of document:400
Number of filtered document:396
Language separation stats and   {'FR': 194, 'ES': 194, 'EN': 8}
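
To make the two steps above concrete, here is a minimal pure-Python sketch of the idea: drop documents whose language score falls below a threshold, then count the remaining documents per language. The sample predictions, the `MIN_SCORE` value, and the helper `separate_by_language` are illustrative assumptions for this sketch; they are not NeMo Curator APIs, and the threshold is not necessarily the one used by `FastTextLangId`.

```python
from collections import defaultdict

# Hypothetical (doc_id, score, language) predictions, mimicking the `language` field above.
predictions = [
    ("doc-0", 0.92, "FR"),
    ("doc-1", 0.52, "FR"),
    ("doc-2", 0.97, "ES"),
    ("doc-3", 0.10, "EN"),  # low-confidence prediction
]

MIN_SCORE = 0.3  # assumed threshold, chosen only for this sketch


def separate_by_language(rows, min_score):
    """Group document IDs by language, skipping low-confidence predictions."""
    groups = defaultdict(list)
    for doc_id, score, lang in rows:
        if score >= min_score:
            groups[lang].append(doc_id)
    return dict(groups)


groups = separate_by_language(predictions, MIN_SCORE)
stats = {lang: len(ids) for lang, ids in groups.items()}
print(stats)
```

In the run above, 396 of the 400 documents remain after filtering, which is consistent with a few low-confidence documents being dropped in a similar way.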


We can check the output jsonl file per language.

In [13]:
# check first element for French
! head -n 1 {language_separated_output_path}/FR/file.jsonl |jq

/bin/bash: line 1: jq: command not found


In [14]:
# check first element for Spanish
! head -n 1 {language_separated_output_path}/ES/file.jsonl |jq

/bin/bash: line 1: jq: command not found


## 2.2 Domain Classification

NeMo Curator supports various text classification models for annotating data, which is useful for data cleaning and blending. See the documentation on [distributed data classification](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/distributed_data_classification/README.md).


Each classifier is available on the Hugging Face Hub. When run with NeMo Curator, the classifiers are accelerated using the RAPIDS [CrossFit](https://github.com/rapidsai/crossfit) library.


In this section, we will experiment with the `MultilingualDomainClassifier`, which supports 52 languages and annotates each document with one of 26 domain classes:

`Arts_and_Entertainment`, `Autos_and_Vehicles`, `Adult`, `Beauty_and_Fitness`, `Books_and_Literature`, `Business_and_Industrial`, `Computers_and_Electronics`, `Finance`, `Food_and_Drink`, `Games`, `Health`, `Hobbies_and_Leisure`, `Home_and_Garden`, `Internet_and_Telecom`, `Jobs_and_Education`, `Law_and_Government`, `News`, `Online_Communities`, `People_and_Society`, `Pets_and_Animals`, `Real_Estate`, `Science`, `Sensitive_Subjects`, `Shopping`, `Sports`, `Travel_and_Transportation`

The model is a transformer-based encoder, DeBERTa V3 Base, available on the Hugging Face Hub. Learn more in the [MultilingualDomainClassifier model card](https://huggingface.co/nvidia/multilingual-domain-classifier).


Let's set the output folder for domain classification.

In [19]:
import cudf
import dask_cudf
from nemo_curator.classifiers import MultilingualDomainClassifier

domain_output_path = "./curated/05_domain_classification"

# Create directory (with parents if needed)
os.makedirs(domain_output_path, exist_ok=True)

First, let's apply the Multilingual Domain Classifier on a toy multilingual dataset. Let's create the dataset with multiple languages and topics.

In [21]:
# Create sample DataFrame
text = [
    # French
    "Il adore les chats.",
    # English
    "Investing in index funds is a popular strategy for long-term financial growth.",
    # Spanish
    "Ir de compras en el centro comercial es una excelente manera de encontrar ofertas y descubrir nuevas tiendas.",
    # Polish
    "Dzięki wykorzystaniu analizy danych programy treningowe dla sportowców stały się bardziej wyrafinowane.",
    # Arabic
    ".تقدم التطورات الحديثة في العلاج الجيني أملاً جديدًا لعلاج الاضطرابات الوراثية",
]
df = cudf.DataFrame({"text": text})

toy_dataset = DocumentDataset(dask_cudf.from_cudf(df, npartitions=1))

RuntimeError: CUDA error at: /tmp/pip-build-env-i5c_dpkz/normal/lib/python3.12/site-packages/librmm/include/rmm/cuda_stream_view.hpp:106: cudaErrorMemoryAllocation out of memory

We can define the `MultilingualDomainClassifier` filter as follows. 

On its first run, it will download the DeBERTa model from the Hugging Face Hub.

In [22]:
# create the classifier
domain_classifier = MultilingualDomainClassifier(batch_size=1024)

AcceleratorError: CUDA error: out of memory
Search for `cudaErrorMemoryAllocation' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


Now, let's run the classifier on our multilingual, multi-topic toy samples.

In [23]:
%%time
result_domain = domain_classifier(dataset=toy_dataset)

CPU times: user 7 μs, sys: 0 ns, total: 7 μs
Wall time: 10.7 μs


NameError: name 'domain_classifier' is not defined

2026-01-02 17:17:03,995 - tornado.application - ERROR - Exception in callback <bound method SystemMonitor.update of <SystemMonitor: cpu: 4 memory: 1506 MB fds: 31>>
Traceback (most recent call last):
  File "/home/aibeceles/lession_01_wsl/nemo_env_clean/lib/python3.12/site-packages/tornado/ioloop.py", line 945, in _run
    val = self.callback()
          ^^^^^^^^^^^^^^^
  File "/home/aibeceles/lession_01_wsl/nemo_env_clean/lib/python3.12/site-packages/distributed/system_monitor.py", line 210, in update
    gpu_metrics = nvml.real_time()
                  ^^^^^^^^^^^^^^^^
  File "/home/aibeceles/lession_01_wsl/nemo_env_clean/lib/python3.12/site-packages/distributed/diagnostics/nvml.py", line 370, in real_time
    "utilization": _get_utilization(h),
                   ^^^^^^^^^^^^^^^^^^^
  File "/home/aibeceles/lession_01_wsl/nemo_env_clean/lib/python3.12/site-packages/distributed/diagnostics/nvml.py", line 339, in _get_utilization
    return pynvml.nvmlDeviceGetUtilizationRates(h).gpu
 

Check the outputs. Notice the new field `domain_pred`. Example of expected outputs: 
```
Il adore les chats.	                                Pets_and_Animals
Investing in index funds is a popular strategy...	Finance
Ir de compras en el centro comercial es una ex...	Shopping
Dzięki wykorzystaniu analizy danych programy t...	Sports
.تقدم التطورات الحديثة في العلاج الجيني أملاً ...	        Health
...
```

In [None]:
# check the results
result_domain.head()

Now, let's use the `MultilingualDomainClassifier` to process our previously filtered multilingual corpus (French and Spanish).

In [None]:
# load the filtered data
from nemo_curator.datasets import DocumentDataset

multilingual_data_path = "./curated/01_clean_and_unify"
multilingual_dataset = DocumentDataset.read_json(multilingual_data_path, backend="cudf")

# Domain classification
multilingual_result_domain = domain_classifier(dataset=multilingual_dataset)

Let's check the output. We expect to see an additional field `domain_pred`:
```
text                                            		domain_pred
Dragon Ball: Le 20e film de la sage sortira le...		Arts_and_Entertainment
Cours D'histoire Des États Européens: Depuis L...		Books_and_Literature
Se realizó una jornada de promoción del buentr...		People_and_Society
...
```

Execute the following cell to review the topic predictions:


In [None]:
# check the domain classification
multilingual_result_domain.head()

Let's now save the output.

In [None]:
# save
result_domain.to_json(domain_output_path)

We can check the saved outputs by executing the next cell:

In [None]:
! head -n 1 {domain_output_path}/0.part | jq

## 2.3 Deduplication

Document-level deduplication aims to reduce the occurrence of duplicate and near-duplicate documents in a dataset. This is crucial for cleaning datasets, reducing redundancy, and ensuring that models are trained on diverse and unique data.

In this section, we will explore both exact and fuzzy deduplication. Both are supported in NeMo Curator and accelerated using the [RAPIDS](https://rapids.ai/) library.


Remember, we created our multilingual (Spanish and French) dataset by duplicating each sample once.
Before running deduplication, we need to ensure that each document in the dataset has a unique ID. We can use the `AddId` module within NeMo Curator to accomplish this.

In [None]:
# create output folders
from nemo_curator import AddId

data_dir = "curated/06_add_id"
added_id_output_path = os.path.join(data_dir, "add_id/cleaned")
!mkdir -p {data_dir}

dataset_fr = DocumentDataset.read_json(
    os.path.join(language_separated_output_path, "FR/"), add_filename=True
)
dataset_es = DocumentDataset.read_json(
    os.path.join(language_separated_output_path, "ES/"), add_filename=True
)

### 2.3.1 Add Unique ID

Let's start by adding a unique ID to each document in our per-language datasets (Spanish and French).

We will apply `AddId` to the French corpus. The output ID format will be `<prefix>-<id>`, where `prefix` is specified by the user and `id` is a uniquely generated, zero-padded number.


Example of expected output:
```
text	                                         		id
Dragon Ball: Le 20e film de la sage sortira le...		FR_data-0000000000
Cours D'histoire Des États Européens: Depuis L...		FR_data-0000000001
...
```
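
As a rough illustration of the ID format shown in the example output, the following sketch builds IDs from a prefix and a running index. The helper `make_ids` is hypothetical, and the 10-digit zero padding is inferred from the example output above rather than taken from the library's documentation.

```python
def make_ids(prefix, count, start_index=0):
    """Return IDs shaped like the example output: <prefix>-<10-digit index>."""
    return [f"{prefix}-{i:010d}" for i in range(start_index, start_index + count)]


ids = make_ids("FR_data", 3)
print(ids)
```

Using a different prefix per language (for example `FR_data` and `ES_data`) keeps IDs unique even if the corpora are later merged.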

Execute the following cell to apply `AddId` to the French corpus. The user prefix is set to `FR_data`:

In [None]:
%%time
# Define user's prefix
FR_add_ID_id_prefix = "FR_data"

add_id = AddId(id_field="id", id_prefix=FR_add_ID_id_prefix, start_index=0)
id_dataset_fr = add_id(dataset_fr)

Let's check the outputs. Notice the new field `id`.

In [None]:
# check outputs
id_dataset_fr.head(3)

We can save the outputs in their designated folder.

In [None]:
id_dataset_fr.to_json(os.path.join(added_id_output_path, "FR/"), write_to_filename=True)

#### Exercise: Add Unique ID for Spanish data.
Make sure to replace the `# Your code here`. If you get stuck, refer to the solution below.

In [None]:
ES_add_ID_id_prefix = # Your code here

add_id = AddId(id_field="id", id_prefix=ES_add_ID_id_prefix, start_index=0)
id_dataset_es = # Your code here

# save to relevant folder
id_dataset_es.to_json(os.path.join(added_id_output_path, "ES/"), write_to_filename=True)

In [None]:
# solution
ES_add_ID_id_prefix = "ES_data"

add_id = AddId(id_field="id", id_prefix=ES_add_ID_id_prefix, start_index=0)
id_dataset_es = add_id(dataset_es)

# save to relevant folder
id_dataset_es.to_json(os.path.join(added_id_output_path, "ES/"), write_to_filename=True)

### 2.3.2 Exact Deduplication

Exact deduplication consists of identifying and removing documents that are exactly identical within a dataset. This process helps eliminate redundant data, prevents models from overfitting on repeated examples, and ensures that training and test sets do not contain the same samples, which could otherwise lead to misleading evaluation metrics.

In [NeMo Curator](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html), exact deduplication works by hashing each document and keeping only one document per hash. It can run on both GPU ([cuDF](https://docs.rapids.ai/api/cudf)) and CPU ([pandas](https://pandas.pydata.org/)) backends.
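
The hashing idea described above can be sketched in a few lines of plain Python. This is only a CPU illustration using `hashlib.md5` on toy documents; the helper `find_exact_duplicates` and the sample data are assumptions for this sketch and do not reflect how NeMo Curator implements the step internally.

```python
import hashlib

# Toy documents; the third one is an exact copy of the first.
docs = [
    {"id": "ES_data-0000000000", "text": "Hola mundo"},
    {"id": "ES_data-0000000001", "text": "Buenos días"},
    {"id": "ES_data-0000000002", "text": "Hola mundo"},
]


def find_exact_duplicates(documents):
    """Return the IDs of documents whose text hash was already seen."""
    seen = set()
    to_remove = []
    for doc in documents:
        digest = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
        if digest in seen:
            to_remove.append(doc["id"])  # later copy: mark for removal
        else:
            seen.add(digest)  # first occurrence: keep it
    return to_remove


to_remove = find_exact_duplicates(docs)
deduplicated = [doc for doc in docs if doc["id"] not in to_remove]
print(to_remove)
```

Because only identical texts produce identical hashes, a single changed character is enough for two documents to be treated as distinct. Catching those cases is the job of fuzzy deduplication.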


Let's create the folders for exact deduplication. We will save the output results in the `data` subfolder, temporary files in `cache`, and logs in `log`.


In [None]:
data_dir_es = "curated/07_Deduplicate/exact/ES"

exact_dedup_log_dir_es = os.path.join(data_dir_es, "log")
exact_dedup_cache_dir_es = os.path.join(data_dir_es, "cache")
exact_dedup_output_dir_es = os.path.join(data_dir_es, "data")

# Create all required directories
os.makedirs(exact_dedup_log_dir_es, exist_ok=True)
os.makedirs(exact_dedup_cache_dir_es, exist_ok=True)
os.makedirs(exact_dedup_output_dir_es, exist_ok=True)

Before running exact deduplication in NeMo Curator, each document (sample) in the dataset needs a unique ID. We already added these unique IDs in the previous step, in the `"id"` field.

We will run exact deduplication on the GPU using the cuDF backend.

In [None]:
id_field = "id"
input_dataset_es = DocumentDataset.read_json(
    os.path.join(added_id_output_path, "ES/"), backend="cudf", add_filename=True
)

Execute the next cell to run the exact deduplication on the Spanish dataset. This should take about 10 seconds to process.

We can use `perform_removal=True` to apply the duplicate removal directly on the dataset. For the sake of this exercise, however, we will first identify the duplicates and only then apply the removal.

In [None]:
%%time
from nemo_curator.modules import ExactDuplicates

# run exact deduplication
exact_dup_es = ExactDuplicates(
    logger=exact_dedup_log_dir_es,
    id_field="id",
    text_field="text",
    hash_method="md5",
    cache_dir=exact_dedup_cache_dir_es,
)
duplicates_es = exact_dup_es(dataset=input_dataset_es)
exact_docs_to_remove_es = duplicates_es.df.map_partitions(
    lambda x: x[x._hashes.duplicated(keep="first")]
)

Check how many documents were detected as duplicates:

In [None]:
print(f"Number of documents in the original data:{len(input_dataset_es)}")
print(f"Number of documents to be removed:{len(exact_docs_to_remove_es)}")

Check some duplicate documents: 

Example of output: 
```
     id                  _hashes
18   ES_data-0000000146  2f610eed57653fbe68328fbaf3274c2a
20   ES_data-0000000148  e473009ec2e1a246de93fea08488ca4c
21   ES_data-0000000149  066347c8a96bc73056a9f172e4d9710
```

In [None]:
exact_docs_to_remove_es.head(3)

Now, apply the deduplication removal and save the results to the output data folder.

In [None]:
result_es = input_dataset_es.df[
    ~input_dataset_es.df[id_field].isin(exact_docs_to_remove_es[id_field].compute())
]
DocumentDataset(result_es).to_json(exact_dedup_output_dir_es, write_to_filename=True)

Check the saved output file.

In [None]:
! head -n 1 {exact_dedup_output_dir_es}/file.jsonl |jq

#### Exercise: Run Exact Deduplication for the French data.

Run the same exact deduplication for the French data. 

Let's first create the relevant folders and set the dataset and id field.

In [None]:
data_dir_fr = "curated/07_Deduplicate/exact/FR"

exact_dedup_log_dir_fr = os.path.join(data_dir_fr, "log")
exact_dedup_cache_dir_fr = os.path.join(data_dir_fr, "cache")
exact_dedup_output_dir_fr = os.path.join(data_dir_fr, "data")
!mkdir -p {exact_dedup_log_dir_fr}
!mkdir -p {exact_dedup_cache_dir_fr}
!mkdir -p {exact_dedup_output_dir_fr}

id_field = "id"
input_dataset_fr = DocumentDataset.read_json(
    os.path.join(added_id_output_path, "FR/"), backend="cudf", add_filename=True
)

Run the deduplication. Make sure to replace the `# Your code here`. If you get stuck, refer to the solution below.

In [None]:
# run exact deduplicate
exact_dup_fr = # Your code here
duplicates_fr = # Your code here
exact_docs_to_remove_fr = # Your code here

In [None]:
# solution
# run exact deduplication
exact_dup_fr = ExactDuplicates(
    logger=exact_dedup_log_dir_fr,
    id_field="id",
    text_field="text",
    hash_method="md5",
    cache_dir=exact_dedup_cache_dir_fr,
)

duplicates_fr = exact_dup_fr(dataset=input_dataset_fr)
exact_docs_to_remove_fr = duplicates_fr.df.map_partitions(
    lambda x: x[x._hashes.duplicated(keep="first")]
)

Check how many documents were detected as duplicates:

In [None]:
print(f"Number of documents in the original data:{len(input_dataset_fr)}")
print(f"Number of documents to be removed:{len(exact_docs_to_remove_fr)}")

Now, apply the deduplication removal and save the results to the output data folder.

In [None]:
result_fr = input_dataset_fr.df[
    ~input_dataset_fr.df[id_field].isin(exact_docs_to_remove_fr[id_field].compute())
]
DocumentDataset(result_fr).to_json(exact_dedup_output_dir_fr, write_to_filename=True)

In [None]:
client.cluster.close()
client.shutdown()

In [None]:
import IPython

IPython.Application.instance().kernel.do_shutdown(True)  # automatically restarts kernel

### 2.3.3 Fuzzy Deduplication

Fuzzy deduplication removes near-duplicates, that is, documents whose content is almost but not exactly identical. Near-duplicates are identified at the document level using Jaccard similarity scores.
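
To give a feel for the Jaccard similarity mentioned above, here is a small sketch that compares documents through their sets of character n-grams. The helpers `char_ngrams` and `jaccard`, the sample sentences, and the n-gram size of 5 are assumptions made for this illustration.

```python
def char_ngrams(text, n):
    """Return the set of character n-grams in `text`."""
    return {text[i : i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumped over the lazy dog"
doc_c = "an entirely different sentence about cooking"

n = 5  # small n-gram size, chosen because these strings are short
sim_near = jaccard(char_ngrams(doc_a, n), char_ngrams(doc_b, n))
sim_far = jaccard(char_ngrams(doc_a, n), char_ngrams(doc_c, n))
print(f"near-duplicate pair: {sim_near:.2f}")
print(f"unrelated pair:      {sim_far:.2f}")
```

A score close to 1 means the two documents share most of their n-grams, while a score close to 0 means they share almost none. Computing this for every pair of documents does not scale to large corpora, which is the motivation for approximating it.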

This approach can be broken down into the following stages:
- **Stage 1 - MinHash + LSH:** The first step generates MinHash signatures for the documents. NeMo Curator currently supports character-based n-grams for MinHashing. Locality Sensitive Hashing (LSH) is then performed to identify candidate duplicates.
- **Stage 2 - LSH Buckets to Graph edgelist:** LSH buckets are directly converted to edges for the connected components computation.
- **Stage 3 - Connected Components:** Since LSH is an approximate method, documents that are near duplicates may end up in different buckets, with some overlapping documents between them. A GPU-accelerated connected components algorithm is used to identify all connected components in the graph formed by the edges between documents within the same bucket. The output of this step is a list of document IDs and the groups they belong to.

All documents within the same group are considered near duplicates, and results can then be used to remove them from the corpus.
For more information, refer to the Deduplication documentation of [NeMo Curator](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html).
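
The three stages above can be sketched on the CPU in plain Python. This is a simplified illustration under several assumptions: the seeded `blake2b` hash stands in for a real MinHash hash family, a union-find replaces the GPU-accelerated connected components step, and all helper names and parameter values are hypothetical rather than those used by NeMo Curator.

```python
import hashlib
from collections import defaultdict


def char_ngrams(text, n):
    """Return the set of character n-grams in `text`."""
    return {text[i : i + n] for i in range(len(text) - n + 1)}


def stable_hash(seed, ngram):
    """Deterministic 64-bit hash of an n-gram under a given seed (stand-in hash family)."""
    data = f"{seed}:{ngram}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(ngrams, num_hashes):
    """Stage 1a: for each seeded hash function, keep the minimum hash over all n-grams."""
    return [min(stable_hash(seed, g) for g in ngrams) for seed in range(num_hashes)]


def lsh_buckets(signatures, num_buckets, hashes_per_bucket):
    """Stage 1b: documents with an identical band of their signature share a bucket."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for band in range(num_buckets):
            start = band * hashes_per_bucket
            key = (band, tuple(sig[start : start + hashes_per_bucket]))
            buckets[key].append(doc_id)
    return buckets


def buckets_to_edges(buckets):
    """Stage 2: link every document in a bucket to the first document of that bucket."""
    edges = []
    for members in buckets.values():
        edges.extend((members[0], other) for other in members[1:])
    return edges


def connected_components(doc_ids, edges):
    """Stage 3: union-find over the edge list; returns a doc_id -> group label mapping."""
    parent = {d: d for d in doc_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    return {d: find(d) for d in doc_ids}


docs = {
    "doc-0": "the quick brown fox jumps over the lazy dog",
    "doc-1": "the quick brown fox jumped over the lazy dog",  # near-duplicate of doc-0
    "doc-2": "an entirely different sentence about cooking",
}

NUM_BUCKETS = 20
HASHES_PER_BUCKET = 2  # deliberately small for short texts; not the notebook's setting

signatures = {
    doc_id: minhash_signature(char_ngrams(text, 5), NUM_BUCKETS * HASHES_PER_BUCKET)
    for doc_id, text in docs.items()
}
buckets = lsh_buckets(signatures, NUM_BUCKETS, HASHES_PER_BUCKET)
edges = buckets_to_edges(buckets)
groups = connected_components(list(docs), edges)
print(groups)
```

Documents that end up with the same group label are treated as near-duplicates. Using more hashes per bucket makes a bucket match stricter, while using more buckets gives similar documents more chances to collide.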


There are no near-duplicates in our example datasets. However, to demonstrate the process, let's run fuzzy deduplication on the French dataset and go through the steps involved.

Let's first create the output folder.

In [None]:
import os

fuzzy_dedup_log_dir_fr = "curated/07_Deduplicate/fuzzy_wrapper/FR"
os.makedirs(fuzzy_dedup_log_dir_fr, exist_ok=True)

data_dir = "curated/06_add_id"
added_id_output_path = os.path.join(data_dir, "add_id/cleaned")
os.makedirs(added_id_output_path, exist_ok=True)  # Creates "curated/06_add_id/add_id/cleaned"

Let's start the Dask client. Make sure that you have stopped the previous one before proceeding.

In [None]:
from dask.distributed import Client
from nemo_curator.utils.import_utils import gpu_only_import, gpu_only_import_from

cudf = gpu_only_import("cudf")
dask_cudf = gpu_only_import("dask_cudf")
LocalCUDACluster = gpu_only_import_from("dask_cuda", "LocalCUDACluster")

cluster = LocalCUDACluster(n_workers=1)
client = Client(cluster)

In [None]:
os.environ["CUDF_SPILL"] = "on"

We will use the `FuzzyDuplicates` class from NeMo Curator to run the fuzzy deduplication process on the French dataset. This will allow us to identify and handle any near-duplicates based on similarity scores.

You should see the three stages logged during the process.

In [None]:
fuzzy_dedup_log_dir_fr = "curated/07_Deduplicate/fuzzy_wrapper/FR"

data_dir = "curated/06_add_id"
added_id_output_path = os.path.join(data_dir, "add_id/cleaned")
input_fr = os.path.join(added_id_output_path, "FR/file.jsonl")

In [None]:
from nemo_curator import FuzzyDuplicates, FuzzyDuplicatesConfig
from nemo_curator.datasets import DocumentDataset

config = FuzzyDuplicatesConfig(
    cache_dir=fuzzy_dedup_log_dir_fr,  # must be cleared between runs
    id_field="id",
    text_field="text",
    seed=42,
    char_ngrams=24,
    num_buckets=20,
    hashes_per_bucket=13,
    use_64_bit_hash=False,
    buckets_per_shuffle=2,
    false_positive_check=False,
)


# Initialize the deduplication object
FuzzyDups = FuzzyDuplicates(config=config, logger="./")

# load the dataset
dataset_fr = DocumentDataset.read_json(
    input_files=input_fr,
    backend="cudf",  # FuzzyDuplicates only supports datasets with the cuDF backend.
)

# run Fuzzy Duplicate
duplicate_docs = FuzzyDups(dataset_fr)

The result from the connected components stage is a list of document IDs and the group they belong to. All documents in the same group are considered near duplicates. 

```
id	                group
FR_data-0000000062	46
FR_data-0000000013	47
FR_data-0000000104	160
FR_data-0000000185	161
FR_data-0000000155	65
...
```
Let's check the outputs. Notice the `group` field.

In [None]:
duplicate_docs.head(3)

These groups can then be used to remove the near-duplicates from the corpus.

Let's run that by executing the next cell.

In [None]:
docs_to_remove = duplicate_docs.df.map_partitions(
    lambda x: x[x.group.duplicated(keep="first")]
)
result = dataset_fr.df[~dataset_fr.df["id"].isin(docs_to_remove["id"].compute())]
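
The cell above keeps the first document of each group and marks the rest for removal. The same logic can be sketched in plain Python on made-up data; the `(id, group)` pairs and variable names below are hypothetical and only mimic the shape of the connected components output.

```python
# Hypothetical (id, group) pairs, shaped like the connected components output.
duplicate_docs = [
    ("FR_data-0000000001", 7),
    ("FR_data-0000000005", 7),
    ("FR_data-0000000009", 7),
    ("FR_data-0000000002", 12),
    ("FR_data-0000000008", 12),
]
all_ids = [f"FR_data-{i:010d}" for i in range(10)]

seen_groups = set()
ids_to_remove = set()
for doc_id, group in duplicate_docs:
    if group in seen_groups:
        ids_to_remove.add(doc_id)  # not the first member of its group
    else:
        seen_groups.add(group)  # first member: keep it

kept_ids = [doc_id for doc_id in all_ids if doc_id not in ids_to_remove]
print(f"removed: {sorted(ids_to_remove)}")
print(f"kept {len(kept_ids)} of {len(all_ids)} documents")
```

One caveat worth keeping in mind: in the notebook cell the `duplicated` check runs inside `map_partitions`, so it is applied to each partition separately. With a single small file this makes no difference, but on a dataset split across several partitions, members of the same group that land in different partitions would each be kept.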

Check how many documents were flagged as near-duplicates:

In [None]:
print(f"Number of documents in the original data : {len(dataset_fr)}")
print(f"Number of documents to be removed : {len(docs_to_remove)}")

#### [Optional] Explore further Deduplication on downstream tasks

Large Language Models are typically evaluated based on their performance on downstream tasks using unseen test data. However, when working with extensive datasets, there is a risk of test data leaking into the model's training set. 

To mitigate this, NeMo Curator provides a decontamination strategy that removes from the training set any document sections that also appear in downstream tasks.

You can explore this in more detail in the [task decontamination](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/taskdecontamination.html) section of the NeMo Curator documentation.

---
<h2 style="color:green;">Congratulations!</h2>


In this notebook, you have used NeMo Curator to apply several data cleaning steps, including language detection and filtering, topic classification and document deduplication. These steps help ensure that the dataset is clean, diverse, and free from redundant data, improving the quality of the data used for training and evaluation.

Before moving on to the next notebook, make sure to stop the Dask cluster. Please run the next cell.

In [None]:
import IPython

IPython.Application.instance().kernel.do_shutdown(True)  # automatically restarts kernel

Move to the next notebook to explore synthetic data generation with NeMo Curator. This will allow us to learn how to create artificial data for various tasks, enhancing the diversity and richness of our dataset.

Let's move to the [synthetic_data_generation](03_synthetic_data_generation.ipynb).

