## Processing High-Quality Vietnamese Data: Viettel’s Success with NVIDIA NeMo Curator

Open-source [large language models (LLMs)](https://www.nvidia.com/en-us/glossary/large-language-models/) excel in English but struggle with other languages, especially in Southeast Asia. This is primarily due to a lack of training data in these languages, limited understanding of local cultures, and insufficient tokens to capture unique linguistic structures and expressions. To fully meet customer needs, enterprises in non-English-speaking countries must go beyond generic models and customize them to capture the nuances of their local languages, ensuring a seamless and impactful customer experience.

In this tutorial, we will use NeMo Curator to process high-quality [Vietnamese data](https://huggingface.co/datasets/VTSNLP/vietnamese_curated_dataset). We will guide you through the data curation pipeline used and share sample code for each stage.

## Table of Contents
- **1. [Prerequisites and Environment setups](#prerequisites-and-environment-setups)**
- **2. [Data Collecting](#data-collecting)**
- **3. [Data Curation flow](#data-curation-flow)**
    - a. [Unicode reformatting](#unicode-reformatting)
    - b. [Adding Custom IDs to Documents](#adding-custom-ids-to-documents)
    - c. [Exact deduplication](#exact-deduplication)
    - d. [Heuristic Quality Filtering](#heuristic-quality-filtering)
    - e. [Classifier-based Quality Filtering](#classifier-based-quality-filtering)

## Prerequisites and Environment setups

Install NeMo Curator by following the instructions to install the CPU and CUDA-accelerated modules in the README file of the [NeMo Curator repository](https://github.com/NVIDIA/NeMo-Curator/tree/main). 

Next, install these additional packages:

In [1]:
!pip install datasets
!pip install jsonlines


To proceed with data processing, we need to set up a Dask environment. Dask is a flexible, open-source library that enables parallel and distributed computing in Python, allowing us to scale computations across multiple cores or even clusters. By distributing tasks, Dask makes the data handling process significantly faster and more efficient.

**Note:** This notebook was run on a single DGX A100 GPU, with a 128-core CPU and 2TB of RAM to handle the dataset size. Depending on your dataset and computing resources, you may need to adjust the Dask worker configuration below accordingly.

In [None]:
from dask.distributed import Client, LocalCluster

# Start a Dask cluster with 12 workers, each limited to 64GB of memory.
# You might need to adjust these numbers according to your computing resources.
cluster = LocalCluster(n_workers=12, processes=True, memory_limit="64GB")
client = Client(cluster)
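
If you are unsure how to adapt the worker configuration to a different machine, the sketch below shows one possible rule of thumb. The helper `suggest_worker_config` and its `memory_fraction` default are hypothetical and chosen only for illustration; they are not part of Dask or NeMo Curator.

```python
def suggest_worker_config(total_memory_gb, cpu_cores, per_worker_gb=64, memory_fraction=0.375):
    """Hypothetical helper: choose a worker count and per-worker memory limit.

    Workers together may use at most `memory_fraction` of total RAM,
    and the worker count never exceeds the number of CPU cores.
    """
    budget_gb = total_memory_gb * memory_fraction
    n_workers = int(max(1, min(cpu_cores, budget_gb // per_worker_gb)))
    # Shrink the per-worker limit on small machines so it fits the budget
    limit_gb = int(min(per_worker_gb, budget_gb // n_workers))
    return {"n_workers": n_workers, "memory_limit": f"{limit_gb}GB"}


config = suggest_worker_config(total_memory_gb=2048, cpu_cores=128)
print(config)  # {'n_workers': 12, 'memory_limit': '64GB'}
```

The resulting values can be passed to `LocalCluster(n_workers=..., memory_limit=...)`. Treat the output as a starting point and monitor the Dask dashboard to tune it.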

## Data Collecting

Each dataset is downloaded from the Hugging Face Hub. OSCAR (the Vietnamese subset of version 23.01, an aggregation of web-crawled data) requires additional steps because access to it is restricted: you need to accept the conditions on the [dataset page](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301) and use a [Hugging Face access token](https://huggingface.co/docs/hub/en/security-tokens) for downloading.

**Download and Convert Datasets to Parquet**

Converting the datasets to Parquet format makes large datasets more efficient to store, read, and process.

In [3]:
import os
from datasets import load_dataset as load_hf_dataset
from datasets import DownloadConfig 

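data_dir = "./datasets/"
os.makedirs(data_dir, exist_ok=True)  # Make sure the output directory exists
download_config = DownloadConfig(num_proc=4)

# Load and save Vietnamese Wikipedia dataset
# For a faster run, you can process only the Wikipedia dataset and skip the other downloads below
# (in that case, adjust the file list in the combine step accordingly).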
ds = load_hf_dataset("wikimedia/wikipedia", "20231101.vi")
ds["train"].to_parquet(os.path.join(data_dir, "wiki_vi_231101.parquet"))

# Load and save Vietnamese news corpus
ds = load_hf_dataset("jetaudio/binhvq_news")
ds["train"].to_parquet(os.path.join(data_dir, "binhvq_news_train.parquet"))

# Load and save OSCAR dataset
ds = load_hf_dataset("oscar-corpus/OSCAR-2301", language="vi", token=True, download_config=download_config, trust_remote_code=True)
ds["train"].to_parquet(os.path.join(data_dir, "oscar_vi.parquet"))

# Load and save C4 dataset
ds = load_hf_dataset("allenai/c4", data_files="multilingual/c4-vi.*.json.gz", download_config=download_config, trust_remote_code=True)
ds["train"].to_parquet(os.path.join(data_dir, "c4_vi.parquet"))


**Combine and Standardize Format**

We then combine them into a single dataset, keeping only the "text" column. 

In [4]:
from datasets import concatenate_datasets

# Combine datasets and standardize format
# (the list is named parquet_files to avoid confusion with the `datasets` library)
parquet_files = [os.path.join(data_dir, file) for file in ["wiki_vi_231101.parquet", "c4_vi.parquet", "oscar_vi.parquet", "binhvq_news_train.parquet"]]

data_files = {"train": parquet_files[0]}
ds = load_hf_dataset("parquet", data_files=data_files)
ds = ds["train"].remove_columns([col for col in ds["train"].column_names if col != "text"])

for parquet_file in parquet_files[1:]:
    ds_ = load_hf_dataset("parquet", data_files={"train": parquet_file})
    ds_ = ds_["train"].remove_columns([col for col in ds_["train"].column_names if col != "text"])
    ds = concatenate_datasets([ds, ds_])


**Shard the Combined Dataset**

The combined dataset is then sharded into smaller chunks. Sharding is performed to distribute the data evenly across multiple workers in the Dask cluster, facilitating efficient parallel processing during the data curation stages.
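
To make the idea concrete, here is a minimal sketch of strided sharding on a plain Python list. It is only an illustration: `strided_shard` is a hypothetical helper, and whether `Dataset.shard` splits rows in a strided or contiguous fashion depends on its `contiguous` argument and the library version.

```python
def strided_shard(rows, index, num_shards):
    """Return every `num_shards`-th row, starting at position `index`."""
    return rows[index::num_shards]


rows = list(range(10))
shards = [strided_shard(rows, i, 3) for i in range(3)]
print(shards)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Every row lands in exactly one shard, and shard sizes differ by at most one row, which keeps the workload balanced across workers.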

In [5]:
# Define paths for raw data
raw_data_directory = os.path.join(data_dir, "raw")
os.makedirs(raw_data_directory, exist_ok=True)  # Make sure the output directory exists

# Shard the dataset
num_shards = 256
for shard_idx in range(num_shards):
    shard = ds.shard(index=shard_idx, num_shards=num_shards)
    shard.to_parquet(os.path.join(raw_data_directory, f"{shard_idx}.parquet"))


## Data Curation flow

### Unicode reformatting

Unicode reformatting is an essential preprocessing step to ensure that text data is standardized and free of encoding errors, which are common in web-crawled datasets.
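
One class of problem that matters for Vietnamese is that the same visible character can be stored as different code point sequences. The sketch below uses Python's standard `unicodedata` module to illustrate this; it is not a description of what `UnicodeReformatter` does internally, which may apply different or broader fixes.

```python
import unicodedata

# "tiếng Việt" written with base letters followed by combining marks
decomposed = "tie\u0302\u0301ng Vie\u0323\u0302t"

# NFC normalization merges each base letter and its marks into one code point
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 14 10
print(decomposed == composed)          # False, although both render the same
```

Without normalization, the two spellings would be treated as different strings by deduplication and tokenization.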

In [8]:
from nemo_curator import Modify
from nemo_curator.modifiers import UnicodeReformatter
from nemo_curator.utils.distributed_utils import read_data, write_to_disk
from nemo_curator.utils.file_utils import get_all_files_paths_under
from nemo_curator.datasets import DocumentDataset

# Define paths for Unicode formatted data
unicode_formatted_output_path = os.path.join(data_dir, "formatted")

# Load the raw data
def load_dataset(input_data_dir, file_type="parquet"):
    files = list(get_all_files_paths_under(input_data_dir))
    raw_data = read_data(files, file_type=file_type, backend="pandas", add_filename=True)
    dataset = DocumentDataset(raw_data)

    return dataset

raw_data = load_dataset(raw_data_directory, file_type="parquet")

# Initialize the Unicode reformatter
cleaner = Modify(UnicodeReformatter())

# Apply Unicode reformatting
cleaned_data = cleaner(raw_data)

# Save the cleaned data to disk
write_to_disk(cleaned_data.df, unicode_formatted_output_path, write_to_filename=True, output_type="parquet")

Reading 256 files
Writing to disk complete for 256 partitions


### Adding Custom IDs to Documents

Before proceeding with further curation steps, add a unique ID to each document. These IDs make it possible to track documents across stages and to identify the ones flagged as duplicates or low quality.

NeMo Curator offers an `AddId` class, which allows users to insert custom IDs into documents using a specified prefix format, such as `<prefix>_<id>`. 
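
As a rough illustration of prefix-based IDs, the sketch below builds identifiers from a prefix and a running index. The helper `make_ids`, the underscore separator, and the zero-padding width are assumptions made for this example; the exact string that `AddId` produces may differ.

```python
def make_ids(num_docs, prefix, start_index=0, width=10):
    """Hypothetical helper: build IDs such as 'VI_0000000000'."""
    return [f"{prefix}_{i:0{width}d}" for i in range(start_index, start_index + num_docs)]


ids = make_ids(3, "VI")
print(ids)  # ['VI_0000000000', 'VI_0000000001', 'VI_0000000002']
```

Zero-padding keeps the IDs the same length, so they sort lexicographically in the same order as the numeric index.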

In [None]:
from nemo_curator import AddId

# Define paths for input data and output with added IDs
add_id_input_data_dir = unicode_formatted_output_path
added_id_output_path = os.path.join(data_dir, "add_id")
add_ID_id_prefix = "VI_"

# Load the formatted dataset
dataset = DocumentDataset.read_parquet(add_id_input_data_dir)

# Initialize the AddId class with a specified prefix and start index
add_id = AddId(id_field="id", id_prefix=add_ID_id_prefix, start_index=0)

# Apply the ID addition to the dataset
id_dataset = add_id(dataset)

# Save the dataset with added IDs to disk
write_to_disk(id_dataset.df, output_path=added_id_output_path, write_to_filename=True, output_type="parquet")

Reading 256 files
Writing to disk complete for 256 partitions


### Exact deduplication

Exact deduplication removes identical duplicates from the dataset. By eliminating exact duplicates, we ensure that each data point contributes uniquely to the training process, enhancing the diversity and overall quality of the dataset.
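
The underlying idea can be sketched in a few lines of plain Python: hash each document's text and keep only the first document seen for each hash. This is a simplified single-machine illustration with a hypothetical `exact_dedup` helper, not the distributed implementation used by NeMo Curator.

```python
import hashlib


def exact_dedup(docs):
    """Keep the first (doc_id, text) pair for every distinct MD5 hash of the text."""
    seen_hashes = set()
    kept = []
    for doc_id, text in docs:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # identical text was already kept under another ID
        seen_hashes.add(digest)
        kept.append((doc_id, text))
    return kept


docs = [("a", "Xin chào"), ("b", "Tạm biệt"), ("c", "Xin chào")]
print(exact_dedup(docs))  # [('a', 'Xin chào'), ('b', 'Tạm biệt')]
```

Because only byte-identical texts share a hash, documents that differ by even one character are all kept; catching those requires fuzzy deduplication.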

In this stage, we’ll leverage GPU acceleration by utilizing a Dask CUDA cluster. Since the current cluster is CPU-based, we need to shut it down and start a new one with GPU support.

To close the existing cluster:


In [10]:
client.cluster.close()
client.shutdown()

Then, to initialize the GPU Dask cluster:

In [None]:
from nemo_curator.utils.distributed_utils import get_client

def pre_imports():
    import cudf 

client = get_client(cluster_type="gpu", set_torch_to_use_rmm=False)
client.run(pre_imports)

{'tcp://127.0.0.1:38425': None}

**Below is the implementation for exact deduplication:**

Imports and directory preparation:

In [13]:
import os
from nemo_curator.modules import ExactDuplicates
from nemo_curator.datasets import DocumentDataset

# Define input and output paths
exact_dedup_input_dataset_dir = added_id_output_path
exact_dedup_base_output_path = os.path.join(data_dir, "exact_dedup")
exact_dedup_log_dir = os.path.join(exact_dedup_base_output_path, "log")
exact_dedup_output_dir = os.path.join(exact_dedup_base_output_path, "data")
deduped_output_dir = os.path.join(data_dir, "remove_duplicate")

# Create directories for logs and output
!mkdir -p {exact_dedup_log_dir}
!mkdir -p {exact_dedup_output_dir}
!mkdir -p {deduped_output_dir}

Set parameters and load dataset:

In [14]:
# Parameters for ExactDuplicates
exact_dedup_dataset_id_field = "id"
exact_dedup_dataset_text_field = "text"

# Load the input dataset
input_dataset = DocumentDataset.read_parquet(exact_dedup_input_dataset_dir, backend="cudf")

Reading 256 files


Initialize and run deduplication:

In [None]:
# Initialize and run exact deduplication
exact_dup = ExactDuplicates(
    logger=exact_dedup_log_dir,
    id_field=exact_dedup_dataset_id_field,
    text_field=exact_dedup_dataset_text_field,
    hash_method="md5",
    cache_dir=exact_dedup_output_dir,
)
duplicates = exact_dup(dataset=input_dataset)

print(f"Number of exact duplicate documents: {len(duplicates)}")

Number of exact duplicate documents: 751


Remove duplicates and save final dataset:

In [None]:
# Load the dataset and exact duplicates to identify and remove duplicate IDs
input_dataset = DocumentDataset.read_parquet(added_id_output_path, backend="cudf")
exact_duplicates = DocumentDataset.read_parquet(
    os.path.join(exact_dedup_output_dir, "_exact_duplicates.parquet"), backend="cudf"
)

# Extract list of duplicate document IDs
exact_docs_to_remove = exact_duplicates.df.map_partitions(
    lambda x: x[x._hashes.duplicated(keep="first")]
)

# Remove duplicated documents from the input dataset
result = input_dataset.df[
    ~input_dataset.df[exact_dedup_dataset_id_field].isin(exact_docs_to_remove[exact_dedup_dataset_id_field].compute())
]

# Save the final deduplicated dataset
write_to_disk(result, output_path=deduped_output_dir, write_to_filename=True, output_type="parquet")

Reading 256 files
Reading 1 files
Writing to disk complete for 256 partitions


Close the GPU Dask cluster:

In [17]:
client.cluster.close()
client.shutdown()

### Heuristic Quality Filtering

Heuristic quality filtering is designed to enhance the quality of the dataset by removing low-quality content based on predefined heuristics. This approach involves applying a series of filters to the dataset to eliminate undesirable data characteristics such as excessive special characters, overly short or long texts, or other criteria that could negatively impact model performance.

We define the heuristic filters in a YAML file; the configuration is available [here](https://raw.githubusercontent.com/NVIDIA/NeMo-Curator/main/config/heuristic_filter_non-en.yaml). The file lists the filtering criteria and thresholds used to build the filter pipeline, and you can customize the filters or change the thresholds to suit your needs. The `build_filter_pipeline` helper reads the YAML settings and returns a pipeline that applies each filter to the dataset in sequence.
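
To show what a chain of heuristic filters does conceptually, here is a minimal sketch in plain Python. The two filters, their names, and their thresholds are hypothetical examples and do not reproduce the contents of the YAML configuration.

```python
def word_count_ok(text, min_words=5, max_words=1000):
    """Reject texts that are too short or too long (word counts are assumed thresholds)."""
    return min_words <= len(text.split()) <= max_words


def symbol_ratio_ok(text, max_ratio=0.3):
    """Reject texts where too many characters are neither letters, digits, nor whitespace."""
    if not text:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_ratio


FILTERS = [word_count_ok, symbol_ratio_ok]


def apply_filters(docs, filters=FILTERS):
    """Keep a document only if it passes every filter, checked in order."""
    return [doc for doc in docs if all(f(doc) for f in filters)]


docs = [
    "Hà Nội là thủ đô của Việt Nam.",  # passes both filters
    "Quá ngắn",                        # too few words
    "@@@ ### $$$ %%% ^^^ &&&",         # mostly symbols
]
kept = apply_filters(docs)
print(kept)  # ['Hà Nội là thủ đô của Việt Nam.']
```

Each filter is an independent predicate, so adding, removing, or re-tuning one does not affect the others.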

Recreate a CPU Dask cluster:

In [18]:
# Start a Dask cluster with 12 workers, each limited to 64GB of memory.
# You might need to adjust these numbers according to your computing resources.

cluster = LocalCluster(n_workers=12, processes=True, memory_limit="64GB")
client = Client(cluster)

In [19]:
from nemo_curator.utils.config_utils import build_filter_pipeline
import warnings

# Define paths for input data and output data after heuristic filtering
HF_input_data_dir = deduped_output_dir
HF_output_path = os.path.join(data_dir, "heuristic_filtering")

# Create a directory for the configuration file if it doesn't exist
os.makedirs("config", exist_ok=True)
# Download the YAML configuration file for heuristic filtering
!wget https://raw.githubusercontent.com/NVIDIA/NeMo-Curator/main/config/heuristic_filter_non-en.yaml -O ./config/heuristic_filter_non-en.yaml

# Specify the path to the configuration file
filter_config_file = "./config/heuristic_filter_non-en.yaml"
os.makedirs(HF_output_path, exist_ok=True)

# Load the filters from the YAML configuration file
filter_pipeline = build_filter_pipeline(filter_config_file)

# Load the dataset
dataset = DocumentDataset.read_parquet(HF_input_data_dir, backend="pandas")

# Suppress specific warnings during filtering
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    # Apply the heuristic filters to the dataset
    result_data = filter_pipeline(dataset)
 
    # Save the filtered dataset to disk
    result_data.to_parquet(HF_output_path, write_to_filename=True)

Reading 256 files
Writing to disk complete for 256 partitions


### Classifier-based Quality Filtering

Classifier-based filtering uses a trained classifier model to sort content as high or low quality, offering a smarter and more flexible way to handle diverse datasets that simple rules might miss.

**Prepare Data for Training Classifier**

To train a quality classifier, we need representative samples of both high-quality and low-quality content. For high-quality data, we use articles from Wikipedia's Vietnamese edition, which are generally well-structured and reliable. The low-quality samples come from an unfiltered, crawled Vietnamese news corpus.

In [None]:
import os
from datasets import load_dataset as load_hf_dataset

In [21]:
# Paths for high-quality and low-quality sample data
hq_samples_path = os.path.join(data_dir, "classifier_filtering/train_samples/hq")
lq_samples_path = os.path.join(data_dir, "classifier_filtering/train_samples/lq")
# Make sure the output directories exist
os.makedirs(hq_samples_path, exist_ok=True)
os.makedirs(lq_samples_path, exist_ok=True)

# Load and shard the high-quality dataset
ds = load_hf_dataset("wikimedia/wikipedia", "20231101.vi")
num_shards = 8
for shard_idx in range(num_shards):
    shard = ds["train"].shard(index=shard_idx, num_shards=num_shards)
    shard.to_parquet(os.path.join(hq_samples_path, f"{shard_idx}.parquet"))

# Load and shard the low-quality dataset (first 100,000 documents only)
ds = load_hf_dataset("vietgpt/binhvq_news_vi", split="train[:100000]")
num_shards = 32
for shard_idx in range(num_shards):
    shard = ds.shard(index=shard_idx, num_shards=num_shards)
    shard.to_parquet(os.path.join(lq_samples_path, f"{shard_idx}.parquet"))


**Training Classifier**

The classifier is trained using FastText, which offers an efficient and effective method for text classification. 
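
FastText's supervised mode reads a plain-text file with one example per line, where each line starts with a label carrying the `__label__` prefix. The sketch below shows that format with a hypothetical `to_fasttext_line` helper. Collapsing newlines inside a document is a precaution assumed here so that one document stays on one line; it is not necessarily what `FastTextLabelModifier` does.

```python
def to_fasttext_line(text, label):
    """Hypothetical helper: format one document as a single FastText training line."""
    # Collapse newlines and repeated whitespace so the example fits on one line
    flat_text = " ".join(text.split())
    return f"{label} {flat_text}"


line = to_fasttext_line("Dòng một.\nDòng hai.", "__label__hq")
print(line)  # __label__hq Dòng một. Dòng hai.
```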

In [None]:
from nemo_curator import Modify
from nemo_curator.utils.distributed_utils import write_to_disk
from nemo_curator.datasets import DocumentDataset

In [26]:
from nemo_curator.modifiers import FastTextLabelModifier
import fasttext
import random

# Function to create labeled samples
def create_samples(data_path, label, num_samples):
    raw_dataset = DocumentDataset.read_parquet(data_path, backend="pandas")
    label_quality = Modify(FastTextLabelModifier(label))
    labeled_dataset = label_quality(raw_dataset)
    labeled_samples = labeled_dataset.df.sample(frac=num_samples / len(labeled_dataset.df))
    
    return labeled_samples["text"].compute().values.tolist()

# Prepare training data
low_quality_samples = create_samples(lq_samples_path, "__label__lq", 100000)
high_quality_samples = create_samples(hq_samples_path, "__label__hq", 100000)
train_samples = low_quality_samples + high_quality_samples
random.shuffle(train_samples)

# Save training data to a file
train_file = "./cf_model_fasttext.train"
with open(train_file, "w", encoding="utf-8") as f:
    for sample in train_samples:
        f.write(sample + "\n")

# Train the FastText classifier
model = fasttext.train_supervised(input=train_file, lr=0.01, dim=100, epoch=5, wordNgrams=2)
model_path = "./cf_model_fasttext_model.bin"
model.save_model(model_path)

Reading 32 files
Reading 8 files


Read 24M words
Number of words:  843245
Number of labels: 2
Progress: 100.0% words/sec/thread:   36600 lr:  0.000000 avg.loss:  0.052146 ETA:   0h 0m 0s


**Classify and Filter the Dataset**

Once trained, the classifier is used to filter the dataset, categorizing documents into high and low quality based on the learned distinctions.
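
One way to turn a quality score into a keep-or-drop decision is stochastic thresholding, where higher-scoring documents are kept with higher probability but low-scoring documents are not excluded entirely. The sketch below illustrates that idea with a Pareto-distributed draw. It is an illustration only: the `keep_document` function and the `alpha` value are assumptions, and this tutorial does not state which decision rule `FastTextQualityFilter` applies.

```python
import random


def keep_document(score, alpha=3.0, rng=random):
    """Keep a document if a random Pareto draw exceeds (1 - score).

    `score` is the classifier's probability that the document is high quality.
    """
    # paretovariate returns values >= 1, so subtract 1 to start the draw at 0
    draw = rng.paretovariate(alpha) - 1.0
    return draw > 1.0 - score


rng = random.Random(42)  # fixed seed so the comparison is reproducible
kept_high = sum(keep_document(0.9, rng=rng) for _ in range(2000))
kept_low = sum(keep_document(0.1, rng=rng) for _ in range(2000))
print(f"kept with score 0.9: {kept_high} / 2000")
print(f"kept with score 0.1: {kept_low} / 2000")
```

A fixed cutoff (for example, keeping documents with a score above 0.5) is a simpler alternative; the stochastic variant trades some precision for more diversity in the retained data.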

In [None]:
from nemo_curator.filters import FastTextQualityFilter
from nemo_curator import ScoreFilter

# Define paths and load the dataset
CF_input_data_dir = HF_output_path
CF_output_path = os.path.join(data_dir, "classifier_filtering/output")
target_dataset = DocumentDataset.read_parquet(CF_input_data_dir, backend="pandas")

# Set up the filtering pipeline
filter_pipeline = ScoreFilter(FastTextQualityFilter(model_path), score_field="quality_score", score_type=float)
filtered_dataset = filter_pipeline(target_dataset)

# Save the filtered dataset
write_to_disk(filtered_dataset.df, output_path=CF_output_path, write_to_filename=True, output_type="parquet")

Reading 256 files
Writing to disk complete for 256 partitions


Close the CPU Dask cluster:

In [33]:
client.cluster.close()
client.shutdown()

We have completed the notebook! For other techniques, such as fuzzy deduplication or PII redaction, see the [NeMo Curator example scripts](https://github.com/NVIDIA/NeMo-Curator/tree/main/examples).