Megatron tokenization pipeline #1259

asolergi-nv · 2025-11-20T18:59:05Z

Description

This PR adds the MegatronTokenizerWriter, which tokenizes documents and writes the .bin and .idx files required for training with Megatron and its data-loading solution.

The .bin file contains the tokenized documents. We will use 4 bytes per token if the vocabulary size is greater than 2**16; otherwise, we’ll use 2 bytes per token.
The .idx file contains metadata about the .bin file, mainly the number of tokenized documents and their lengths. More details about this can be found in the close method of MegatronTokenizerWriter.
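
As a rough illustration of the 2-byte versus 4-byte rule described above, the sketch below picks a NumPy dtype from the vocabulary size and serializes a list of token IDs. The function names are hypothetical, and the use of `uint16` and `int32` as the concrete dtypes is an assumption; the actual writer may be organized differently.

```python
import numpy as np


def token_dtype(vocab_size: int) -> np.dtype:
    """Pick the on-disk token dtype from the vocabulary size.

    Rule from the PR description: 4 bytes per token if the vocabulary
    size is greater than 2**16, otherwise 2 bytes per token.
    The concrete dtypes (int32 / uint16) are assumptions for this sketch.
    """
    return np.dtype(np.int32) if vocab_size > 2**16 else np.dtype(np.uint16)


def encode_tokens(token_ids: list[int], vocab_size: int) -> bytes:
    """Serialize token IDs using the dtype selected for this vocabulary."""
    return np.asarray(token_ids, dtype=token_dtype(vocab_size)).tobytes()
```

With GPT-2's 50,257-entry vocabulary this yields 2 bytes per token, so three tokens occupy 6 bytes; with any vocabulary larger than 2**16 the same three tokens occupy 12 bytes.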

At first, I tried creating a CompositeStage using TokenizerStage, but as we already discussed, TokenizerStage caused OOM issues. To address this, I added a batch_size argument that controls how many documents we tokenize at once, write to disk, and then immediately discard.
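
The batching idea can be sketched as follows: documents are consumed in groups of `batch_size`, each group is tokenized and appended to the .bin file, and only the per-document lengths are kept in memory. This is a minimal sketch, not the PR's implementation; `toy_tokenize` is a hypothetical stand-in for a real tokenizer, and `write_tokens_in_batches` is an illustrative name.

```python
from collections.abc import Iterable, Iterator
from itertools import islice

import numpy as np


def batched(iterable: Iterable, n: int) -> Iterator[tuple]:
    """Batch an iterable into tuples of at most n items."""
    if n < 1:
        raise ValueError("n must be at least 1")
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        yield batch


def toy_tokenize(text: str) -> list[int]:
    """Hypothetical tokenizer: one token per character (illustration only)."""
    return [ord(ch) % 256 for ch in text]


def write_tokens_in_batches(docs: Iterable[str], bin_path: str, batch_size: int) -> list[int]:
    """Tokenize docs batch by batch, append tokens to bin_path, return doc lengths."""
    lengths: list[int] = []
    with open(bin_path, "wb") as f:
        for batch in batched(docs, batch_size):
            for doc in batch:
                ids = np.asarray(toy_tokenize(doc), dtype=np.uint16)
                f.write(ids.tobytes())
                lengths.append(len(ids))
            # The batch and its token arrays are released before the next
            # iteration, so peak memory scales with batch_size, not the dataset.
    return lengths
```

The returned lengths are the kind of per-document metadata that the .idx file would record once all batches have been written.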

I’ve also included the tokenizer-test folder, which contains the test.sh script I used to verify that the produced files match those created by Megatron’s preprocess_data.py script. To run the checks, you only need to set the DATA_ROOT folder in the script and execute it; it will clone Megatron, start the Ray server, download and convert the TinyStories dataset to JSONL, tokenize the dataset with 8 different configurations, and finally confirm that the files generated by Curator and the Megatron script match.
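
The final comparison step can be approximated with a byte-level check like the one below. This is only a sketch of the idea, not the contents of test.sh; the function names and the prefix-based file layout are assumptions.

```python
import hashlib


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def outputs_match(prefix_a: str, prefix_b: str) -> bool:
    """Check that <prefix>.bin and <prefix>.idx are byte-identical for both prefixes."""
    return all(
        file_digest(prefix_a + ext) == file_digest(prefix_b + ext)
        for ext in (".bin", ".idx")
    )
```

Hashing in chunks keeps memory use flat even when the .bin files are large.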

We run this validation with 4 different tokenizers, adding to the dataset one sample that contains every tokenizer-specific special token, and toggling the append_eod config (also present in the Megatron script); 4 tokenizers times 2 append_eod settings gives the 8 configurations. Of these 4 tokenizers, GPT-2 uses 2 bytes per token since its vocabulary size is ≤ 2**16.

I’m now writing unit tests similar to the ones in tests/stages/text/io/writer/test_jsonl.py. I’d like to know whether:

- I should write any specific documentation.
- I should include a tutorial. Perhaps I could add an option to use the MegatronTokenizerWriter in tutorials/text/tinystories/main.py. What do you think?

Let me know your thoughts!

Usage

# Define the processing stages
stages = [
      # Read the data from the JSONL files
      JsonlReader(
          file_paths="data/raw",
          fields="text",
      ),
      # Tokenize the data
      MegatronTokenizerWriter(
          path="data/tokenized-dataset",
          model_identifier="unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
          append_eod=True,
          text_field="text",
      ),
]

# Create a pipeline with the stages
pipeline = Pipeline(
      name="megatron-tokenize",
      description="Tokenize dataset for Megatron-LM.",
      stages=stages,
)

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>

greptile-apps

Additional Comments (2)

nemo_curator/stages/text/io/writer/megatron_tokenizer.py, line 133

logic: If write_idx_data() fails, the .bin file remains on disk without its .idx metadata file. Consider wrapping both file writes in a single try-except, or moving the .idx write into the existing try block, so that both files are cleaned up on failure.

nemo_curator/stages/text/io/writer/utils.py, line 23

syntax: Docstring says "Batch an iterable into lists" but the function returns tuples. Update the docstring to say "tuples" instead of "lists".
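
The cleanup suggested in the first comment could look roughly like the sketch below, where both writes share one try block and either failure removes both files. The helper name and the callback-based signature are hypothetical; this is not the code in megatron_tokenizer.py.

```python
import contextlib
import os
from collections.abc import Callable


def write_bin_and_idx(
    bin_path: str,
    idx_path: str,
    write_bin: Callable[[str], None],
    write_idx: Callable[[str], None],
) -> None:
    """Write the .bin and .idx files; if either write fails, remove both."""
    try:
        write_bin(bin_path)
        write_idx(idx_path)
    except Exception:
        # Remove whatever was written so no orphaned .bin or partial .idx remains.
        for path in (bin_path, idx_path):
            with contextlib.suppress(FileNotFoundError):
                os.remove(path)
        raise
```

Re-raising after cleanup keeps the original error visible to the caller while guaranteeing that a .bin file never survives without its .idx counterpart.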

9 files reviewed, 2 comments


asolergi-nv added 20 commits November 11, 2025 13:31

Add token_size to TokenizerStage metadata

04ee65a

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>

First draft

5c6a5b5

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>

bugs

637589b

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>

First working prototype

63514a0

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>

Guard against missing vocab_size and eos_token_id

53d0298

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>

Merge branch 'NVIDIA-NeMo:main' into megatron_tokenizer

06f9972

Before fixing OOM

a1df6c0

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>

No OOM!

d96fb94

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>

Merge branch 'NVIDIA-NeMo:main' into megatron_tokenizer

205bc4f

Undo tokenizer changes

ef36d4d

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>

Match!

c991006

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>

A bit of cleaning

f97b7db

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>

move batched to utils

2166cc0

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>

v4: Remove document indices list

5dccc22

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>

v5: Larger writes

6d106fe

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>

Ready

146ac2b

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>

Add scripts checks

af8cbc4

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>

Remove comments

52bd4c5

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>

nits

e52a7b3

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>

More nits

300a9b9

Signed-off-by: asolergi-nv <asolergibert@nvidia.com>

copy-pr-bot bot temporarily deployed to test November 20, 2025 18:59 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci November 20, 2025 18:59 Inactive

copy-pr-bot bot temporarily deployed to test January 5, 2026 14:51 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci January 5, 2026 14:51 Inactive

greptile-apps bot reviewed Jan 5, 2026


sarahyurick merged commit 030ce4f into NVIDIA-NeMo:main Jan 5, 2026
47 checks passed


3 participants
