feat: Optimize SFT dataloader with slice-based tokenization and caching by jlee-lila · Pull Request #1695 · NovaSky-AI/SkyRL

jlee-lila · 2026-05-20T17:52:01Z

feat: Optimize SFT dataloader with slice-based tokenization and caching

Summary

Optimizes the SFT dataloader with two major improvements:

Slice-based parallel tokenization - eliminates pickle overhead by having workers load their own data slices
Tokenized dataset caching - persistent NFS-safe caching for reuse across training runs

Performance Improvements

Slice-Based Parallel Tokenization

Scale	Before	After	Speedup
100K examples	383.7s (sequential)	43.2s (16 workers)	8.89x
~940K examples	1711s (old manual MP)	1647.7s (slice-based)	1.04x

Key benefits:

8.89x speedup at 100K scale
2.27x speedup at 1M scale vs sequential
Memory efficient: workers load data directly from HuggingFace
No pickle serialization overhead

Tokenized Dataset Caching

Operation	Time (10K examples)	Speedup
First run (tokenize + save)	21.4s	1.00x
Cache hit (load from disk)	0.4s	56.7x

Key benefits:

56.7x speedup on cache hits
NFS-safe for multi-node training
Automatic cache key based on dataset + tokenization params
Configurable via cache_dir, force_recache, disable_cache

Changes

1. Slice-Based Tokenization (`skyrl/train/sft_trainer.py`)

Added _tokenize_chat_slice_worker() - worker for chat format with slice loading
Added _tokenize_alpaca_slice_worker() - worker for Alpaca format with slice loading
Added _parse_dataset_split() - parses split strings like "train[:100000]" into base split + indices
Updated _load_and_tokenize() - uses slice-based loading when num_workers > 0

How it works:

# Each worker loads its own slice directly from HuggingFace
dataset_slice = load_dataset(dataset_name, split=f"{base_split}[{start_idx}:{end_idx}]")

2. Dataset Caching (`skyrl/train/sft_trainer.py`, `skyrl/train/config/sft_config.py`)

Added _compute_cache_key() - deterministic hash of dataset + tokenization params
Added _get_cache_path(), _load_from_cache(), _save_to_cache() - cache I/O
Updated _load_and_tokenize() - checks cache before tokenizing
Added config fields: cache_dir, force_recache, disable_cache

Cache key includes:

Dataset name + split
Model path (tokenizer identity)
Max length
Messages key, tools key, system key
Train-on-what mode

NFS-safe atomic writes:

# Write to temp file, then atomic rename
temp_path = cache_path + ".tmp"
with open(temp_path, "wb") as f:
    pickle.dump(tokenized, f)
os.rename(temp_path, cache_path)  # Atomic on NFS

Configuration

Slice-Based Tokenization

cfg = SFTConfig(
    num_workers=16,  # Number of parallel workers (0 = sequential)
    ...
)

Caching

cfg = SFTConfig(
    cache_dir="/mnt/nfs/cache/skyrl",  # NFS path for multi-node
    # cache_dir="",  # Default: ~/.cache/skyrl/tokenized_datasets
    force_recache=False,  # Set True to ignore cache
    disable_cache=False,  # Set True to disable caching
    ...
)

Use Cases

Hyperparameter sweeps - tokenize once, reuse across all runs
Multi-node training - share cache via NFS across nodes
Development - instant dataset loading during iteration
Large datasets - amortize tokenization cost across many runs

Testing

Tested with:

10K examples (cache validation)
100K examples (slice-based validation)
~940K examples (1M scale validation)

Test scripts available in commit history.

Breaking Changes

None - all changes are backward compatible. Default behavior unchanged.

Notes

Cache uses pickle format for fast serialization
Cache key is deterministic and collision-resistant (SHA256 hash)
Workers spawn cleanly without Ray fork conflicts
Atomic writes prevent corruption on NFS

Related Issues

Addresses performance concerns with SFT dataset loading at scale.

Adds multiprocessing-based parallel tokenization with slice-based HF loading to eliminate pickle overhead. Includes tokenized dataset caching (pickle) with NFS support for multi-node training. New config options: num_workers, cache_dir, force_recache, disable_cache. Co-Authored-By: SumanthRH <sumanthrh@anyscale.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: jlee-lila <jlee@lila.ai>

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

SumanthRH · 2026-05-22T00:59:25Z

I've made some improvements:

Cleaned up unused code
Simplied default for cache dir
Changed the save format to be HF dataset since the file can get quite large for 1M rows. Note that the load format with SkyRL is still a list, which means we are loading it in memory. This will be fixed soon in a follow up PR

gemini-code-assist

Code Review

This pull request implements parallel tokenization and on-disk caching for SFT datasets to optimize training startup. It introduces configuration settings for worker counts and cache paths, implements multiprocessing worker functions for data slicing, and adds logic to store tokenized datasets in an arrow-backed format. The PR also includes comprehensive tests for the new parallel processing and caching functionality. Feedback points out concurrency risks in the cache-saving process on shared filesystems and suggests a more robust approach for generating cache keys to prevent potential collisions.

gemini-code-assist · 2026-05-22T01:00:22Z

+        temp_path = cache_path + ".tmp"
+        # Clean up any stale temp dir from an interrupted prior run.
+        if os.path.isdir(temp_path):
+            shutil.rmtree(temp_path)
+        dataset.save_to_disk(temp_path)
+        # If a previous cache exists at the final path, drop it before
+        # rename so the swap is the only visible state change.
+        if os.path.isdir(cache_path):
+            shutil.rmtree(cache_path)
+        os.rename(temp_path, cache_path)


The current cache saving logic is not safe for multi-node training on shared filesystems (NFS). \n\n1. temp_path is not unique across processes or nodes, which can lead to data corruption if multiple workers attempt to write to the same cache simultaneously.\n2. The shutil.rmtree(temp_path) at the start of the write process can delete a directory that another process is currently using.\n3. The shutil.rmtree(cache_path) followed by os.rename is not atomic for directories on many systems, creating a race condition where one process might delete the successful output of another.\n\nA safer approach is to use a unique temporary directory and an atomic os.rename, ensuring that concurrent writes do not interfere with each other.

temp_path = f"{cache_path}.tmp.{random.getrandbits(64):x}"\n try:\n dataset.save_to_disk(temp_path)\n if os.path.isdir(cache_path):\n shutil.rmtree(cache_path, ignore_errors=True)\n os.rename(temp_path, cache_path)\n finally:\n if os.path.isdir(temp_path):\n shutil.rmtree(temp_path, ignore_errors=True)

this is a rare scenario with concurrent training runs for the same model and dataset name (i.e same cache key).

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

SumanthRH force-pushed the optimize-sft-tokenization branch 2 times, most recently from a25e19d to c937e47 Compare May 20, 2026 20:52

SumanthRH force-pushed the optimize-sft-tokenization branch from c937e47 to dc25e14 Compare May 20, 2026 20:53

SumanthRH added 4 commits May 20, 2026 21:05

remove unwanted mocking

82086ac

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

simplify cache dir default

b0123db

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

save as a hf dataset

51b55c8

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

x

43dd224

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

SumanthRH marked this pull request as ready for review May 22, 2026 00:57

gemini-code-assist Bot reviewed May 22, 2026

View reviewed changes

x

61f1061

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

SumanthRH approved these changes May 22, 2026

View reviewed changes

x

6a61db2

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

SumanthRH merged commit ccc181e into NovaSky-AI:main May 22, 2026
4 of 5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Optimize SFT dataloader with slice-based tokenization and caching#1695

feat: Optimize SFT dataloader with slice-based tokenization and caching#1695
SumanthRH merged 7 commits into
NovaSky-AI:mainfrom
jlee-lila:optimize-sft-tokenization

jlee-lila commented May 20, 2026

Uh oh!

SumanthRH commented May 22, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 22, 2026

Uh oh!

SumanthRH May 22, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jlee-lila commented May 20, 2026

feat: Optimize SFT dataloader with slice-based tokenization and caching

Summary

Performance Improvements

Slice-Based Parallel Tokenization

Tokenized Dataset Caching

Changes

1. Slice-Based Tokenization (skyrl/train/sft_trainer.py)

2. Dataset Caching (skyrl/train/sft_trainer.py, skyrl/train/config/sft_config.py)

Configuration

Slice-Based Tokenization

Caching

Use Cases

Testing

Breaking Changes

Notes

Related Issues

Uh oh!

SumanthRH commented May 22, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

SumanthRH May 22, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1. Slice-Based Tokenization (`skyrl/train/sft_trainer.py`)

2. Dataset Caching (`skyrl/train/sft_trainer.py`, `skyrl/train/config/sft_config.py`)