fix: use data_dir for directory paths in ShardedDataset #1301

Merged
yeyu-nvidia merged 1 commit into main from yeyu/fix-sharded-dataset-data-dir
Apr 20, 2026

@yeyu-nvidia yeyu-nvidia commented Apr 20, 2026

Summary

  • datasets' resolve_pattern only matches entries with type=="file", so passing a bare directory path as data_files to load_dataset results in FileNotFoundError even when the directory exists on disk
  • Detect directory paths in ShardedDataset._load_dataset() and pass them via data_dir instead of data_files
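
The routing described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the helper name `build_load_kwargs` is hypothetical, and it only builds the keyword arguments that would be forwarded to `load_dataset`, so it runs without the `datasets` package installed.

```python
import os


def build_load_kwargs(data_files=None):
    """Hypothetical helper: pick data_dir or data_files for load_dataset.

    A single path that is an existing directory goes to data_dir;
    file paths, glob patterns, lists, and dicts stay in data_files.
    """
    kwargs = {}
    # Only a single str / PathLike value can be a bare directory path.
    if isinstance(data_files, (str, os.PathLike)) and os.path.isdir(data_files):
        kwargs["data_dir"] = data_files
    elif data_files is not None:
        kwargs["data_files"] = data_files
    # data_files=None adds neither argument.
    return kwargs
```

The result would then be passed along as `load_dataset("json", **build_load_kwargs(path))`. The `isinstance` guard is a design choice in this sketch: it avoids calling `os.path.isdir` on a list or dict of files.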

Reproduction

from datasets import load_dataset
# This fails with FileNotFoundError:
load_dataset("json", data_files="/path/to/data_directory")
# This works:
load_dataset("json", data_dir="/path/to/data_directory")
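
To try the reproduction without an existing dataset, the following sketch writes a few JSONL shards into a directory and then makes both calls. The function names, file names, and row contents are made up for illustration; `try_both_arguments` needs the `datasets` package and reports whichever outcome the installed version produces rather than assuming one.

```python
import json
import os
import tempfile


def write_jsonl_shards(root, num_shards=2, rows_per_shard=3):
    """Write small JSONL shard files into `root` and return their paths."""
    paths = []
    for shard in range(num_shards):
        path = os.path.join(root, f"shard_{shard}.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            for row in range(rows_per_shard):
                f.write(json.dumps({"id": shard * rows_per_shard + row}) + "\n")
        paths.append(path)
    return paths


def try_both_arguments(root):
    """Call load_dataset with data_files=<dir>, then with data_dir=<dir>."""
    from datasets import load_dataset  # imported lazily; requires `datasets`

    try:
        load_dataset("json", data_files=root)
        print("data_files=<dir>: loaded")
    except FileNotFoundError as exc:
        print("data_files=<dir>: FileNotFoundError:", exc)

    dataset = load_dataset("json", data_dir=root)
    print("data_dir=<dir>:", dataset)
```

A typical invocation creates a temporary directory, fills it with `write_jsonl_shards(root)`, and then calls `try_both_arguments(root)` inside a `with tempfile.TemporaryDirectory() as root:` block.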

Test plan

  • Verify existing EAGLE3/DFlash training pipelines that pass directory paths work
  • Verify file paths and glob patterns still work (they fall through to data_files)
  • Verify data_files=None (no data_files arg) still works

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes

  • Fixed an issue with dataset loading that prevented proper handling of directory-based data sources. Directories are now correctly detected and processed during dataset initialization.

@yeyu-nvidia yeyu-nvidia requested a review from a team as a code owner April 20, 2026 18:08
@yeyu-nvidia yeyu-nvidia requested a review from realAsma April 20, 2026 18:08

coderabbitai Bot commented Apr 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 00f7c386-7d25-43cb-abfb-8b2025d1ece7

📥 Commits

Reviewing files that changed from the base of the PR and between 97d1531 and 45e866e.

📒 Files selected for processing (1)
  • modelopt/torch/utils/plugins/transformers_dataset.py

Walkthrough

The ShardedDataset._load_dataset() method is updated to detect directory paths in self.data_files and pass them via the data_dir parameter instead of data_files when calling load_dataset.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Dataset loading parameter handling: modelopt/torch/utils/plugins/transformers_dataset.py | Modified _load_dataset() to conditionally route directory paths to data_dir parameter and file paths to data_files parameter when invoking load_dataset(). |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Security Anti-Patterns | ❌ Error | The pull request contains two # nosec comments in lines 176-177 of transformers_dataset.py, which directly violates the security coding policy that explicitly prohibits # nosec comments. | Remove the # nosec comments from lines 176-177 and follow the official exception process with justification and approval from @NVIDIA/modelopt-setup-codeowners. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and concisely describes the main change: using data_dir parameter for directory paths in ShardedDataset, which is the core fix in this PR. |

datasets' resolve_pattern only matches entries with type=="file",
so passing a bare directory path as data_files results in
FileNotFoundError even when the directory exists on disk. Detect
directory paths and pass them via data_dir instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
@yeyu-nvidia yeyu-nvidia force-pushed the yeyu/fix-sharded-dataset-data-dir branch from cc6f899 to 45e866e April 20, 2026 18:10
@yeyu-nvidia yeyu-nvidia enabled auto-merge (squash) April 20, 2026 18:12

github-actions Bot commented Apr 20, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-20 19:10 UTC

codecov Bot commented Apr 20, 2026

Codecov Report

❌ Patch coverage is 60.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.69%. Comparing base (97d1531) to head (45e866e).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...delopt/torch/utils/plugins/transformers_dataset.py | 60.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1301      +/-   ##
==========================================
+ Coverage   75.39%   75.69%   +0.30%     
==========================================
  Files         462      462              
  Lines       49955    49960       +5     
==========================================
+ Hits        37662    37817     +155     
+ Misses      12293    12143     -150     
| Flag | Coverage Δ |
|---|---|
| examples | 41.56% <60.00%> (+0.85%) ⬆️ |
| gpu | 58.59% <0.00%> (-0.55%) ⬇️ |
| regression | 14.85% <60.00%> (+0.06%) ⬆️ |
| unit | 52.39% <0.00%> (-0.02%) ⬇️ |

@yeyu-nvidia yeyu-nvidia merged commit 289a239 into main Apr 20, 2026
46 checks passed
@yeyu-nvidia yeyu-nvidia deleted the yeyu/fix-sharded-dataset-data-dir branch April 20, 2026 19:10
@yeyu-nvidia yeyu-nvidia added the cherry-pick-0.44.0 After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc label Apr 20, 2026