
Support Unix shell-style wildcards in dataset filter configuration#1146

Merged
rapids-bot[bot] merged 3 commits into NVIDIA:develop from thepatrickchin:develop on Nov 5, 2025
Conversation

@thepatrickchin
Member

@thepatrickchin thepatrickchin commented Nov 4, 2025

Description

This PR adds support for Unix shell-style wildcards (*, ?, [seq], [!seq]) in dataset filter configurations.

In the example below, dataset entries starting with iproute2_ are included, except those with a three-digit suffix and those with suffixes from 60 to 65:

filter:
  allowlist:
    field:
      id:
        - iproute2_*
  denylist:
    field:
      id:
        - iproute2_???
        - iproute2_6[0-5]
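The configuration above can be checked against the standard library directly. The sketch below is illustrative only (the helper `is_kept` is not part of the toolkit); it applies the same allow/deny patterns with `fnmatch` to show which entry IDs survive:

```python
from fnmatch import fnmatch

# The same patterns as in the YAML config above.
ALLOW = ["iproute2_*"]
DENY = ["iproute2_???", "iproute2_6[0-5]"]


def is_kept(entry_id: str) -> bool:
    """An entry survives if it matches an allowlist pattern and no denylist pattern."""
    allowed = any(fnmatch(entry_id, p) for p in ALLOW)
    denied = any(fnmatch(entry_id, p) for p in DENY)
    return allowed and not denied


print(is_kept("iproute2_50"))   # True: allowed, two-digit suffix outside 60-65
print(is_kept("iproute2_62"))   # False: denied by iproute2_6[0-5]
print(is_kept("iproute2_100"))  # False: denied by iproute2_??? (three-digit suffix)
print(is_kept("other_entry"))   # False: not matched by the allowlist
```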

By submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features
    • Added Unix shell-style wildcard pattern matching to dataset filters (supports *, ?, and character classes); numeric values are treated as strings for pattern matching.
  • Tests
    • Added comprehensive tests for allowlist, denylist, mixed patterns, single-character and character-class matching, numeric-string handling, and performance of exact-match paths.
  • Documentation
    • Updated filter guidance to note wildcard support.

@thepatrickchin thepatrickchin requested a review from a team as a code owner November 4, 2025 07:49
@coderabbitai

coderabbitai bot commented Nov 4, 2025

Walkthrough

Unix shell-style wildcard matching (fnmatch) was added to DatasetFilter via a new static _match_wildcard_patterns() method, and apply_filters() now uses wildcard-aware matching for allowlist and denylist. Tests were added to validate *, ?, character classes, mixed exact/wildcard entries, and numeric handling.

Changes

  • Wildcard filtering implementation — src/nat/eval/dataset_handler/dataset_filter.py
    Added static method _match_wildcard_patterns(series, patterns) using fnmatch; imported fnmatch; updated apply_filters() to apply a wildcard-aware allowlist (keep matches) and denylist (remove matches); extended the class docstring to note wildcard support.
  • Wildcard filtering tests — tests/nat/eval/dataset_handler/test_dataset_filter.py
    Added tests covering *, ?, character classes ([...]), mixed exact and wildcard entries, numeric values handled as strings, and a regression/performance test ensuring exact-match behavior when no wildcards are present (uses the existing sample_df fixture).

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant DatasetFilter.apply_filters
    participant DatasetFilter._match_wildcard_patterns
    participant fnmatch

    Caller->>DatasetFilter.apply_filters: apply_filters(df, allowlist, denylist)
    alt allowlist present
        DatasetFilter.apply_filters->>DatasetFilter._match_wildcard_patterns: (series, allowlist)
        DatasetFilter._match_wildcard_patterns->>fnmatch: fnmatch.fnmatch(str(value), pattern) per value
        fnmatch-->>DatasetFilter._match_wildcard_patterns: boolean masks
        DatasetFilter._match_wildcard_patterns-->>DatasetFilter.apply_filters: combined mask (any pattern)
        DatasetFilter.apply_filters->>DatasetFilter.apply_filters: df = df[mask]  (keep matching rows)
    end
    alt denylist present
        DatasetFilter.apply_filters->>DatasetFilter._match_wildcard_patterns: (series, denylist)
        DatasetFilter._match_wildcard_patterns->>fnmatch: fnmatch.fnmatch(str(value), pattern) per value
        fnmatch-->>DatasetFilter._match_wildcard_patterns: boolean masks
        DatasetFilter._match_wildcard_patterns-->>DatasetFilter.apply_filters: combined mask (any pattern)
        DatasetFilter.apply_filters->>DatasetFilter.apply_filters: df = df[~mask] (remove matching rows)
    end
    DatasetFilter.apply_filters-->>Caller: filtered DataFrame
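The flow in the diagram can be sketched with pandas. This is a hypothetical reconstruction under the behavior described above, not the toolkit's actual code; the function and variable names are assumptions. It also mirrors the list-comprehension approach the review notes as the fix for the Ruff B023 loop-variable-capture warning:

```python
from fnmatch import fnmatch

import pandas as pd


def match_wildcard_patterns(series: pd.Series, patterns: list[str]) -> pd.Series:
    """Return a boolean mask that is True where str(value) matches any pattern."""
    mask = pd.Series(False, index=series.index)
    for pattern in patterns:
        # A list comprehension (rather than a lambda) avoids binding the loop
        # variable late, which is what B023 warns about.
        mask |= pd.Series([fnmatch(str(v), pattern) for v in series], index=series.index)
    return mask


df = pd.DataFrame({"id": ["iproute2_50", "iproute2_62", "other"]})
df = df[match_wildcard_patterns(df["id"], ["iproute2_*"])]        # allowlist: keep matches
df = df[~match_wildcard_patterns(df["id"], ["iproute2_6[0-5]"])]  # denylist: drop matches
print(df["id"].tolist())  # ["iproute2_50"]
```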

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Review src/nat/eval/dataset_handler/dataset_filter.py for correct string conversion and handling of None/non-string numeric values when calling fnmatch.
  • Verify combined mask logic (any pattern per row) and that allowlist/denylist semantics are preserved.
  • Run and inspect tests/nat/eval/dataset_handler/test_dataset_filter.py to ensure edge cases (*, ?, [...]), mixed entries, and numeric-string behavior are covered.

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)
  • Description Check: ✅ Passed — check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed — the title clearly and accurately describes the main change: adding Unix shell-style wildcard support to dataset filter configuration.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f97a47d and 71a0daf.

📒 Files selected for processing (2)
  • src/nat/eval/dataset_handler/dataset_filter.py (4 hunks)
  • tests/nat/eval/dataset_handler/test_dataset_filter.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/nat/eval/dataset_handler/test_dataset_filter.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.py: In code comments use the abbreviations: nat (API namespace/CLI), nvidia-nat (package), NAT (env var prefixes); never use these abbreviations in documentation
Follow PEP 20 and PEP 8 for Python style
Run yapf with column_limit=120; yapf is used for formatting (run second)
Indent with 4 spaces (no tabs) and end each file with a single trailing newline
Use ruff (ruff check --fix) as a linter (not formatter) per pyproject.toml; fix warnings unless explicitly ignored
Respect Python naming schemes: snake_case for functions/variables, PascalCase for classes, UPPER_CASE for constants
When re-raising exceptions, use bare raise to preserve stack trace; log with logger.error(), not logger.exception()
When catching and logging without re-raising, use logger.exception() to capture full stack trace
Provide Google-style docstrings for every public module, class, function, and CLI command
Docstring first line must be a concise description ending with a period
Validate and sanitize all user input, especially in web or CLI interfaces
Prefer httpx with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use async/await for I/O-bound work (HTTP, DB, file I/O)
Cache expensive computations with functools.lru_cache or an external cache when appropriate
Leverage NumPy vectorized operations when beneficial and feasible

Files:

  • src/nat/eval/dataset_handler/dataset_filter.py
{src/**/*.py,packages/*/src/**/*.py}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

{src/**/*.py,packages/*/src/**/*.py}: All importable Python code must live under src/ or packages//src/
All public APIs must have Python 3.11+ type hints on parameters and return values
Prefer typing/collections.abc abstractions (e.g., Sequence over list)
Use typing.Annotated for units or metadata when useful
Treat pyright warnings as errors during development

Files:

  • src/nat/eval/dataset_handler/dataset_filter.py
src/nat/**/*

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

src/nat/**/* contains core functionality; changes should prioritize backward compatibility

Files:

  • src/nat/eval/dataset_handler/dataset_filter.py

⚙️ CodeRabbit configuration file

This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.

Files:

  • src/nat/eval/dataset_handler/dataset_filter.py
{**/*.py,**/*.sh,**/*.md,**/*.toml,**/*.y?(a)ml,**/*.json,**/*.txt,**/*.ini,**/*.cfg,**/*.ipynb}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

{**/*.py,**/*.sh,**/*.md,**/*.toml,**/*.y?(a)ml,**/*.json,**/*.txt,**/*.ini,**/*.cfg,**/*.ipynb}: Every file must start with the standard SPDX Apache-2.0 header
Confirm copyright years are up to date when a file is changed
All source files must include the SPDX Apache-2.0 header template (copy from an existing file)

Files:

  • src/nat/eval/dataset_handler/dataset_filter.py
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards. - For Python code, follow
    PEP 20 and
    PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values.
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

Documentation Review Instructions
  • Verify that documentation and comments are clear and comprehensive.
  • Verify that the documentation doesn't contain any TODOs, FIXMEs, or placeholder text like "lorem ipsum".
  • Verify that the documentation doesn't contain any offensive or outdated terms.
  • Verify that documentation and comments are free of spelling mistakes. Ensure the documentation doesn't contain any words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file; words that might appear to be spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Misc.
  • All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0 and should contain an Apache License 2.0 header comment at the top of each file.
  • Confirm that copyright years are up to date whenever a file is changed.

Files:

  • src/nat/eval/dataset_handler/dataset_filter.py
🧠 Learnings (1)
📚 Learning: 2025-10-22T22:02:12.903Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2025-10-22T22:02:12.903Z
Learning: Applies to **/*.py : Docstring first line must be a concise description ending with a period

Applied to files:

  • src/nat/eval/dataset_handler/dataset_filter.py
🔇 Additional comments (4)
src/nat/eval/dataset_handler/dataset_filter.py (4)

16-17: LGTM!

The fnmatch import is appropriate for implementing Unix shell-style wildcard matching.


29-29: LGTM!

The docstring correctly documents wildcard support with proper reStructuredText escaping.


39-64: LGTM! Clean implementation with proper type hints and resolved static analysis issues.

The method correctly implements wildcard matching using list comprehension to avoid the lambda capture warning. String conversion ensures consistent pattern matching for both string and numeric values.


74-75: LGTM! Filter logic correctly implements wildcard-aware matching.

Both allowlist (keep matches) and denylist (remove matches with ~) use the new wildcard matching method correctly while maintaining backward compatibility with exact string matches.

Also applies to: 81-82



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/nat/eval/dataset_handler/dataset_filter.py (1)

29-29: Consider adding more detail about supported wildcard patterns.

While the current documentation mentions wildcard support, consider expanding it to include examples of the supported patterns (*, ?, [seq], [!seq]) for better developer experience.

Example addition to the docstring:

-        - Supports Unix shell-style wildcards (*, ?, [seq], [!seq]) for string matching
+        - Supports Unix shell-style wildcards for string matching:
+            * `*` matches everything
+            * `?` matches any single character
+            * `[seq]` matches any character in seq
+            * `[!seq]` matches any character not in seq
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1904bb4 and b38a643.

📒 Files selected for processing (2)
  • src/nat/eval/dataset_handler/dataset_filter.py (4 hunks)
  • tests/nat/eval/dataset_handler/test_dataset_filter.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.py, {src/**/*.py,packages/*/src/**/*.py}, src/nat/**/*, {**/*.py,**/*.sh,**/*.md,**/*.toml,**/*.y?(a)ml,**/*.json,**/*.txt,**/*.ini,**/*.cfg,**/*.ipynb}, and **/* — same instructions as listed in the review above, here applied to both src/nat/eval/dataset_handler/dataset_filter.py and tests/nat/eval/dataset_handler/test_dataset_filter.py
{tests/**/*.py,examples/*/tests/**/*.py}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Unit tests live in tests/ or examples/*/tests and use markers defined in pyproject.toml (e.g., integration)

Files:

  • tests/nat/eval/dataset_handler/test_dataset_filter.py
{tests/**/*.py,examples/*/tests/**/*.py,packages/*/tests/**/*.py}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

{tests/**/*.py,examples/*/tests/**/*.py,packages/*/tests/**/*.py}: Use pytest with pytest-asyncio for async code
Test functions should be named with test_ prefix using snake_case
Extract frequently repeated test code into pytest fixtures
Pytest fixtures should define the name argument in @pytest.fixture and the function should be prefixed fixture_ using snake_case
Mock external services with pytest_httpserver or unittest.mock; avoid live endpoints
Mark slow tests with @pytest.mark.slow to skip in default test suite
Mark integration tests requiring external services with @pytest.mark.integration and follow nat-tests guidelines

Files:

  • tests/nat/eval/dataset_handler/test_dataset_filter.py
tests/**/*.py

⚙️ CodeRabbit configuration file

tests/**/*.py: - Ensure that tests are comprehensive, cover edge cases, and validate the functionality of the code. - Test functions should be named using the test_ prefix, using snake_case. - Any frequently repeated code should be extracted into pytest fixtures. - Pytest fixtures should define the name argument when applying the pytest.fixture decorator. The fixture
function being decorated should be named using the fixture_ prefix, using snake_case. Example:
@pytest.fixture(name="my_fixture")
def fixture_my_fixture():
    pass

Files:

  • tests/nat/eval/dataset_handler/test_dataset_filter.py
🧬 Code graph analysis (1)
tests/nat/eval/dataset_handler/test_dataset_filter.py (2)
src/nat/data_models/dataset_handler.py (2)
  • EvalFilterConfig (47-49)
  • EvalFilterEntryConfig (42-44)
src/nat/eval/dataset_handler/dataset_filter.py (2)
  • DatasetFilter (23-83)
  • apply_filters (65-83)
🪛 Ruff (0.14.3)
src/nat/eval/dataset_handler/dataset_filter.py

60-60: Function definition does not bind loop variable pattern_str

(B023)

🔇 Additional comments (3)
src/nat/eval/dataset_handler/dataset_filter.py (2)

16-16: Good choice of library for wildcard matching.

Using fnmatch from the standard library is appropriate for implementing Unix shell-style wildcard patterns with no additional dependencies.


73-74: Wildcard matching integration looks correct.

The integration of wildcard pattern matching into both allowlist and denylist filtering is implemented correctly. The change maintains backward compatibility since exact values (without wildcards) still match as expected with fnmatch.

Also applies to: 80-81

tests/nat/eval/dataset_handler/test_dataset_filter.py (1)

107-192: Excellent and comprehensive test coverage!

The test suite thoroughly validates the wildcard functionality with coverage for:

  • Basic wildcard patterns (*) in both allowlist and denylist
  • Single-character wildcards (?)
  • Character sets ([seq])
  • Mixed wildcard and exact value patterns
  • Numeric value handling (converted to strings)
  • Backward compatibility with exact matches

All tests follow proper conventions with clear naming, descriptive docstrings, and specific assertions. This provides strong confidence in the implementation.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/nat/eval/dataset_handler/dataset_filter.py (1)

52-52: Consider documenting the string conversion behavior.

Converting all values to strings (line 52) enables wildcard matching but means that numeric values like 123 will be matched as "123". While this behavior is necessary for wildcard patterns like iproute2_6[0-5] to work, it could be surprising for users filtering on numeric columns.

Consider adding a note in the method's docstring or the class docstring that all values are converted to strings for pattern matching.
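The behavior flagged here can be shown in a few lines. This is an illustrative sketch of the documented conversion (not the toolkit's code): values pass through str() before matching, so integer IDs can satisfy character-class patterns.

```python
from fnmatch import fnmatch

# Numeric values are compared via their string form, so an integer column
# can still match a pattern like "6[0-5]".
values = [60, 65, 123, 7]
matches = [v for v in values if fnmatch(str(v), "6[0-5]")]
print(matches)  # [60, 65]
```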

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b38a643 and 6a156fa.

📒 Files selected for processing (1)
  • src/nat/eval/dataset_handler/dataset_filter.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py, {src/**/*.py,packages/*/src/**/*.py}, src/nat/**/*, {**/*.py,**/*.sh,**/*.md,**/*.toml,**/*.y?(a)ml,**/*.json,**/*.txt,**/*.ini,**/*.cfg,**/*.ipynb}, and **/* — same instructions as listed in the review above, applied to src/nat/eval/dataset_handler/dataset_filter.py
🔇 Additional comments (3)
src/nat/eval/dataset_handler/dataset_filter.py (3)

16-17: LGTM! Clean import of standard library module.

The fnmatch import is correctly positioned and necessary for Unix shell-style wildcard matching.


29-29: LGTM! Clear documentation of wildcard support.

The docstring update clearly documents the new Unix shell-style wildcard functionality.


39-64: Good resolution of the static analysis warning.

The list comprehension approach (lines 60-61) effectively addresses the B023 warning from the previous review by avoiding lambda capture of loop variables. The implementation is clean and the type hints and docstring follow the coding guidelines.

@willkill07 willkill07 added improvement Improvement to existing functionality non-breaking Non-breaking change labels Nov 4, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a156fa and f97a47d.

📒 Files selected for processing (1)
  • src/nat/eval/dataset_handler/dataset_filter.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.py: In code comments use the abbreviations: nat (API namespace/CLI), nvidia-nat (package), NAT (env var prefixes); never use these abbreviations in documentation
Follow PEP 20 and PEP 8 for Python style
Run yapf with column_limit=120; yapf is used for formatting (run second)
Indent with 4 spaces (no tabs) and end each file with a single trailing newline
Use ruff (ruff check --fix) as a linter (not formatter) per pyproject.toml; fix warnings unless explicitly ignored
Respect Python naming schemes: snake_case for functions/variables, PascalCase for classes, UPPER_CASE for constants
When re-raising exceptions, use bare raise to preserve stack trace; log with logger.error(), not logger.exception()
When catching and logging without re-raising, use logger.exception() to capture full stack trace
Provide Google-style docstrings for every public module, class, function, and CLI command
Docstring first line must be a concise description ending with a period
Validate and sanitize all user input, especially in web or CLI interfaces
Prefer httpx with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use async/await for I/O-bound work (HTTP, DB, file I/O)
Cache expensive computations with functools.lru_cache or an external cache when appropriate
Leverage NumPy vectorized operations when beneficial and feasible

Files:

  • src/nat/eval/dataset_handler/dataset_filter.py
{src/**/*.py,packages/*/src/**/*.py}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

{src/**/*.py,packages/*/src/**/*.py}: All importable Python code must live under src/ or packages//src/
All public APIs must have Python 3.11+ type hints on parameters and return values
Prefer typing/collections.abc abstractions (e.g., Sequence over list)
Use typing.Annotated for units or metadata when useful
Treat pyright warnings as errors during development

Files:

  • src/nat/eval/dataset_handler/dataset_filter.py
src/nat/**/*

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

src/nat/**/* contains core functionality; changes should prioritize backward compatibility

Files:

  • src/nat/eval/dataset_handler/dataset_filter.py

⚙️ CodeRabbit configuration file

This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.

Files:

  • src/nat/eval/dataset_handler/dataset_filter.py
{**/*.py,**/*.sh,**/*.md,**/*.toml,**/*.y?(a)ml,**/*.json,**/*.txt,**/*.ini,**/*.cfg,**/*.ipynb}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

{**/*.py,**/*.sh,**/*.md,**/*.toml,**/*.y?(a)ml,**/*.json,**/*.txt,**/*.ini,**/*.cfg,**/*.ipynb}: Every file must start with the standard SPDX Apache-2.0 header
Confirm copyright years are up to date when a file is changed
All source files must include the SPDX Apache-2.0 header template (copy from an existing file)

Files:

  • src/nat/eval/dataset_handler/dataset_filter.py
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards. For Python code, follow PEP 20 and PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues.
  • Python methods should use type hints for all parameters and return values. Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace, and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception() to capture the full stack trace information.

Documentation Review Instructions

  • Verify that documentation and comments are clear and comprehensive.
  • Verify that the documentation doesn't contain any TODOs, FIXMEs, or placeholder text like "lorem ipsum".
  • Verify that the documentation doesn't contain any offensive or outdated terms.
  • Verify that documentation and comments are free of spelling mistakes, and that the documentation doesn't contain any words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file. Words that might appear to be spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Misc.

  • All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0, and should contain an Apache License 2.0 header comment at the top of each file.
  • Confirm that copyright years are up to date whenever a file is changed.

Files:

  • src/nat/eval/dataset_handler/dataset_filter.py
🔇 Additional comments (4)
src/nat/eval/dataset_handler/dataset_filter.py (4)

16-17: LGTM: Appropriate import for wildcard matching.

The fnmatch import is correctly placed and is the standard library module for Unix shell-style pattern matching.


39-64: LGTM: Clean implementation that addresses previous static analysis warning.

The method correctly implements wildcard pattern matching using fnmatch. The list comprehension approach (lines 60-61) properly addresses the B023 static analysis warning from the previous review by avoiding lambda capture.

Note: Converting values to strings (line 52) means numeric patterns are matched as strings (e.g., pattern "12*" matches numeric ID 123). This is appropriate for the intended use case with patterns like "iproute2_*".
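The review describes the approach without reproducing the code; a minimal sketch of string conversion plus a list comprehension over fnmatch (names are illustrative, not the toolkit's actual implementation) might look like:

```python
import fnmatch
from collections.abc import Sequence


def match_wildcard_patterns(values: Sequence[object], patterns: Sequence[str]) -> list[bool]:
    """Return a boolean mask: True where a value matches any pattern."""
    # str() conversion lets numeric IDs match string patterns,
    # e.g. pattern "12*" matches the integer 123.
    return [any(fnmatch.fnmatch(str(value), pattern) for pattern in patterns)
            for value in values]
```

Using a list comprehension (rather than a lambda closing over a loop variable) sidesteps the late-binding capture that the B023 warning flags.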


74-75: LGTM: Allowlist filtering correctly updated for wildcard support.

The change from exact-match isin() to _match_wildcard_patterns() properly enables Unix shell-style wildcard matching for allowlist filters.


81-82: LGTM: Denylist filtering correctly updated for wildcard support.

The change from exact-match isin() to _match_wildcard_patterns() properly enables Unix shell-style wildcard matching for denylist filters. The negation operator ~ correctly excludes matching rows.
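Taken together, the allowlist and denylist changes reduce to: keep rows matching any allowlist pattern, then drop rows matching any denylist pattern. A plain-Python sketch of that flow (the real code applies boolean masks to a pandas DataFrame; names here are illustrative):

```python
import fnmatch


def filter_ids(ids: list[str], allow: list[str], deny: list[str]) -> list[str]:
    """Keep IDs matching any allowlist pattern, then drop denylist matches."""
    def matches_any(value: str, patterns: list[str]) -> bool:
        return any(fnmatch.fnmatch(value, p) for p in patterns)

    # An empty allowlist keeps everything, mirroring an absent filter.
    kept = [i for i in ids if matches_any(i, allow)] if allow else list(ids)
    # The denylist plays the role of the `~mask` negation: drop matches.
    return [i for i in kept if not matches_any(i, deny)]
```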

@copy-pr-bot

copy-pr-bot bot commented Nov 5, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


Signed-off-by: Patrick Chin <8509935+thepatrickchin@users.noreply.github.com>
Updated the pattern matching logic in the DatasetFilter class to utilize list comprehension instead of a lambda function for improved performance and readability. This change enhances the efficiency of checking patterns against the input series.

Signed-off-by: Patrick Chin <8509935+thepatrickchin@users.noreply.github.com>
- Also include types to docstring args and returns

Signed-off-by: Patrick Chin <8509935+thepatrickchin@users.noreply.github.com>
@willkill07
Member

/merge

@willkill07
Member

/ok to test 71a0daf

@rapids-bot rapids-bot bot merged commit 771bb19 into NVIDIA:develop Nov 5, 2025
17 checks passed
saglave pushed a commit to snps-scm13/SNPS-NeMo-Agent-Toolkit that referenced this pull request Dec 11, 2025
…VIDIA#1146)

This PR adds support for Unix shell-style wildcards `(*, ?, [seq], [!seq])`  in dataset filter configurations.

In the example below, dataset entries starting with `iproute2_` are included, except those with a three-digit suffix or a suffix from 60 to 65:
```yaml
filter:
  allowlist:
    field:
      id:
        - iproute2_*
  denylist:
    field:
      id:
        - iproute2_???
        - iproute2_6[0-5]
```
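The pattern semantics are those of Python's standard fnmatch module, so the combined effect of the allowlist and denylist above can be checked directly (the example IDs are made up for illustration):

```python
import fnmatch

allow = ["iproute2_*"]
deny = ["iproute2_???", "iproute2_6[0-5]"]


def included(entry_id: str) -> bool:
    """True if the ID survives the allowlist/denylist above."""
    allowed = any(fnmatch.fnmatch(entry_id, p) for p in allow)
    denied = any(fnmatch.fnmatch(entry_id, p) for p in deny)
    return allowed and not denied


# included("iproute2_7")   -> True  (allowlisted, no denylist hit)
# included("iproute2_60")  -> False (denied by "iproute2_6[0-5]")
# included("iproute2_123") -> False (denied by "iproute2_???")
# included("eth_1")        -> False (never allowlisted)
```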
## By Submitting this PR I confirm:
- I am familiar with the [Contributing Guidelines](https://github.com/NVIDIA/NeMo-Agent-Toolkit/blob/develop/docs/source/resources/contributing.md).
- We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
  - Any contribution which contains commits that are not Signed-Off will not be accepted.
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.

## Summary by CodeRabbit

* **New Features**
  * Added Unix shell-style wildcard pattern matching to dataset filters (supports *, ?, and character classes); numeric values are treated as strings for pattern matching.
* **Tests**
  * Added comprehensive tests for allowlist, denylist, mixed patterns, single-character and character-class matching, numeric-string handling, and performance of exact-match paths.
* **Documentation**
  * Updated filter guidance to note wildcard support.

Authors:
  - Patrick Chin (https://github.com/thepatrickchin)

Approvers:
  - Will Killian (https://github.com/willkill07)

URL: NVIDIA#1146
Signed-off-by: Sangharsh Aglave <aglave@synopsys.com>
Labels

improvement (Improvement to existing functionality), non-breaking (Non-breaking change)
