fix: langchain web ingest script must not always add CUDA documents by willkill07 · Pull Request #1018 · NVIDIA/NeMo-Agent-Toolkit

willkill07 · 2025-10-16T13:39:44Z

Description

By default, the ingest script always added the CUDA docs to every collection, even when other URLs were passed with `--urls`.

This contributed to inconsistent behavior with Automated Description Generation for retriever collections.

This PR also unifies the configuration file descriptions and properly adds docker volumes to .gitignore.

Closes

By Submitting this PR I confirm:

I am familiar with the Contributing Guidelines.
We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
- Any contribution which contains commits that are not Signed-Off will not be accepted.
When the PR is ready for review, new or existing tests cover these changes.
When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

Chores
- Updated Git ignore patterns for vector database volume directories
- Simplified configuration settings in custom function examples
- Enhanced default parameter handling for documentation ingestion script

Signed-off-by: Will Killian <wkillian@nvidia.com>

coderabbitai · 2025-10-16T13:40:12Z

Walkthrough

Configuration files are updated to remove NIM LLM base URL settings, expand deployment volume ignore patterns, and enhance workflow configuration options with explicit tool naming and logging controls. A command-line script is modified to support optional URL arguments with automatic fallback to default CUDA documentation sources.

Changes

Cohort / File(s)	Summary
Git and Build Configuration `.gitignore`	Replaces single literal path (`deploy/compose/volumes`) with recursive glob patterns (`**/deploy/compose/volumes`, `**/deploy/volumes`) so that volume directories are ignored at any depth in the repository.
Automated Description Generation Config `examples/custom_functions/automated_description_generation/src/nat_automated_description_generation/configs/config.yml`	Removes `base_url` field from `nim_llm` configuration.
Automated Description Generation Config (No Auto) `examples/custom_functions/automated_description_generation/src/nat_automated_description_generation/configs/config_no_auto.yml`	Removes `base_url` from `nim_llm`; updates `cuda_tool` function description; adds `workflow` fields: `tool_names` (list), `verbose` (boolean), and `llm_name` (string).
Web Ingestion Script `scripts/langchain_web_ingest.py`	Changes `--urls` argument default from predefined `CUDA_URLS` list to empty list; adds post-parse fallback logic to assign `CUDA_URLS` when no URLs are provided.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title succinctly summarizes the central bug fix by indicating that the Langchain web ingest script will no longer always add CUDA documents, directly reflecting the main change in the pull request.
Docstring Coverage	✅ Passed	No functions found in the changes. Docstring coverage check skipped.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f9cafe4 and 9182a9f.

📒 Files selected for processing (4)

.gitignore (1 hunks)
examples/custom_functions/automated_description_generation/src/nat_automated_description_generation/configs/config.yml (0 hunks)
examples/custom_functions/automated_description_generation/src/nat_automated_description_generation/configs/config_no_auto.yml (2 hunks)
scripts/langchain_web_ingest.py (1 hunks)

💤 Files with no reviewable changes (1)

examples/custom_functions/automated_description_generation/src/nat_automated_description_generation/configs/config.yml

🧰 Additional context used

📓 Path-based instructions (7)

**/*.{py,yaml,yml}

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.{py,yaml,yml}: Configure response_seq as a list of strings; values cycle per call, and [] yields an empty string.
Configure delay_ms to inject per-call artificial latency in milliseconds for nat_test_llm.

Files:

scripts/langchain_web_ingest.py
examples/custom_functions/automated_description_generation/src/nat_automated_description_generation/configs/config_no_auto.yml
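
The `response_seq` and `delay_ms` semantics described in the rule above can be illustrated with a small stand-alone sketch. This is not the toolkit's `nat_test_llm` implementation; the class name `CyclingResponder` is hypothetical and exists only to show the cycling and empty-list behavior the rule describes.

```python
import itertools
import time


class CyclingResponder:
    """Hypothetical stub: returns canned responses in order, wrapping around."""

    def __init__(self, response_seq: list[str], delay_ms: int = 0) -> None:
        # An empty sequence has nothing to cycle over, so remember that case.
        self._cycle = itertools.cycle(response_seq) if response_seq else None
        self._delay_s = delay_ms / 1000.0

    def __call__(self, prompt: str) -> str:
        # Inject the artificial per-call latency, if any.
        if self._delay_s > 0:
            time.sleep(self._delay_s)
        # An empty response_seq yields an empty string on every call.
        if self._cycle is None:
            return ""
        return next(self._cycle)


responder = CyclingResponder(["first", "second"])
replies = [responder("ignored prompt") for _ in range(3)]
```

With two configured responses, the third call wraps around to the first response again.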

**/*.py

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.py: Programmatic use: create TestLLMConfig(response_seq=[...], delay_ms=...), add with builder.add_llm("", cfg).
When retrieving the test LLM wrapper, use builder.get_llm(name, wrapper_type=LLMFrameworkEnum.) and call the framework’s method (e.g., ainvoke, achat, call).

**/*.py: In code comments/identifiers use NAT abbreviations as specified: nat for API namespace/CLI, nvidia-nat for package name, NAT for env var prefixes; do not use these abbreviations in documentation
Follow PEP 20 and PEP 8; run yapf with column_limit=120; use 4-space indentation; end files with a single trailing newline
Run ruff check --fix as linter (not formatter) using pyproject.toml config; fix warnings unless explicitly ignored
Respect naming: snake_case for functions/variables, PascalCase for classes, UPPER_CASE for constants
Treat pyright warnings as errors during development
Exception handling: use bare raise to re-raise; log with logger.error() when re-raising to avoid duplicate stack traces; use logger.exception() when catching without re-raising
Provide Google-style docstrings for every public module, class, function, and CLI command; first line concise and ending with a period; surround code entities with backticks
Validate and sanitize all user input, especially in web or CLI interfaces
Prefer httpx with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use async/await for I/O-bound work; profile CPU-heavy paths with cProfile or mprof before optimizing; cache expensive computations with functools.lru_cache or external cache; leverage NumPy vectorized operations when beneficial

Files:

scripts/langchain_web_ingest.py
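
The exception-handling rule above distinguishes two cases. A minimal sketch, assuming a hypothetical pair of helpers (`parse_port` and `parse_port_or_default` are illustrative names, not part of the toolkit):

```python
import logging

logger = logging.getLogger(__name__)


def parse_port(text: str) -> int:
    """Re-raising case: log with logger.error() and use a bare raise."""
    try:
        return int(text)
    except ValueError:
        # No stack trace is logged here; the caller that finally handles
        # the exception can log it once, avoiding duplicate traces.
        logger.error("Invalid port value: %r", text)
        raise


def parse_port_or_default(text: str, default: int = 8080) -> int:
    """Swallowing case: logger.exception() records the full stack trace."""
    try:
        return int(text)
    except ValueError:
        # The exception stops here, so this is the only place the
        # stack trace can be captured.
        logger.exception("Invalid port value %r; falling back to %d", text, default)
        return default
```

The bare `raise` preserves the original traceback, whereas `raise exc` would add the current frame to it.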

{scripts/**,ci/scripts/**}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Shell or utility scripts belong in scripts/ or ci/scripts/ and must not be mixed with library code

Files:

scripts/langchain_web_ingest.py

**/*

⚙️ CodeRabbit configuration file

**/*: Code Review Instructions
- Ensure the code follows best practices and coding standards.
- For Python code, follow PEP 20 and PEP 8 for style guidelines.
- Check for security vulnerabilities and potential issues.
- Python methods should use type hints for all parameters and return values. Example:
  def my_function(param1: int, param2: str) -> bool:
      pass
- For Python exception handling, ensure proper stack trace preservation:
  - When re-raising exceptions: use bare raise statements to maintain the original stack trace, and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
  - When catching and logging exceptions without re-raising: always use logger.exception() to capture the full stack trace information.

Documentation Review Instructions
- Verify that documentation and comments are clear and comprehensive.
- Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum".
- Verify that the documentation doesn't contain any offensive or outdated terms.
- Verify that documentation and comments are free of spelling mistakes. Ensure the documentation doesn't contain any words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file. Words that might appear to be spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Misc.
- All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0, and should contain an Apache License 2.0 header comment at the top of each file.
- Confirm that copyright years are up to date whenever a file is changed.

Files:

scripts/langchain_web_ingest.py
examples/custom_functions/automated_description_generation/src/nat_automated_description_generation/configs/config_no_auto.yml

**/*.{yaml,yml}

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

In workflow/config YAML, set llms.._type: nat_test_llm to stub responses.

Files:

examples/custom_functions/automated_description_generation/src/nat_automated_description_generation/configs/config_no_auto.yml

**/configs/**

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Configuration files consumed by code must be stored next to that code in a configs/ folder

Files:

examples/custom_functions/automated_description_generation/src/nat_automated_description_generation/configs/config_no_auto.yml

examples/**/*

⚙️ CodeRabbit configuration file

examples/**/*:
- This directory contains example code and usage scenarios for the toolkit. At a minimum, an example should contain a README.md or README.ipynb file.
- If an example contains Python code, it should be placed in a subdirectory named src/ and should contain a pyproject.toml file. Optionally, it might also contain scripts in a scripts/ directory.
- If an example contains YAML files, they should be placed in a subdirectory named configs/.
- If an example contains sample data files, they should be placed in a subdirectory named data/, and should be checked into git-lfs.

Files:

examples/custom_functions/automated_description_generation/src/nat_automated_description_generation/configs/config_no_auto.yml

🔇 Additional comments (6)

.gitignore (1)

214-215: LGTM! Improved ignore pattern coverage.

The change from a single literal path to recursive glob patterns (**/deploy/compose/volumes and **/deploy/volumes) appropriately matches vector database volume directories at any depth in the repository.
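
To make "at any depth" concrete, the sketch below approximates what a leading `**/` means for these two patterns. It is not Git's matcher and ignores the rest of the gitignore syntax; `under_dir_at_any_depth` is a hypothetical helper that only checks whether a path is, or lies beneath, a directory whose trailing components equal the given tail.

```python
from pathlib import PurePosixPath


def under_dir_at_any_depth(path: str, tail: str) -> bool:
    """Approximate `**/<tail>`: True if `tail` occurs as consecutive path components."""
    parts = PurePosixPath(path).parts
    tail_parts = PurePosixPath(tail).parts
    width = len(tail_parts)
    # Slide a window over the path components and compare against the tail.
    return any(parts[i:i + width] == tail_parts for i in range(len(parts) - width + 1))


# Illustrative paths only; they are not taken from the repository.
nested = under_dir_at_any_depth("examples/rag/deploy/volumes/milvus/db.bin", "deploy/volumes")
top_level = under_dir_at_any_depth("deploy/compose/volumes/etcd", "deploy/compose/volumes")
unrelated = under_dir_at_any_depth("docs/volumes/readme.md", "deploy/volumes")
```

Under this approximation, the nested and top-level paths are covered, while a `volumes` directory that is not directly under `deploy` (or `deploy/compose`) is not.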

scripts/langchain_web_ingest.py (2)

127-130: LGTM! Correct implementation of optional URL argument.

The change from default=CUDA_URLS to default=[] with action="append" correctly allows users to specify custom URLs without forcing CUDA documentation into every collection. The updated help text accurately describes the new behavior.

136-137: LGTM! Appropriate fallback preserves backward compatibility.

The fallback logic ensures that when no URLs are provided, the script still defaults to CUDA documentation, maintaining backward compatibility for users who don't specify custom URLs.
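
The two review comments above can be reproduced with a stand-alone sketch. It assumes only what the review states (`--urls`, `action="append"`, `default=[]`, and a post-parse fallback to `CUDA_URLS`); the URL values and the helper names `parse_urls_before` and `parse_urls_after` are placeholders, and the real script's parser has other arguments not shown here.

```python
import argparse

# Placeholder values; the real script defines its own CUDA_URLS list.
CUDA_URLS = ["https://example.com/cuda/guide", "https://example.com/cuda/api"]


def parse_urls_before(argv: list[str]) -> list[str]:
    """Old behavior: a non-empty default combined with action="append"."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--urls", action="append", default=list(CUDA_URLS))
    # argparse appends command-line values after the default elements,
    # so the CUDA URLs are always present in the result.
    return parser.parse_args(argv).urls


def parse_urls_after(argv: list[str]) -> list[str]:
    """New behavior: empty default, with CUDA_URLS applied only as a fallback."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--urls", action="append", default=[])
    args = parser.parse_args(argv)
    if not args.urls:
        # No --urls given: keep the previous default for backward compatibility.
        args.urls = list(CUDA_URLS)
    return args.urls


custom = ["--urls", "https://example.com/wiki"]
before = parse_urls_before(custom)
after = parse_urls_after(custom)
fallback = parse_urls_after([])
```

With a custom URL, `before` still contains the CUDA entries followed by the custom one, while `after` contains only the custom URL; with no arguments, `fallback` equals `CUDA_URLS`.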

examples/custom_functions/automated_description_generation/src/nat_automated_description_generation/configs/config_no_auto.yml (3)

44-44: LGTM! Improved description clarity.

The updated description more accurately describes the tool's purpose, which will improve the agent's ability to use this tool appropriately.

46-51: LGTM! Enhanced workflow configuration.

The explicit workflow configuration with tool_names, verbose, and llm_name provides clearer control over the agent's behavior and improves maintainability.

34-34: Verify Milvus collection creation
References to wikipedia_docs appear in both configs and tests—confirm the Milvus retriever or initialization logic will create or handle this collection at runtime to prevent errors.

willkill07 · 2025-10-16T16:48:20Z

/merge

Fix langchain web ingest script

9182a9f

Signed-off-by: Will Killian <wkillian@nvidia.com>

willkill07 self-assigned this Oct 16, 2025

willkill07 added the bug Something isn't working label Oct 16, 2025

willkill07 requested a review from a team as a code owner October 16, 2025 13:39

willkill07 added the non-breaking Non-breaking change label Oct 16, 2025

bbednarski9 approved these changes Oct 16, 2025

rapids-bot bot merged commit d113b20 into NVIDIA:release/1.3 Oct 16, 2025
17 checks passed

willkill07 deleted the wkk_fix-langchain-ingest-script branch October 23, 2025 18:16
