fix: langchain web ingest script must not always add CUDA documents #1018

Merged
rapids-bot[bot] merged 1 commit into NVIDIA:release/1.3 from
willkill07:wkk_fix-langchain-ingest-script
Oct 16, 2025
@willkill07
Member

@willkill07 willkill07 commented Oct 16, 2025

Description

Previously, the script always ingested the CUDA docs into every collection, even when other URLs were supplied.

This contributed toward inconsistent behavior with Automated Description Generation for retriever collections.

This PR also unifies the configuration file descriptions and adds the Docker volume directories to .gitignore.

Closes

By Submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • Chores
    • Updated Git ignore patterns for vector database volume directories
    • Simplified configuration settings in custom function examples
    • Enhanced default parameter handling for documentation ingestion script

Signed-off-by: Will Killian <wkillian@nvidia.com>
@willkill07 willkill07 self-assigned this Oct 16, 2025
@willkill07 willkill07 added the bug Something isn't working label Oct 16, 2025
@willkill07 willkill07 requested a review from a team as a code owner October 16, 2025 13:39
@willkill07 willkill07 added the non-breaking Non-breaking change label Oct 16, 2025
@coderabbitai

coderabbitai bot commented Oct 16, 2025

Walkthrough

Configuration files are updated to remove NIM LLM base URL settings, expand deployment volume ignore patterns, and enhance workflow configuration options with explicit tool naming and logging controls. A command-line script is modified to support optional URL arguments with automatic fallback to default CUDA documentation sources.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Git and Build Configuration: .gitignore | Replaces single literal path (deploy/compose/volumes) with recursive glob patterns (**/deploy/compose/volumes, **/deploy/volumes) to broadly ignore volume directories across directory hierarchies. |
| Automated Description Generation Config: examples/custom_functions/automated_description_generation/src/nat_automated_description_generation/configs/config.yml | Removes base_url field from nim_llm configuration. |
| Automated Description Generation Config (No Auto): examples/custom_functions/automated_description_generation/src/nat_automated_description_generation/configs/config_no_auto.yml | Removes base_url from nim_llm; updates cuda_tool function description; adds workflow fields: tool_names (list), verbose (boolean), and llm_name (string). |
| Web Ingestion Script: scripts/langchain_web_ingest.py | Changes --urls argument default from predefined CUDA_URLS list to empty list; adds post-parse fallback logic to assign CUDA_URLS when no URLs are provided. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title succinctly summarizes the central bug fix by indicating that the Langchain web ingest script will no longer always add CUDA documents, directly reflecting the main change in the pull request. |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f9cafe4 and 9182a9f.

📒 Files selected for processing (4)
  • .gitignore (1 hunks)
  • examples/custom_functions/automated_description_generation/src/nat_automated_description_generation/configs/config.yml (0 hunks)
  • examples/custom_functions/automated_description_generation/src/nat_automated_description_generation/configs/config_no_auto.yml (2 hunks)
  • scripts/langchain_web_ingest.py (1 hunks)
💤 Files with no reviewable changes (1)
  • examples/custom_functions/automated_description_generation/src/nat_automated_description_generation/configs/config.yml
🧰 Additional context used
📓 Path-based instructions (7)
**/*.{py,yaml,yml}

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.{py,yaml,yml}: Configure response_seq as a list of strings; values cycle per call, and [] yields an empty string.
Configure delay_ms to inject per-call artificial latency in milliseconds for nat_test_llm.

Files:

  • scripts/langchain_web_ingest.py
  • examples/custom_functions/automated_description_generation/src/nat_automated_description_generation/configs/config_no_auto.yml
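
The response_seq and delay_ms rule quoted above can be pictured with a small stand-in. This is only a sketch of the described behavior (values cycle per call, an empty list yields an empty string, and each call sleeps for delay_ms); the class name `CyclingStubLLM` and its `invoke` method are hypothetical and are not the toolkit's actual nat_test_llm implementation.

```python
import time


class CyclingStubLLM:
    """Hypothetical stand-in mimicking the documented response_seq/delay_ms rule."""

    def __init__(self, response_seq: list[str], delay_ms: int = 0) -> None:
        self._responses = list(response_seq)
        self._delay_ms = delay_ms
        self._calls = 0

    def invoke(self, prompt: str) -> str:
        # Inject the configured artificial latency on every call.
        if self._delay_ms > 0:
            time.sleep(self._delay_ms / 1000)
        # An empty sequence yields an empty string.
        if not self._responses:
            return ""
        # Otherwise cycle through the configured responses, one per call.
        reply = self._responses[self._calls % len(self._responses)]
        self._calls += 1
        return reply
```

With `["alpha", "beta"]` as the sequence, three consecutive calls would return the first value, the second, and then the first again.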
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.py: Programmatic use: create TestLLMConfig(response_seq=[...], delay_ms=...), add with builder.add_llm("", cfg).
When retrieving the test LLM wrapper, use builder.get_llm(name, wrapper_type=LLMFrameworkEnum.) and call the framework’s method (e.g., ainvoke, achat, call).

**/*.py: In code comments/identifiers use NAT abbreviations as specified: nat for API namespace/CLI, nvidia-nat for package name, NAT for env var prefixes; do not use these abbreviations in documentation
Follow PEP 20 and PEP 8; run yapf with column_limit=120; use 4-space indentation; end files with a single trailing newline
Run ruff check --fix as linter (not formatter) using pyproject.toml config; fix warnings unless explicitly ignored
Respect naming: snake_case for functions/variables, PascalCase for classes, UPPER_CASE for constants
Treat pyright warnings as errors during development
Exception handling: use bare raise to re-raise; log with logger.error() when re-raising to avoid duplicate stack traces; use logger.exception() when catching without re-raising
Provide Google-style docstrings for every public module, class, function, and CLI command; first line concise and ending with a period; surround code entities with backticks
Validate and sanitize all user input, especially in web or CLI interfaces
Prefer httpx with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use async/await for I/O-bound work; profile CPU-heavy paths with cProfile or mprof before optimizing; cache expensive computations with functools.lru_cache or external cache; leverage NumPy vectorized operations when beneficial

Files:

  • scripts/langchain_web_ingest.py
{scripts/**,ci/scripts/**}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Shell or utility scripts belong in scripts/ or ci/scripts/ and must not be mixed with library code

Files:

  • scripts/langchain_web_ingest.py
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards.
    • For Python code, follow PEP 20 and PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues.
    • Python methods should use type hints for all parameters and return values.
      Example:
      def my_function(param1: int, param2: str) -> bool:
          pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.
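
The two exception-handling rules above can be illustrated side by side. This is a minimal sketch using only the standard logging module; the function names `parse_port` and `parse_port_or_default` and the default port value are hypothetical and do not come from the repository.

```python
import logging

logger = logging.getLogger(__name__)


def parse_port(raw: str) -> int:
    """Re-raising case: log with logger.error() and use a bare raise."""
    try:
        return int(raw)
    except ValueError:
        # logger.error() records the message without a traceback, so the
        # traceback is printed only once, wherever the exception is handled.
        logger.error("Invalid port value: %r", raw)
        raise  # bare raise keeps the original stack trace


def parse_port_or_default(raw: str, default: int = 8080) -> int:
    """Swallowing case: log with logger.exception() to keep the traceback."""
    try:
        return int(raw)
    except ValueError:
        # The exception stops here, so capture the full stack trace now.
        logger.exception("Invalid port value %r; falling back to %d", raw, default)
        return default
```

The first function leaves the decision to the caller, while the second is the only place the failure is ever seen, which is why it records the full traceback.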

Documentation Review Instructions

  • Verify that documentation and comments are clear and comprehensive.
  • Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum".
  • Verify that the documentation doesn't contain any offensive or outdated terms.
  • Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Misc.

  • All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0, and should contain an Apache License 2.0 header comment at the top of each file.
  • Confirm that copyright years are up-to date whenever a file is changed.

Files:

  • scripts/langchain_web_ingest.py
  • examples/custom_functions/automated_description_generation/src/nat_automated_description_generation/configs/config_no_auto.yml
**/*.{yaml,yml}

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

In workflow/config YAML, set llms.._type: nat_test_llm to stub responses.

Files:

  • examples/custom_functions/automated_description_generation/src/nat_automated_description_generation/configs/config_no_auto.yml
**/configs/**

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Configuration files consumed by code must be stored next to that code in a configs/ folder

Files:

  • examples/custom_functions/automated_description_generation/src/nat_automated_description_generation/configs/config_no_auto.yml
examples/**/*

⚙️ CodeRabbit configuration file

examples/**/*:

  • This directory contains example code and usage scenarios for the toolkit. At a minimum, an example should contain a README.md or README.ipynb file.
  • If an example contains Python code, it should be placed in a subdirectory named src/ and should
    contain a pyproject.toml file. Optionally, it might also contain scripts in a scripts/ directory.
  • If an example contains YAML files, they should be placed in a subdirectory named configs/.
  • If an example contains sample data files, they should be placed in a subdirectory named data/, and should
    be checked into git-lfs.

Files:

  • examples/custom_functions/automated_description_generation/src/nat_automated_description_generation/configs/config_no_auto.yml
🔇 Additional comments (6)
.gitignore (1)

214-215: LGTM! Improved ignore pattern coverage.

The change from a single literal path to recursive glob patterns (**/deploy/compose/volumes and **/deploy/volumes) appropriately matches vector database volume directories at any depth in the repository.
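
To picture what "at any depth" means for these two patterns, here is a rough sketch in Python. It is an approximation of gitignore semantics for this specific case only, not Git's matcher: a leading `**/` lets the remaining directory sequence match anywhere in the tree, so a path is treated as ignored when its components contain that sequence consecutively. The helper name `is_ignored` and the sample paths are hypothetical.

```python
from pathlib import PurePosixPath

# Directory sequences taken from the two patterns, with the leading "**/" removed.
PATTERN_SEQUENCES = [
    ("deploy", "compose", "volumes"),
    ("deploy", "volumes"),
]


def is_ignored(path: str) -> bool:
    """Approximate check: does any pattern sequence occur in the path's components?"""
    parts = PurePosixPath(path).parts
    for sequence in PATTERN_SEQUENCES:
        width = len(sequence)
        # Slide a window over the path components, looking for the sequence.
        for start in range(len(parts) - width + 1):
            if parts[start:start + width] == sequence:
                return True
    return False
```

Under this approximation, a volume directory nested inside an example (for instance `examples/demo/deploy/volumes/etcd`) is covered, whereas a pattern anchored at the repository root would only cover the top-level directory.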

scripts/langchain_web_ingest.py (2)

127-130: LGTM! Correct implementation of optional URL argument.

The change from default=CUDA_URLS to default=[] with action="append" correctly allows users to specify custom URLs without forcing CUDA documentation into every collection. The updated help text accurately describes the new behavior.
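
The underlying argparse behavior can be reproduced in isolation. With action="append", values given on the command line are appended to a copy of the default list, so a non-empty default is always part of the result. The sketch below assumes placeholder URLs and a hypothetical helper `parse_urls`; it is not the script's actual code.

```python
import argparse

# Placeholder stand-ins for the script's CUDA_URLS list (assumed values).
CUDA_URLS = ["https://example.com/cuda/intro", "https://example.com/cuda/guide"]


def parse_urls(argv: list[str], default: list[str]) -> list[str]:
    """Parse --urls with action="append" and the given default list."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--urls", action="append", default=default)
    return parser.parse_args(argv).urls


custom = ["--urls", "https://example.com/wiki/page"]

# Non-empty default: the custom URL is appended after the default entries.
with_cuda_default = parse_urls(custom, default=CUDA_URLS)

# Empty default: only the URL given on the command line is kept.
with_empty_default = parse_urls(custom, default=[])
```

In the first call the result contains the two default entries followed by the custom URL, which matches the behavior this PR removes.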


136-137: LGTM! Appropriate fallback preserves backward compatibility.

The fallback logic ensures that when no URLs are provided, the script still defaults to CUDA documentation, maintaining backward compatibility for users who don't specify custom URLs.
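
A post-parse fallback of this kind might look like the following. This is a sketch under assumptions: the helper name `resolve_urls` and the placeholder URLs are hypothetical, and the real script may apply the fallback inline rather than through a function.

```python
# Placeholder stand-ins for the script's CUDA_URLS list (assumed values).
CUDA_URLS = ["https://example.com/cuda/intro", "https://example.com/cuda/guide"]


def resolve_urls(parsed_urls: list[str], fallback: list[str]) -> list[str]:
    """Return the user-supplied URLs, or a copy of the fallback if none were given."""
    if parsed_urls:
        return list(parsed_urls)
    # Copy the fallback so callers cannot mutate the module-level constant.
    return list(fallback)
```

Applying the default after parsing, rather than through the argument's `default`, keeps the two cases separate: supplied URLs replace the CUDA list instead of extending it.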

examples/custom_functions/automated_description_generation/src/nat_automated_description_generation/configs/config_no_auto.yml (3)

44-44: LGTM! Improved description clarity.

The updated description more accurately describes the tool's purpose, which will improve the agent's ability to use this tool appropriately.


46-51: LGTM! Enhanced workflow configuration.

The explicit workflow configuration with tool_names, verbose, and llm_name provides clearer control over the agent's behavior and improves maintainability.
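
As a rough illustration of the three workflow fields and their stated types, the sketch below models them as a small Python structure with type checks. Everything here is hypothetical: the class `WorkflowSettings`, the loader `load_workflow_settings`, and the sample values are assumptions for illustration, and the toolkit validates its configuration through its own mechanisms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowSettings:
    """Hypothetical container for the three workflow fields named in the review."""

    tool_names: list[str]
    verbose: bool
    llm_name: str


def load_workflow_settings(raw: dict) -> WorkflowSettings:
    """Check the field types (list, boolean, string) and build the settings object."""
    tool_names = raw.get("tool_names")
    if not isinstance(tool_names, list) or not all(isinstance(t, str) for t in tool_names):
        raise TypeError("tool_names must be a list of strings")
    verbose = raw.get("verbose")
    if not isinstance(verbose, bool):
        raise TypeError("verbose must be a boolean")
    llm_name = raw.get("llm_name")
    if not isinstance(llm_name, str):
        raise TypeError("llm_name must be a string")
    return WorkflowSettings(tool_names=tool_names, verbose=verbose, llm_name=llm_name)


# Sample values are assumptions; the actual values in config_no_auto.yml are not shown here.
settings = load_workflow_settings(
    {"tool_names": ["cuda_tool"], "verbose": True, "llm_name": "nim_llm"}
)
```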


34-34: Verify Milvus collection creation
References to wikipedia_docs appear in both configs and tests—confirm the Milvus retriever or initialization logic will create or handle this collection at runtime to prevent errors.



@willkill07
Member Author

/merge

@rapids-bot rapids-bot bot merged commit d113b20 into NVIDIA:release/1.3 Oct 16, 2025
17 checks passed
@willkill07 willkill07 deleted the wkk_fix-langchain-ingest-script branch October 23, 2025 18:16

2 participants