Skip to content

Add create_or_update_text_document agent tool#1808

Merged
JSv4 merged 6 commits into
mainfrom
claude/agentic-document-versioning-JlyzS
May 28, 2026
Merged

Add create_or_update_text_document agent tool#1808
JSv4 merged 6 commits into
mainfrom
claude/agentic-document-versioning-JlyzS

Conversation

@JSv4
Copy link
Copy Markdown
Collaborator

@JSv4 JSv4 commented May 27, 2026

Summary

Adds a new corpus-scoped LLM tool, create_or_update_text_document, that lets a corpus-level agent author or version-up a text-based document inside the active corpus.

  • Initial scope: text formats onlytext/plain (default), text/markdown, application/txt (the TEXT_MIMETYPES set). Binary formats (PDF/DOCX) still require the parsing pipeline and are intentionally out of scope for this iteration.
  • Create-or-version-up semantics. The corpus filesystem path is derived from title using the same sanitisation as Corpus.add_document (truncate to MAX_FILENAME_LENGTH, non-alphanumeric → _, DEFAULT_DOCUMENT_PATH_PREFIX). A second call with the same title in the same corpus hits the same path and the dual-tree versioning architecture (documents/versioning.py::import_document) version-ups the existing doc — same version_tree_id, version_number + 1, old row flipped off is_current.
  • HITL by default. Registered with requires_approval=True + requires_write_permission=True + ToolCategory.CORPUS, mirroring move_document.
  • Permissions & IDOR. Requires PermissionTypes.UPDATE on the corpus (raises PermissionError otherwise). User / corpus / folder lookups all return the same "does not exist or is not accessible" message regardless of whether the row is missing or unreachable. Quota check routes through DocumentService.check_user_upload_quota so capped users get a clean ValueError instead of a half-created doc. The author is granted CRUD on the resulting document.
  • Return shape. {status: "created" | "updated", document_id, corpus_id, path, version_number, file_type, byte_count, message}.

The tool exposes both acreate_or_update_text_document (async, used by the agent runtime) and a sync create_or_update_text_document for tests. upload_text_document / aupload_text_document are exported as aliases for callers that prefer the verb-first name.

Files

  • opencontractserver/llms/tools/core_tools/text_document_import.py — new tool module (sync + async).
  • opencontractserver/llms/tools/core_tools/__init__.py — re-export the new symbols.
  • opencontractserver/llms/tools/tool_registry.pyToolDefinition (with parameter metadata for the GraphQL "available tools" surface) + FUNCTION_MAP entry (with upload_text_document alias).
  • opencontractserver/tests/test_text_document_import_tool.py — covers create, in-folder placement, second-call version-up, text/markdown acceptance, application/pdf rejection, empty-title / None-content rejection, nonexistent-user rejection, IDOR-on-other-user-corpus, read-only-corpus → PermissionError, folder-in-wrong-corpus rejection, and an async-wrapper smoke test.
  • CHANGELOG.md — Unreleased / Added entry.

Test plan

  • docker compose -f test.yml run django pytest opencontractserver/tests/test_text_document_import_tool.py -n 4 --dist loadscope
  • docker compose -f test.yml run django pytest opencontractserver/tests/test_tool_registry.py opencontractserver/tests/test_move_document_tool.py -n 4 --dist loadscope (regression: registry + sibling tool still pass)
  • pre-commit run --all-files
  • End-to-end smoke: drive a corpus-level agent and confirm the new tool surfaces in the available-tools list and that the approval modal renders with the parameter metadata.

Generated by Claude Code

Lets corpus-level agents author or version-up a text-based document
in the active corpus. Scope is intentionally limited to text formats
(text/plain, text/markdown, application/txt) for the initial release;
binary formats still require the parsing pipeline.

The tool derives the corpus path from `title` (matching Corpus.add_document
sanitisation), so a second call with the same title in the same corpus
hits the same path and the dual-tree versioning architecture
(documents/versioning.py::import_document) version-ups the existing doc.

Returns status ("created" | "updated"), document_id, version_number,
path, byte_count, etc. Wired into the registry as a CORPUS-category
tool with requires_approval=True / requires_write_permission=True
(HITL gate, mirroring move_document).
@claude
Copy link
Copy Markdown

claude Bot commented May 27, 2026

Code Review: Add create_or_update_text_document agent tool

Overview

This PR adds a corpus-scoped LLM tool that lets agents author or version-up text documents inside a corpus. The implementation is clean and well-structured, with solid IDOR protections, good test coverage, and correct async wrapping via _db_sync_to_async. A few issues are worth addressing before merge.


Issues

1. Empty-string content silently creates a 0-byte document (bug / docstring mismatch)

File: text_document_import.py:102

The docstring lists "content is empty" as a ValueError case, but the guard only catches None:

if content is None:
    raise ValueError("content must be a string (got None).")

An empty string ("") or whitespace-only (" ") passes through, producing a 0-byte txt_extract_file. If empty docs are intentional, remove the docstring claim. If not, add:

if not content or not content.strip():
    raise ValueError("content must be a non-empty string.")

There is also no test for content="", so this behavior is currently unspecified.


2. _path_record naming misuse (misleading leading underscore)

File: text_document_import.py:152

document, status, _path_record = corpus.import_content(...)
...
"version_number": _path_record.version_number,

A leading _ conventionally signals "intentionally unused" — but this variable IS used two lines later and again in the return dict. Rename to path_record.


3. Silent path collision between titles that sanitize identically

File: text_document_import.py:26-39

The docstring says "same title → same path," but the sanitisation means different titles can derive to the same path. "My_Doc" and "My Doc" both produce /documents/My_Doc, so a second call with either title silently version-ups the first document.

This is a design trade-off rather than a bug, but the docstring and the ToolDefinition.description in tool_registry.py only mention "same title" — they should clarify that non-alphanumeric chars (other than -_.) collapse to _, so distinct titles can derive to the same path and trigger an unexpected version-up.


4. Missing test for quota-exceeded path

File: test_text_document_import_tool.py

The quota check (DocumentService.check_user_upload_quota) is called on both create and update paths, but no test exhausts the quota and verifies the ValueError. A future refactor of check_user_upload_quota's return shape would silently break this tool without a test to catch it.


Minor / Nits

visible_to_user inline (text_document_import.py:119): The tool calls Corpus.objects.visible_to_user(user).get(pk=corpus_id) directly rather than through a service — CLAUDE.md's architecture invariant flags this for code with user context. The sibling move_document tool uses the same pattern (enforcement appears scoped to GraphQL resolvers), but worth aligning if the invariant is extended to LLM tools.

PermissionTypes.UPDATE check (text_document_import.py:125): Correct — matches requires_write_permission=True in the ToolDefinition.

Alias smoke coverage: The upload_text_document alias is wired in FUNCTION_MAP but the async smoke test only exercises the canonical acreate_or_update_text_document. A test call via the alias name would confirm registry wiring end-to-end.


Test Coverage Summary

Scenario Covered
Create at corpus root yes
Create in folder yes
Version-up (same title) yes
text/markdown accepted yes
Binary MIME rejected yes
Empty title rejected yes
None content rejected yes
Empty string content no
Nonexistent user yes
IDOR on foreign corpus yes
Read-only corpus PermissionError yes
Folder in wrong corpus yes
Path sanitisation yes
Quota exceeded no
Async wrapper smoke yes

Summary

The security posture (IDOR prevention, HITL flag, quota delegation) is correct, the async wrapping follows the _db_sync_to_async pattern, and the test suite covers the main happy and error paths well. The two actionable items before merge are: fix the empty-content validation gap (either add the check or remove the docstring claim) and rename _path_record to path_record. The path-collision docstring clarification and quota test are lower-priority but recommended.

JSv4 added 2 commits May 26, 2026 21:29
Claude review (#1808):
- Reject empty / whitespace-only content (was: 0-byte doc created)
- Rename _path_record to path_record (used three lines down)
- Document path-collision behavior in docstring + tool_registry desc
- Add tests for empty content, quota-exceeded, and both aliases

Discovered bug in opencontractserver/documents/versioning.py: the
update-existing-path branch did 'effective_txt_file = txt_file or
old_doc.txt_extract_file', causing a version-up to silently inherit
the previous version's stored file when the caller passed content
bytes but no explicit File object. Aligned with the create-path
logic (lines 324-343): always materialise the content bytes into a
fresh ContentFile when no File object is supplied.
@claude
Copy link
Copy Markdown

claude Bot commented May 27, 2026

Test comment - please ignore

@claude
Copy link
Copy Markdown

claude Bot commented May 27, 2026

Code Review: Add create_or_update_text_document agent tool

Overview

This PR adds a well-scoped corpus-level LLM tool for creating or versioning-up text documents, along with a meaningful bug fix in versioning.py that prevented version-up from writing new content. The design is correct and the test coverage is broad. There are a few issues worth addressing before merge.


Critical: Service Layer Invariant Violation

File: opencontractserver/llms/tools/core_tools/text_document_import.py, lines 203-208

The line corpus = Corpus.objects.visible_to_user(user).get(pk=corpus_id) is an inline Tier-0 use.

Per CLAUDE.md, LLM tools with a user context must reach models through opencontractserver/<app>/services/ — inline visible_to_user is Tier-0 and is enforced by both test_graphql_service_layer.py and the opencontracts.E001 Django system check (which fails manage.py startup). FolderCRUDService is correctly used three lines below, making the inconsistency conspicuous.

Fix: use the corpus service, e.g. CorpusService.get_corpus_for_user(user, corpus_id) or BaseService.get_or_none, then propagate the same IDOR-safe error message on None.


Medium: Possible Breaking API Change in versioning.py

File: opencontractserver/documents/versioning.py, lines 30-40

The old fallback effective_txt_file = txt_file or old_doc.txt_extract_file is removed in favour of always materialising from content. This is correct for the new tool (which always supplies content), and the motivation is well-documented. However, it changes the public contract of import_document: any existing caller that passes both txt_file=None and content=None expecting to inherit the old file will now get _create_content_file(content=None, ...) instead of a quiet no-op.

Ask: confirm no other caller of import_document relies on the old inherit-old-file behaviour, or add an explicit guard in import_document that raises ValueError when both txt_file and content are None.


Minor: Async Test Signal Management is Fragile

File: opencontractserver/tests/test_text_document_import_tool.py, lines 700-766

setUp disconnects the document-create signal, creates User + Corpus (neither triggers process_doc_on_create_atomic since that is a Document.post_save handler), then reconnects in finally. Each test method then repeats the disconnect/reconnect dance. This leaves a live-signal window between setUp.finally and the test method disconnect, and puts reconnect responsibility on individual tests rather than tearDown.

Recommended pattern: disconnect in setUp, reconnect in tearDown, remove per-test boilerplate.


Minor: aupload_text_document Absent from FUNCTION_MAP

File: opencontractserver/llms/tools/tool_registry.py, lines 1366-1369

The FUNCTION_MAP entry registers "upload_text_document" as an alias but aupload_text_document (the async alias exported from __init__.py) is not registered. Either register the async alias too, or remove aupload_text_document from __all__ to avoid advertising an entry point the registry cannot resolve.


Minor: No Content Size Guard

DocumentService.check_user_upload_quota caps document count but there is no maximum byte-size check on content. A crafted prompt could submit a multi-MB payload. Given that requires_approval=True this is low-risk in practice, but a soft cap via a constant in opencontractserver/constants/ would be a more complete defence.


Positive Notes

  • The versioning.py bug fix is correct and well-justified. The old code silently inherited stale bytes on version-up; the fix ensures new content is always materialised. The inline comment is one of the rare cases where a comment genuinely earns its place.
  • IDOR handling is consistent — all three lookup failure paths (user, corpus, folder) surface the same "does not exist or is not accessible" message.
  • corpus.user_can(user, PermissionTypes.UPDATE) is the right single-object permission API per CLAUDE.md.
  • Test coverage is thorough: create, folder placement, version-up (checks version_tree_id, is_current flip, single active path), all three supported MIME types, all four rejection paths (quota, IDOR, read-only corpus, cross-corpus folder), and the async wrapper.
  • requires_approval=True + requires_write_permission=True is the right HITL posture for a write tool.
  • Path sanitisation docstring correctly calls out the title-collision ambiguity and the ToolDefinition description propagates it to the LLM.

Removed the per-class disconnect/reconnect of process_doc_on_create_atomic.
conftest.py already disables document processing signals session-wide via
the disable_document_processing_signals autouse fixture, so the class-level
override was redundant — and worse, the tearDown reconnect *re-enabled* the
signal mid-session, causing 33 unrelated tests (test_nested_approval_gates,
test_llm_tools, test_blob_retention, test_extract_tasks, ...) that follow
on the same xdist worker to fail with IsADirectoryError when their
Document.objects.create() calls fired the now-reconnected parser pipeline.
@claude
Copy link
Copy Markdown

claude Bot commented May 28, 2026

Code Review — create_or_update_text_document agent tool

Overview

Adds a well-scoped agent tool for text document creation/versioning in a corpus, paired with a targeted fix to versioning.py that prevents version-up from silently inheriting old file content. The PR description is thorough, the test coverage is broad, and the HITL registration mirrors move_document. A few things worth addressing before merge:


Critical — Architecture Invariant Violation

text_document_import.py lines 203–212 uses inline Tier-0 calls:

corpus = Corpus.objects.visible_to_user(user).get(pk=corpus_id)
...
if not corpus.user_can(user, PermissionTypes.UPDATE):

CLAUDE.md is explicit: "LLM tools… must reach models through opencontractserver/<app>/services/— never compose visible_to_user / user_can / user_has_permission_for_obj inline." The invariant is enforced by both test_graphql_service_layer.py and the opencontracts.E001 Django system check that fails manage.py startup.

The folder lookup already uses FolderCRUDService.get_folder_by_id(), so this is inconsistent within the same function. The corpus lookup should route through the equivalent corpus-service method (visible in docs/architecture/query_permission_patterns.md).


Medium — versioning.py Behavioral Change May Break Existing Callers

versioning.py lines 31 and 40 removes the or old_doc.{field} fallback:

# Before
effective_txt_file = txt_file or old_doc.txt_extract_file
# After
effective_txt_file = txt_file  # only; no fallback

This is the right fix for the new tool's use-case (otherwise version-up silently inherits old bytes). However, it's a behavioral change to a shared code path. Any existing caller that passes import_document(…, txt_file=None, content=None) expecting to inherit the old file will now silently produce a document with an empty/None file.

It's worth adding a quick grep or comment confirming that all existing callers always supply either txt_file/pdf_file or content — or hardening the function with an assert content or txt_file or pdf_file guard to catch regressions early.


Medium — Ambiguous Upsert Key (Path vs. Title)

_derive_path_from_title collapses space, /, #, and other characters to _. The docstring correctly warns that "My Doc" and "My_Doc" both produce /documents/My_Doc, but the effect is broader:

  • "Contract / Draft #3" and "Contract___Draft__3" version-up each other
  • An agent calling the tool with both titles in sequence will silently clobber the second over the first

Since the LLM is the primary caller, this confusion is a real risk. Consider either: (a) echoing the derived path in the tool description so the LLM can reason about it, or (b) documenting in the tool's description string that the path — not the raw title — is the upsert key.


Low — No Individual Document Size Guard

Quota routing through DocumentService.check_user_upload_quota correctly enforces per-user document counts. But there is no guard on the size of a single document's content string. A very large payload (multi-MB string) will be fully encoded to bytes in memory before any storage check. Consider a max-bytes constant (e.g. from opencontractserver/constants/) checked before content.encode().


Low — content: str Type Signature vs. None Test

test_none_content_rejected passes None for a parameter typed str. The runtime validation catches it correctly (not content short-circuits for None), but the annotation is str, not str | None. Either:

  • Widen the annotation to str | None to reflect tested behavior, or
  • Add a leading if content is None: raise ValueError(…) check so the annotation stays str and the error message is more precise than "must be a non-empty string" for a None input.

Minor — test_quota_exceeded_rejected Mock Path

The patch target is:

"opencontractserver.documents.document_service.DocumentService.check_user_upload_quota"

But the import in text_document_import.py is:

from opencontractserver.documents.document_service import DocumentService

patch resolves the name at the module that owns the object, so patching the source (documents.document_service.DocumentService.check_user_upload_quota) is correct — but it's worth confirming this matches what's actually imported, since a local-name patch (opencontractserver.llms.tools.core_tools.text_document_import.DocumentService.check_user_upload_quota) would also be valid and arguably clearer.


Nits

  • _derive_path_from_title is only called from create_or_update_text_document and tested indirectly. If it ever needs reuse across tools, it belongs in opencontractserver/utils/. No action needed now, just flagging for if the module grows.
  • test_path_derivation_collapses_unsafe_chars expects "/documents/Contract___Draft__3". Three underscores for " / " (space→_, /_, space→_) and two for " #" (space→_, #_). This is correct but non-obvious — a short inline comment in the test would make the assertion self-documenting.
  • FUNCTION_MAP entry registers the alias key as "upload_text_document" in the tuple, but the canonical name exported in __all__ is "aupload_text_document" (async) and "upload_text_document" (sync). The registry wires the async acreate_or_update_text_document as the callable, so the alias string feels like it should be "aupload_text_document" for consistency. Worth a second look.

Summary

The architecture invariant violation (inline visible_to_user / user_can) is the blocker — it will trip the opencontracts.E001 system check. The versioning.py behavioral change deserves a callout comment or guard. Everything else is polish. The test suite is solid once those two items are addressed.

JSv4 added 2 commits May 27, 2026 20:53
…nt-versioning-JlyzS

# Conflicts:
#	CHANGELOG.md
- Route corpus visibility + UPDATE check through BaseService.get_or_none /
  user_has so the LLM tool no longer composes Tier-0 visible_to_user /
  user_can inline (CLAUDE.md rule 7).
- Cap content at MAX_FILE_UPLOAD_SIZE_BYTES before encoding to prevent
  in-memory runaway from a misbehaving agent.
- Widen content parameter annotation to str | None to match the
  None-rejection test contract.
- Document path-derivation expectation inline in
  test_path_derivation_collapses_unsafe_chars and add a size-cap test.
@JSv4 JSv4 merged commit 5c3ff2d into main May 28, 2026
11 checks passed
@JSv4 JSv4 deleted the claude/agentic-document-versioning-JlyzS branch May 28, 2026 03:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants