feat: add document inventory queue#31
Conversation
… read generated artifacts Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…cture Merges origin/main (PRs #27-#30: MCP surface collapse, codegen contract, workbook export, persist state) into codex-issue-23-document-queue. Resolution strategy: - Keep origin/main's 7-tool contract architecture throughout - Retain document inventory helpers (lib.rs) and service call path - Add DocumentInventory as a new action variant in DocumentsArgs so document_inventory is now reachable as ledgerr_documents + action - Add handle_document_inventory + parse helpers to mcp_adapter.rs - Update PUBLISHED_TOOLS actions list and mcp-capability-contract.md to include document_inventory in the ledgerr_documents row All 41 test suites pass including contract_codegen schema/doc drift checks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds a deterministic “document inventory / queue” capability to the ledgerr_documents MCP surface, enabling agents to discover PDFs in a directory and triage them into stable status buckets (invalid_name, ready, ingested) with small-model hints.
Changes:
- Extend the
ledgerr_documentscontract/action set withdocument_inventoryand document status types. - Implement filesystem-derived inventory in
TurboLedgerServiceand expose it via the MCP adapter with a summary payload. - Add service-level tests covering ready/ingested/invalid classification and status filtering.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| docs/mcp-capability-contract.md | Advertises document_inventory as an action under ledgerr_documents. |
| crates/ledgerr-mcp/src/contract.rs | Adds document_inventory to the published actions and introduces DocumentsArgs::DocumentInventory. |
| crates/ledgerr-mcp/src/lib.rs | Implements document inventory discovery, status derivation, and response structs. |
| crates/ledgerr-mcp/src/mcp_adapter.rs | Wires document_inventory through the transport adapter and adds summary.status_counts. |
| crates/ledgerr-mcp/tests/document_inventory.rs | Adds tests for deterministic ordering, statuses, and filtering. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| | Tool | Purpose | Common actions | | ||
| |---|---|---| | ||
| | `ledgerr_documents` | document intake, routing, manifest/account discovery, raw context retrieval | `list_accounts`, `pipeline_status`, `validate_filename`, `ingest_pdf`, `ingest_rows`, `get_raw_context` | | ||
| | `ledgerr_documents` | document intake, routing, manifest/account discovery, raw context retrieval | `list_accounts`, `pipeline_status`, `validate_filename`, `ingest_pdf`, `ingest_rows`, `get_raw_context`, `document_inventory` | |
There was a problem hiding this comment.
PR description mentions a first-class l3dg3rr_document_inventory MCP tool, but the published surface documented here adds document_inventory as an action under ledgerr_documents (and there is no corresponding l3dg3rr_* compatibility alias wired). Either update the PR description to match the contract approach, or add the promised alias/tool if that’s required for back-compat.
| #[test] | ||
| fn document_inventory_lists_ready_ingested_and_invalid_documents_deterministically() { | ||
| let tmp = tempfile::tempdir().expect("tempdir"); | ||
| let workbook_path = tmp.path().join("tax-ledger.xlsx"); | ||
| let service = service_for(&workbook_path); |
There was a problem hiding this comment.
These tests cover the service-level document_inventory behavior, but the PR also introduces an MCP-facing action (ledgerr_documents/document_inventory) with a specific JSON envelope (including summary). Consider adding an adapter/contract-level test that calls mcp_adapter::handle_documents_tool for this action so the transport payload shape and argument validation remain deterministic over time.
| fn source_ref_matches(source_ref: &std::path::Path, expected: &std::path::Path) -> bool { | ||
| let source_canonical = std::fs::canonicalize(source_ref).ok(); | ||
| let expected_canonical = std::fs::canonicalize(expected).ok(); | ||
| source_canonical.as_ref() == expected_canonical.as_ref() | ||
| || source_ref == expected | ||
| || source_ref.file_name() == expected.file_name() |
There was a problem hiding this comment.
source_ref_matches falls back to comparing only file_name() when canonicalization/path equality fails. This can incorrectly mark a document as ingested if two different directories contain the same *.rkyv filename (or if one path doesn’t exist), producing false-positive Ingested statuses. Prefer matching by canonicalized absolute path (and/or a normalized path relative to the allowed base) and remove the filename-only fallback.
| fn source_ref_matches(source_ref: &std::path::Path, expected: &std::path::Path) -> bool { | |
| let source_canonical = std::fs::canonicalize(source_ref).ok(); | |
| let expected_canonical = std::fs::canonicalize(expected).ok(); | |
| source_canonical.as_ref() == expected_canonical.as_ref() | |
| || source_ref == expected | |
| || source_ref.file_name() == expected.file_name() | |
| fn normalized_path_for_match(path: &std::path::Path) -> PathBuf { | |
| let mut normalized = PathBuf::new(); | |
| for component in path.components() { | |
| match component { | |
| std::path::Component::CurDir => {} | |
| std::path::Component::ParentDir => { | |
| normalized.pop(); | |
| } | |
| other => normalized.push(other.as_os_str()), | |
| } | |
| } | |
| normalized | |
| } | |
| fn source_ref_matches(source_ref: &std::path::Path, expected: &std::path::Path) -> bool { | |
| match ( | |
| std::fs::canonicalize(source_ref), | |
| std::fs::canonicalize(expected), | |
| ) { | |
| (Ok(source_canonical), Ok(expected_canonical)) => source_canonical == expected_canonical, | |
| _ => normalized_path_for_match(source_ref) == normalized_path_for_match(expected), | |
| } |
| let resolved = if directory.is_absolute() { | ||
| directory.to_path_buf() | ||
| } else { | ||
| let base = workbook_path | ||
| .parent() | ||
| .filter(|parent| !parent.as_os_str().is_empty()) | ||
| .map(std::path::Path::to_path_buf) | ||
| .unwrap_or(std::env::current_dir().map_err(|e| ToolError::Internal(e.to_string()))?); | ||
| base.join(directory) | ||
| }; | ||
|
|
||
| if !resolved.is_dir() { | ||
| return Err(ToolError::InvalidInput(format!( | ||
| "directory '{}' does not exist or is not a directory", | ||
| resolved.display() | ||
| ))); | ||
| } | ||
| Ok(resolved) |
There was a problem hiding this comment.
resolve_document_inventory_directory currently accepts any absolute directory path. This makes the MCP-exposed inventory capable of listing PDFs anywhere on the host filesystem, which is inconsistent with the existing source_ref/rkyv_ref safety model that constrains paths under workbook_path.parent(). Consider applying the same allowed-base restriction (and ideally canonicalizing before the prefix check to avoid symlink escapes).
| let resolved = if directory.is_absolute() { | |
| directory.to_path_buf() | |
| } else { | |
| let base = workbook_path | |
| .parent() | |
| .filter(|parent| !parent.as_os_str().is_empty()) | |
| .map(std::path::Path::to_path_buf) | |
| .unwrap_or(std::env::current_dir().map_err(|e| ToolError::Internal(e.to_string()))?); | |
| base.join(directory) | |
| }; | |
| if !resolved.is_dir() { | |
| return Err(ToolError::InvalidInput(format!( | |
| "directory '{}' does not exist or is not a directory", | |
| resolved.display() | |
| ))); | |
| } | |
| Ok(resolved) | |
| let base = workbook_path | |
| .parent() | |
| .filter(|parent| !parent.as_os_str().is_empty()) | |
| .map(std::path::Path::to_path_buf) | |
| .unwrap_or(std::env::current_dir().map_err(|e| ToolError::Internal(e.to_string()))?); | |
| let canonical_base = | |
| std::fs::canonicalize(&base).map_err(|e| ToolError::Internal(e.to_string()))?; | |
| let resolved = if directory.is_absolute() { | |
| directory.to_path_buf() | |
| } else { | |
| base.join(directory) | |
| }; | |
| let canonical_resolved = std::fs::canonicalize(&resolved).map_err(|_| { | |
| ToolError::InvalidInput(format!( | |
| "directory '{}' does not exist or is not a directory", | |
| resolved.display() | |
| )) | |
| })?; | |
| if !canonical_resolved.is_dir() { | |
| return Err(ToolError::InvalidInput(format!( | |
| "directory '{}' does not exist or is not a directory", | |
| resolved.display() | |
| ))); | |
| } | |
| if !canonical_resolved.starts_with(&canonical_base) { | |
| return Err(ToolError::InvalidInput(format!( | |
| "directory '{}' must be within '{}'", | |
| resolved.display(), | |
| canonical_base.display() | |
| ))); | |
| } | |
| Ok(canonical_resolved) |
| let known_source_refs = self | ||
| .classification_state | ||
| .lock() | ||
| .map_err(|_| ToolError::Internal("classification lock poisoned".to_string()))? | ||
| .tx_rows |
There was a problem hiding this comment.
This implementation builds known_source_refs and then, for each discovered PDF, scans the entire list to derive ingested_tx_ids (and source_ref_matches may canonicalize paths during each comparison). For large ledgers/directories this becomes O(num_docs × num_rows) with significant filesystem calls. Consider precomputing an index (e.g., canonical raw-context path -> tx_ids) once per request to make per-document lookup O(1) and avoid repeated canonicalize() calls.
| } => { | ||
| let status_filter: Vec<DocumentQueueStatusRequest> = statuses | ||
| .into_iter() | ||
| .filter_map(|s| DocumentQueueStatusRequest::parse(s.as_str())) | ||
| .collect(); | ||
| handle_document_inventory( | ||
| service, | ||
| &json!({ | ||
| "directory": directory, | ||
| "recursive": recursive, | ||
| "statuses": status_filter.iter().map(|s| s.as_str()).collect::<Vec<_>>(), | ||
| }), | ||
| ) | ||
| } |
There was a problem hiding this comment.
In the contract-driven path, invalid status strings are silently dropped via filter_map, which means callers can pass unsupported values without getting an error (and may get an unfiltered result set). Since parse_document_inventory_request already treats unsupported statuses as InvalidInput, consider validating here too (e.g., return an error when any status fails to parse) so contract + direct JSON parsing behave consistently.
| } => { | |
| let status_filter: Vec<DocumentQueueStatusRequest> = statuses | |
| .into_iter() | |
| .filter_map(|s| DocumentQueueStatusRequest::parse(s.as_str())) | |
| .collect(); | |
| handle_document_inventory( | |
| service, | |
| &json!({ | |
| "directory": directory, | |
| "recursive": recursive, | |
| "statuses": status_filter.iter().map(|s| s.as_str()).collect::<Vec<_>>(), | |
| }), | |
| ) | |
| } | |
| } => handle_document_inventory( | |
| service, | |
| &json!({ | |
| "directory": directory, | |
| "recursive": recursive, | |
| "statuses": statuses, | |
| }), | |
| ) |
| pub enum DocumentQueueStatusRequest { | ||
| InvalidName, | ||
| Ready, | ||
| Ingested, | ||
| } | ||
|
|
||
| impl DocumentQueueStatusRequest { |
There was a problem hiding this comment.
DocumentQueueStatusRequest is used both to filter requests (DocumentInventoryRequest.statuses) and as the response status type (DocumentRecordResponse.status). The *Request suffix is misleading for a type that is also returned to clients; consider renaming to a neutral type name (e.g., DocumentQueueStatus) and introducing a separate request filter type only if needed.
| pub enum DocumentQueueStatusRequest { | |
| InvalidName, | |
| Ready, | |
| Ingested, | |
| } | |
| impl DocumentQueueStatusRequest { | |
| pub enum DocumentQueueStatus { | |
| InvalidName, | |
| Ready, | |
| Ingested, | |
| } | |
| pub type DocumentQueueStatusRequest = DocumentQueueStatus; | |
| impl DocumentQueueStatus { |
Closes #23
Summary
l3dg3rr_document_inventoryMCP tool for deterministic document discovery and triageinvalid_name,ready,ingested) plus small-model hints and filename metadataValidation
cargo test -p ledgerr-mcpjust test