Skip to content

feat: add document inventory queue#31

Merged
elasticdotventures merged 3 commits into
mainfrom
codex-issue-23-document-queue
Apr 17, 2026
Merged

feat: add document inventory queue#31
elasticdotventures merged 3 commits into
mainfrom
codex-issue-23-document-queue

Conversation

@elasticdotventures
Copy link
Copy Markdown
Member

Closes #23

Summary

  • add a first-class l3dg3rr_document_inventory MCP tool for deterministic document discovery and triage
  • derive queue state from filesystem PDFs plus known ingested artifacts, with no claim/prioritization semantics in this first cut
  • expose stable status fields (invalid_name, ready, ingested) plus small-model hints and filename metadata

Validation

  • cargo test -p ledgerr-mcp
  • just test

Claude Sonnet (coordinator) and others added 2 commits April 17, 2026 23:15
… read generated artifacts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…cture

Merges origin/main (PRs #27-#30: MCP surface collapse, codegen contract,
workbook export, persist state) into codex-issue-23-document-queue.

Resolution strategy:
- Keep origin/main's 7-tool contract architecture throughout
- Retain document inventory helpers (lib.rs) and service call path
- Add DocumentInventory as a new action variant in DocumentsArgs so
  document_inventory is now reachable as ledgerr_documents + action
- Add handle_document_inventory + parse helpers to mcp_adapter.rs
- Update PUBLISHED_TOOLS actions list and mcp-capability-contract.md
  to include document_inventory in the ledgerr_documents row

All 41 test suites pass including contract_codegen schema/doc drift checks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@elasticdotventures elasticdotventures marked this pull request as ready for review April 17, 2026 23:17
Copilot AI review requested due to automatic review settings April 17, 2026 23:17
@elasticdotventures elasticdotventures merged commit dbe3fdb into main Apr 17, 2026
3 checks passed
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a deterministic “document inventory / queue” capability to the ledgerr_documents MCP surface, enabling agents to discover PDFs in a directory and triage them into stable status buckets (invalid_name, ready, ingested) with small-model hints.

Changes:

  • Extend the ledgerr_documents contract/action set with document_inventory and document status types.
  • Implement filesystem-derived inventory in TurboLedgerService and expose it via the MCP adapter with a summary payload.
  • Add service-level tests covering ready/ingested/invalid classification and status filtering.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
docs/mcp-capability-contract.md Advertises document_inventory as an action under ledgerr_documents.
crates/ledgerr-mcp/src/contract.rs Adds document_inventory to the published actions and introduces DocumentsArgs::DocumentInventory.
crates/ledgerr-mcp/src/lib.rs Implements document inventory discovery, status derivation, and response structs.
crates/ledgerr-mcp/src/mcp_adapter.rs Wires document_inventory through the transport adapter and adds summary.status_counts.
crates/ledgerr-mcp/tests/document_inventory.rs Adds tests for deterministic ordering, statuses, and filtering.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

| Tool | Purpose | Common actions |
|---|---|---|
| `ledgerr_documents` | document intake, routing, manifest/account discovery, raw context retrieval | `list_accounts`, `pipeline_status`, `validate_filename`, `ingest_pdf`, `ingest_rows`, `get_raw_context` |
| `ledgerr_documents` | document intake, routing, manifest/account discovery, raw context retrieval | `list_accounts`, `pipeline_status`, `validate_filename`, `ingest_pdf`, `ingest_rows`, `get_raw_context`, `document_inventory` |
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description mentions a first-class l3dg3rr_document_inventory MCP tool, but the published surface documented here adds document_inventory as an action under ledgerr_documents (and there is no corresponding l3dg3rr_* compatibility alias wired). Either update the PR description to match the contract approach, or add the promised alias/tool if that’s required for back-compat.

Copilot uses AI. Check for mistakes.
Comment on lines +15 to +19
#[test]
fn document_inventory_lists_ready_ingested_and_invalid_documents_deterministically() {
let tmp = tempfile::tempdir().expect("tempdir");
let workbook_path = tmp.path().join("tax-ledger.xlsx");
let service = service_for(&workbook_path);
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests cover the service-level document_inventory behavior, but the PR also introduces an MCP-facing action (ledgerr_documents/document_inventory) with a specific JSON envelope (including summary). Consider adding an adapter/contract-level test that calls mcp_adapter::handle_documents_tool for this action so the transport payload shape and argument validation remain deterministic over time.

Copilot uses AI. Check for mistakes.
Comment on lines +1857 to +1862
fn source_ref_matches(source_ref: &std::path::Path, expected: &std::path::Path) -> bool {
let source_canonical = std::fs::canonicalize(source_ref).ok();
let expected_canonical = std::fs::canonicalize(expected).ok();
source_canonical.as_ref() == expected_canonical.as_ref()
|| source_ref == expected
|| source_ref.file_name() == expected.file_name()
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

source_ref_matches falls back to comparing only file_name() when canonicalization/path equality fails. This can incorrectly mark a document as ingested if two different directories contain the same *.rkyv filename (or if one path doesn’t exist), producing false-positive Ingested statuses. Prefer matching by canonicalized absolute path (and/or a normalized path relative to the allowed base) and remove the filename-only fallback.

Suggested change
fn source_ref_matches(source_ref: &std::path::Path, expected: &std::path::Path) -> bool {
let source_canonical = std::fs::canonicalize(source_ref).ok();
let expected_canonical = std::fs::canonicalize(expected).ok();
source_canonical.as_ref() == expected_canonical.as_ref()
|| source_ref == expected
|| source_ref.file_name() == expected.file_name()
fn normalized_path_for_match(path: &std::path::Path) -> PathBuf {
let mut normalized = PathBuf::new();
for component in path.components() {
match component {
std::path::Component::CurDir => {}
std::path::Component::ParentDir => {
normalized.pop();
}
other => normalized.push(other.as_os_str()),
}
}
normalized
}
fn source_ref_matches(source_ref: &std::path::Path, expected: &std::path::Path) -> bool {
match (
std::fs::canonicalize(source_ref),
std::fs::canonicalize(expected),
) {
(Ok(source_canonical), Ok(expected_canonical)) => source_canonical == expected_canonical,
_ => normalized_path_for_match(source_ref) == normalized_path_for_match(expected),
}

Copilot uses AI. Check for mistakes.
Comment on lines +1734 to +1751
let resolved = if directory.is_absolute() {
directory.to_path_buf()
} else {
let base = workbook_path
.parent()
.filter(|parent| !parent.as_os_str().is_empty())
.map(std::path::Path::to_path_buf)
.unwrap_or(std::env::current_dir().map_err(|e| ToolError::Internal(e.to_string()))?);
base.join(directory)
};

if !resolved.is_dir() {
return Err(ToolError::InvalidInput(format!(
"directory '{}' does not exist or is not a directory",
resolved.display()
)));
}
Ok(resolved)
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolve_document_inventory_directory currently accepts any absolute directory path. This makes the MCP-exposed inventory capable of listing PDFs anywhere on the host filesystem, which is inconsistent with the existing source_ref/rkyv_ref safety model that constrains paths under workbook_path.parent(). Consider applying the same allowed-base restriction (and ideally canonicalizing before the prefix check to avoid symlink escapes).

Suggested change
let resolved = if directory.is_absolute() {
directory.to_path_buf()
} else {
let base = workbook_path
.parent()
.filter(|parent| !parent.as_os_str().is_empty())
.map(std::path::Path::to_path_buf)
.unwrap_or(std::env::current_dir().map_err(|e| ToolError::Internal(e.to_string()))?);
base.join(directory)
};
if !resolved.is_dir() {
return Err(ToolError::InvalidInput(format!(
"directory '{}' does not exist or is not a directory",
resolved.display()
)));
}
Ok(resolved)
let base = workbook_path
.parent()
.filter(|parent| !parent.as_os_str().is_empty())
.map(std::path::Path::to_path_buf)
.unwrap_or(std::env::current_dir().map_err(|e| ToolError::Internal(e.to_string()))?);
let canonical_base =
std::fs::canonicalize(&base).map_err(|e| ToolError::Internal(e.to_string()))?;
let resolved = if directory.is_absolute() {
directory.to_path_buf()
} else {
base.join(directory)
};
let canonical_resolved = std::fs::canonicalize(&resolved).map_err(|_| {
ToolError::InvalidInput(format!(
"directory '{}' does not exist or is not a directory",
resolved.display()
))
})?;
if !canonical_resolved.is_dir() {
return Err(ToolError::InvalidInput(format!(
"directory '{}' does not exist or is not a directory",
resolved.display()
)));
}
if !canonical_resolved.starts_with(&canonical_base) {
return Err(ToolError::InvalidInput(format!(
"directory '{}' must be within '{}'",
resolved.display(),
canonical_base.display()
)));
}
Ok(canonical_resolved)

Copilot uses AI. Check for mistakes.
Comment on lines +1016 to +1020
let known_source_refs = self
.classification_state
.lock()
.map_err(|_| ToolError::Internal("classification lock poisoned".to_string()))?
.tx_rows
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation builds known_source_refs and then, for each discovered PDF, scans the entire list to derive ingested_tx_ids (and source_ref_matches may canonicalize paths during each comparison). For large ledgers/directories this becomes O(num_docs × num_rows) with significant filesystem calls. Consider precomputing an index (e.g., canonical raw-context path -> tx_ids) once per request to make per-document lookup O(1) and avoid repeated canonicalize() calls.

Copilot uses AI. Check for mistakes.
Comment on lines +367 to +380
} => {
let status_filter: Vec<DocumentQueueStatusRequest> = statuses
.into_iter()
.filter_map(|s| DocumentQueueStatusRequest::parse(s.as_str()))
.collect();
handle_document_inventory(
service,
&json!({
"directory": directory,
"recursive": recursive,
"statuses": status_filter.iter().map(|s| s.as_str()).collect::<Vec<_>>(),
}),
)
}
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the contract-driven path, invalid status strings are silently dropped via filter_map, which means callers can pass unsupported values without getting an error (and may get an unfiltered result set). Since parse_document_inventory_request already treats unsupported statuses as InvalidInput, consider validating here too (e.g., return an error when any status fails to parse) so contract + direct JSON parsing behave consistently.

Suggested change
} => {
let status_filter: Vec<DocumentQueueStatusRequest> = statuses
.into_iter()
.filter_map(|s| DocumentQueueStatusRequest::parse(s.as_str()))
.collect();
handle_document_inventory(
service,
&json!({
"directory": directory,
"recursive": recursive,
"statuses": status_filter.iter().map(|s| s.as_str()).collect::<Vec<_>>(),
}),
)
}
} => handle_document_inventory(
service,
&json!({
"directory": directory,
"recursive": recursive,
"statuses": statuses,
}),
)

Copilot uses AI. Check for mistakes.
Comment on lines +98 to +104
pub enum DocumentQueueStatusRequest {
InvalidName,
Ready,
Ingested,
}

impl DocumentQueueStatusRequest {
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DocumentQueueStatusRequest is used both to filter requests (DocumentInventoryRequest.statuses) and as the response status type (DocumentRecordResponse.status). The *Request suffix is misleading for a type that is also returned to clients; consider renaming to a neutral type name (e.g., DocumentQueueStatus) and introducing a separate request filter type only if needed.

Suggested change
pub enum DocumentQueueStatusRequest {
InvalidName,
Ready,
Ingested,
}
impl DocumentQueueStatusRequest {
pub enum DocumentQueueStatus {
InvalidName,
Ready,
Ingested,
}
pub type DocumentQueueStatusRequest = DocumentQueueStatus;
impl DocumentQueueStatus {

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add document inventory and queue capabilities for back-office operation

2 participants