Add support for multi-doc extraction #389

Merged
ppinchuk merged 38 commits into main from pp/multi-doc
Mar 10, 2026

Conversation

@ppinchuk
Collaborator

Multi-doc extraction has to be enabled on a plugin-by-plugin basis. When it is enabled, the extraction context is simply the concatenated text from all document sources, instead of the documents being processed one by one.
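
To make the difference between the two modes concrete, here is a minimal sketch. It is not the COMPASS implementation: the `Doc` class, both function names, and the blank-line separator are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for one scraped document source."""

    source: str
    text: str


def single_doc_contexts(docs):
    """Default mode: one extraction context per document."""
    return [doc.text for doc in docs]


def multi_doc_context(docs, sep="\n\n"):
    """Multi-doc mode: one context built by joining every document's text.

    The separator is an assumption; empty documents are skipped.
    """
    return sep.join(doc.text for doc in docs if doc.text)


docs = [
    Doc("a.pdf", "Setback: 500 ft."),
    Doc("b.pdf", "Max height: 200 ft."),
]
per_doc = single_doc_contexts(docs)  # two separate contexts
combined = multi_doc_context(docs)   # one concatenated context
```

In the default mode the extractor would see two contexts in turn; in multi-doc mode it would see a single context containing both passages.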

ppinchuk added this to the Long-term plans milestone Mar 10, 2026
ppinchuk self-assigned this Mar 10, 2026
ppinchuk requested a review from castelao as a code owner March 10, 2026 18:37
ppinchuk added the labels enhancement (Update to logic or general code improvements), new computation (Update that adds a new computation method), p-critical (Priority: critical), and topic-python-llm (Issues/pull requests related to LLMs) Mar 10, 2026
Copilot AI review requested due to automatic review settings March 10, 2026 18:37
Contributor

Copilot AI left a comment

Pull request overview

This PR adds support for multi-document extraction in COMPASS plugins, enabling a new mode where all document sources are concatenated into a single context before LLM-based extraction. It also renames the ord_year column to year throughout the codebase, and adds a new "one-shot" water rights demo example.

Changes:

  • Adds ALLOW_MULTI_DOC_EXTRACTION class attribute and parse_multi_doc_context_for_structured_data method to OrdinanceExtractionPlugin, allowing plugins to opt into multi-doc mode; refactors _concat_scrape_results and renames parse_single_doc_for_structured_data to parse_for_structured_data
  • Renames ord_year column to year throughout Python code, Rust crates, tests, and database diagram for consistency
  • Adds a new water_rights_demo/one-shot example showcasing multi-doc extraction with a JSON5 schema and plugin config, and updates the one-shot extraction docs to document the $qualitative_features schema key
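
The opt-in described in the first bullet could look roughly like the sketch below. Only the names ALLOW_MULTI_DOC_EXTRACTION, parse_for_structured_data, and parse_multi_doc_context_for_structured_data come from this PR; the `SketchExtractionPlugin` class, the `run` method, all signatures, and the word-count "extraction" are hypothetical placeholders, not the real OrdinanceExtractionPlugin API.

```python
class SketchExtractionPlugin:
    """Hypothetical stand-in, NOT the real OrdinanceExtractionPlugin."""

    # Assumed default: multi-doc mode is off unless a plugin opts in
    ALLOW_MULTI_DOC_EXTRACTION = False

    def parse_for_structured_data(self, text):
        # Placeholder "extraction": count the words in one context
        return {"n_words": len(text.split())}

    def parse_multi_doc_context_for_structured_data(self, texts):
        # Assumed behaviour: join all texts, then reuse the single path
        return self.parse_for_structured_data("\n\n".join(texts))

    def run(self, texts):
        """Dispatch on the class attribute (hypothetical driver)."""
        if self.ALLOW_MULTI_DOC_EXTRACTION:
            return [self.parse_multi_doc_context_for_structured_data(texts)]
        return [self.parse_for_structured_data(text) for text in texts]


class WaterRightsSketch(SketchExtractionPlugin):
    # A plugin opts in by overriding the class attribute
    ALLOW_MULTI_DOC_EXTRACTION = True


texts = ["one two", "three four five"]
per_doc = SketchExtractionPlugin().run(texts)  # one result per document
multi = WaterRightsSketch().run(texts)         # one result overall
```

Keeping the switch on the class rather than on the instance matches the "plugin-by-plugin" wording: every instance of a given plugin behaves the same way.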

Reviewed changes

Copilot reviewed 26 out of 31 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| compass/extraction/context.py | Adds multi_doc_context() method and _text_from_doc() helper; changes text property to use multi_doc_context() |
| compass/plugin/ordinance.py | Adds multi-doc extraction methods, new helper functions _fill_out_multi_file_sources, _get_source_inds, _fill_in_all_sources; refactors single-doc path; removes source/year assignment from _concat_scrape_results |
| compass/plugin/one_shot/base.py | Adds allow_multi_doc_extraction config key support; updates class factory to read and set ALLOW_MULTI_DOC_EXTRACTION |
| compass/plugin/one_shot/components.py | Adds abstract QUALITATIVE_FEATURES property; changes qualitative feature detection from $definitions.qualitative_restrictions to new $qualitative_features key; updates _to_dataframe output columns |
| compass/extraction/water/plugin.py | Renames ord_year to year; simplifies _set_data_year and _set_data_sources |
| compass/utilities/parsing.py | Renames extract_ord_year_from_doc_attrs to extract_year_from_doc_attrs |
| compass/utilities/finalize.py | Renames ord_year to year in column list; updates compile_run_summary_message text |
| compass/utilities/__init__.py | Exports renamed extract_year_from_doc_attrs |
| crates/compass/src/scraper/ordinance/quantitative.rs | Renames ord_year to year in SQL schema, structs, and CSV header |
| crates/compass/src/scraper/ordinance/qualitative.rs | Same as quantitative.rs |
| crates/compass/src/lib.rs | Updates commented-out SQL snippet for column rename |
| docs/diagram/compass-db.dot | Updates DB diagram to reflect year column rename |
| examples/water_rights_demo/one-shot/ | New one-shot water rights example with config, schema, plugin config, local docs, and README |
| examples/water_rights_demo/rag-based/README.rst | Title update and typo fixes |
| examples/water_rights_demo/rag-based/config.json5 | Fixes jurisdiction filepath reference |
| examples/water_rights_demo/README.md | New top-level README for the water rights demo |
| examples/one_shot_schema_extraction/README.rst | Updates docs for new $qualitative_features key |
| examples/one_shot_schema_extraction/wind_schema.json | Adds $qualitative_features array |
| examples/README.md | New top-level examples README |
| pyproject.toml | Updates ruff version constraint to >=0.15.5,<0.16 |
| pixi.lock | Lock file update for ruff version bump |
| tests/python/unit/utilities/test_utilities_parsing.py | Updates test for renamed function |
| tests/python/unit/utilities/test_utilities_finalize.py | Updates tests for year column rename |
| tests/python/unit/extraction/test_extraction_context.py | Adds tests for multi_doc_context() method |
| .gitignore | Adds *.code-workspace to ignore list |
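
The compass/plugin/one_shot/base.py entry mentions a class factory that reads the allow_multi_doc_extraction config key and sets ALLOW_MULTI_DOC_EXTRACTION. A rough sketch of that pattern follows; the `SketchBasePlugin` class, the `make_plugin_class` function, and the default of False are assumptions, and the real factory may be structured quite differently.

```python
class SketchBasePlugin:
    """Hypothetical base class standing in for the one-shot plugin base."""

    ALLOW_MULTI_DOC_EXTRACTION = False


def make_plugin_class(name, config):
    """Build a plugin subclass whose multi-doc flag comes from ``config``.

    ``config`` is assumed to be a plain dict parsed from the plugin
    config file; a missing key is assumed to mean "disabled".
    """
    allow = bool(config.get("allow_multi_doc_extraction", False))
    attrs = {"ALLOW_MULTI_DOC_EXTRACTION": allow}
    return type(name, (SketchBasePlugin,), attrs)


WaterPlugin = make_plugin_class(
    "WaterPlugin", {"allow_multi_doc_extraction": True}
)
WindPlugin = make_plugin_class("WindPlugin", {})
```

With this shape, a config-driven plugin would opt in by adding one key to its plugin config rather than by subclassing in Python.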

@codecov-commenter

codecov-commenter commented Mar 10, 2026

Codecov Report

❌ Patch coverage is 22.72727% with 68 lines in your changes missing coverage. Please review.
✅ Project coverage is 54.49%. Comparing base (701f9f9) to head (e2e7528).
⚠️ Report is 7 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| compass/plugin/ordinance.py | 15.15% | 56 Missing ⚠️ |
| compass/plugin/one_shot/base.py | 0.00% | 5 Missing ⚠️ |
| compass/plugin/one_shot/components.py | 0.00% | 4 Missing ⚠️ |
| compass/extraction/water/plugin.py | 25.00% | 3 Missing ⚠️ |

❌ Your patch status has failed because the patch coverage (22.72%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
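
As a sanity check on these figures, the arithmetic can be reproduced as below. The sketch assumes patch coverage is simply covered changed lines divided by total coverable changed lines, with no partial hits; the variable names are illustrative. Note that the table lists only files with missing lines, so it does not show every changed line.

```python
# Missing lines per file, copied from the report above
missing_by_file = {
    "compass/plugin/ordinance.py": 56,
    "compass/plugin/one_shot/base.py": 5,
    "compass/plugin/one_shot/components.py": 4,
    "compass/extraction/water/plugin.py": 3,
}
total_missing = sum(missing_by_file.values())

# Reported patch coverage, in percent
reported_patch_pct = 22.72727

# Assumed model: pct = 100 * covered / changed, with
# changed = covered + missing, so changed = missing / (1 - pct / 100)
changed_lines = round(total_missing / (1 - reported_patch_pct / 100))
covered_lines = changed_lines - total_missing
recomputed_pct = 100 * covered_lines / changed_lines

print(total_missing, changed_lines, covered_lines, round(recomputed_pct, 2))
```

Under that assumption the patch touches 88 coverable lines, of which 20 are exercised by tests, well short of the 80% target.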

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #389      +/-   ##
==========================================
- Coverage   54.85%   54.49%   -0.37%     
==========================================
  Files          61       61              
  Lines        5589     5652      +63     
  Branches      525      530       +5     
==========================================
+ Hits         3066     3080      +14     
- Misses       2477     2526      +49     
  Partials       46       46              
| Flag | Coverage Δ |
| --- | --- |
| unittests | 54.49% <22.72%> (-0.37%) ⬇️ |

@ppinchuk
Collaborator Author

Will fix Rust lint in next PR

ppinchuk merged commit 5126e36 into main Mar 10, 2026
25 of 26 checks passed
ppinchuk deleted the pp/multi-doc branch March 10, 2026 20:34

3 participants