Pull request overview
This PR adds support for multi-document extraction in COMPASS plugins, enabling a new mode where all document sources are concatenated into a single context before LLM-based extraction. It also renames the ord_year column to year throughout the codebase, and adds a new "one-shot" water rights demo example.
Changes:
- Adds `ALLOW_MULTI_DOC_EXTRACTION` class attribute and `parse_multi_doc_context_for_structured_data` method to `OrdinanceExtractionPlugin`, allowing plugins to opt into multi-doc mode; refactors `_concat_scrape_results` and renames `parse_single_doc_for_structured_data` to `parse_for_structured_data`
- Renames `ord_year` column to `year` throughout Python code, Rust crates, tests, and database diagram for consistency
- Adds a new `water_rights_demo/one-shot` example showcasing multi-doc extraction with a JSON5 schema and plugin config, and updates the one-shot extraction docs to document the `$qualitative_features` schema key
Reviewed changes
Copilot reviewed 26 out of 31 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| `compass/extraction/context.py` | Adds `multi_doc_context()` method and `_text_from_doc()` helper; changes `text` property to use `multi_doc_context()` |
| `compass/plugin/ordinance.py` | Adds multi-doc extraction methods, new helper functions `_fill_out_multi_file_sources`, `_get_source_inds`, `_fill_in_all_sources`; refactors single-doc path; removes source/year assignment from `_concat_scrape_results` |
| `compass/plugin/one_shot/base.py` | Adds `allow_multi_doc_extraction` config key support; updates class factory to read and set `ALLOW_MULTI_DOC_EXTRACTION` |
| `compass/plugin/one_shot/components.py` | Adds abstract `QUALITATIVE_FEATURES` property; changes qualitative feature detection from `$definitions.qualitative_restrictions` to new `$qualitative_features` key; updates `_to_dataframe` output columns |
| `compass/extraction/water/plugin.py` | Renames `ord_year` to `year`; simplifies `_set_data_year` and `_set_data_sources` |
| `compass/utilities/parsing.py` | Renames `extract_ord_year_from_doc_attrs` to `extract_year_from_doc_attrs` |
| `compass/utilities/finalize.py` | Renames `ord_year` to `year` in column list; updates `compile_run_summary_message` text |
| `compass/utilities/__init__.py` | Exports renamed `extract_year_from_doc_attrs` |
| `crates/compass/src/scraper/ordinance/quantitative.rs` | Renames `ord_year` to `year` in SQL schema, structs, and CSV header |
| `crates/compass/src/scraper/ordinance/qualitative.rs` | Same as `quantitative.rs` |
| `crates/compass/src/lib.rs` | Updates commented-out SQL snippet for column rename |
| `docs/diagram/compass-db.dot` | Updates DB diagram to reflect `year` column rename |
| `examples/water_rights_demo/one-shot/` | New one-shot water rights example with config, schema, plugin config, local docs, and README |
| `examples/water_rights_demo/rag-based/README.rst` | Title update and typo fixes |
| `examples/water_rights_demo/rag-based/config.json5` | Fixes jurisdiction filepath reference |
| `examples/water_rights_demo/README.md` | New top-level README for the water rights demo |
| `examples/one_shot_schema_extraction/README.rst` | Updates docs for new `$qualitative_features` key |
| `examples/one_shot_schema_extraction/wind_schema.json` | Adds `$qualitative_features` array |
| `examples/README.md` | New top-level examples README |
| `pyproject.toml` | Updates ruff version constraint to `>=0.15.5,<0.16` |
| `pixi.lock` | Lock file update for ruff version bump |
| `tests/python/unit/utilities/test_utilities_parsing.py` | Updates test for renamed function |
| `tests/python/unit/utilities/test_utilities_finalize.py` | Updates tests for `year` column rename |
| `tests/python/unit/extraction/test_extraction_context.py` | Adds tests for `multi_doc_context()` method |
| `.gitignore` | Adds `*.code-workspace` to ignore list |
Codecov Report
❌ Your patch status has failed because the patch coverage (22.72%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## main #389 +/- ##
==========================================
- Coverage 54.85% 54.49% -0.37%
==========================================
Files 61 61
Lines 5589 5652 +63
Branches 525 530 +5
==========================================
+ Hits 3066 3080 +14
- Misses 2477 2526 +49
Partials 46 46
Will fix Rust lint in next PR
This has to be enabled on a plugin-by-plugin basis. When enabled, the context is simply the concatenated text from all document sources, instead of the sources being processed one by one.
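To make the difference between the two modes concrete, here is a minimal sketch, not the actual COMPASS implementation. Only the `ALLOW_MULTI_DOC_EXTRACTION` attribute name comes from this PR; the `Doc`, `BasePlugin`, `MultiDocPlugin`, and `build_contexts` names, the `False` default, and the blank-line separator are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a scraped document source."""

    source: str
    text: str


class BasePlugin:
    # Assumed default: documents are processed one at a time.
    ALLOW_MULTI_DOC_EXTRACTION = False


class MultiDocPlugin(BasePlugin):
    # A plugin opts in by overriding the class attribute.
    ALLOW_MULTI_DOC_EXTRACTION = True


def build_contexts(plugin, docs):
    """Return the text contexts an extraction step would receive."""
    if plugin.ALLOW_MULTI_DOC_EXTRACTION:
        # Multi-doc mode: one context holding the text of every source.
        # The blank-line separator is an assumption of this sketch.
        return ["\n\n".join(doc.text for doc in docs)]
    # Single-doc mode: one context per document.
    return [doc.text for doc in docs]


docs = [Doc("a.pdf", "Permit limits."), Doc("b.html", "Well spacing rules.")]
single = build_contexts(BasePlugin(), docs)
multi = build_contexts(MultiDocPlugin(), docs)
```

With these inputs, `single` holds two separate contexts while `multi` holds one combined context, which is why source and year attribution has to be handled differently in multi-doc mode.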