Highlights
- Multi-level tag vocabulary for annotation (`archival_structures.datasets.vocabulary`): a `namespace:type(:subtype)?(#N)?` grammar across four namespaces (`carrier`, `generic`, `position`, `doctype`), informed by SegmOnto, codicology, and diplomatics. `ScanAnnotation` is rebuilt around it, with tags at the scan, page (verso/recto), and line/region level.
- Bulk tagging by cluster (`archival_structures.datasets.bulk_tagging`): a paginated thumbnail grid with a structured namespace/type/subtype tag builder, writing straight into `ScanAnnotation.tags`. Demo: `bulk-tag-annotation-demo.ipynb`.
- `archival_structures.stream_analysis` is now part of the published package: embeddings + UMAP/HDBSCAN clustering, layout features, optional VLM tagging via the Anthropic API, and active-learning ground-truth creation for a plain directory of document images. Demos: `stream-analysis-overview-demo.ipynb`, `stream-analysis-groundtruth-demo.ipynb`.
- Reconciled two other annotation formats (a bulk image tagger, an older region-drawing tool) into the canonical `ScanAnnotation` ground truth.
- Fixed all outstanding Sphinx docstring-formatting warnings; the docs build now fails on any new warning.
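
To make the tag grammar concrete, here is a minimal standalone sketch of parsing `namespace:type(:subtype)?(#N)?` with a regular expression. It is not the API of `archival_structures.datasets.vocabulary`: the names `Tag`, `parse_tag`, and `TAG_RE` are hypothetical, the allowed characters in each part are an assumption, `#N` is assumed to be a non-negative integer, and the example tag strings are made up for illustration rather than taken from the actual vocabulary.

```python
import re
from typing import NamedTuple, Optional

# The four namespaces named in these release notes.
NAMESPACES = ("carrier", "generic", "position", "doctype")

# Assumption: each part is lowercase letters, digits, "_" or "-".
TAG_RE = re.compile(
    r"^(?P<namespace>[a-z_]+)"
    r":(?P<type>[a-z0-9_-]+)"
    r"(?::(?P<subtype>[a-z0-9_-]+))?"   # optional :subtype
    r"(?:#(?P<index>\d+))?$"            # optional #N, assumed numeric
)


class Tag(NamedTuple):
    """Hypothetical container for the parts of one tag string."""
    namespace: str
    type: str
    subtype: Optional[str]
    index: Optional[int]


def parse_tag(text: str) -> Tag:
    """Split a tag string into its parts, or raise ValueError."""
    match = TAG_RE.match(text)
    if match is None:
        raise ValueError(f"not a valid tag: {text!r}")
    if match["namespace"] not in NAMESPACES:
        raise ValueError(f"unknown namespace: {match['namespace']!r}")
    index = match["index"]
    return Tag(
        namespace=match["namespace"],
        type=match["type"],
        subtype=match["subtype"],
        index=int(index) if index is not None else None,
    )


# Illustrative tag strings only; the real vocabulary may differ.
print(parse_tag("position:margin:left#2"))
print(parse_tag("doctype:letter"))
```

The package's own vocabulary module remains the authority on which types and subtypes are valid within each namespace; this sketch only checks the overall shape of a tag.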
Breaking changes
`ScanAnnotation.page_layout` (a single string) and `.lines` (a single label per line) are replaced by multi-valued tag fields (`.tags`, `.pages`, `.lines`, `.regions`) -- see the Vocabulary guide.
Install
pip install --upgrade archival-structures
Demo data
Unchanged since v0.1.0 -- archival-structures-demo-data.zip is the same asset. See the README for the full notebook list.