Releases: Data-Scopes/archival-structures
v0.2.0
Highlights
- Multi-level tag vocabulary for annotation (archival_structures.datasets.vocabulary): a namespace:type(:subtype)?(#N)? grammar across four namespaces (carrier, generic, position, doctype), informed by SegmOnto, codicology, and diplomatics. ScanAnnotation is rebuilt around it, with tags at the scan, page (verso/recto), and line/region level.
- Bulk tagging by cluster (archival_structures.datasets.bulk_tagging): a paginated thumbnail grid with a structured namespace/type/subtype tag builder, writing straight into ScanAnnotation.tags. Demo: bulk-tag-annotation-demo.ipynb.
- archival_structures.stream_analysis is now part of the published package: embeddings + UMAP/HDBSCAN clustering, layout features, optional VLM tagging via the Anthropic API, and active-learning ground-truth creation for a plain directory of document images. Demos: stream-analysis-overview-demo.ipynb, stream-analysis-groundtruth-demo.ipynb.
- Reconciled two other annotation formats (a bulk image tagger, an older region-drawing tool) into the canonical ScanAnnotation ground truth.
- Fixed all outstanding Sphinx docstring-formatting warnings; the docs build now fails on any new warning.
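The tag grammar namespace:type(:subtype)?(#N)? can be illustrated with a small regular-expression parser. This is a minimal sketch and not the package's own implementation: the names ParsedTag and parse_tag are hypothetical, the example tag strings are made up, and two details are assumptions that the release notes do not state (name tokens are lowercase letters, digits, underscores, and hyphens; N is a non-negative integer).

```python
import re
from typing import NamedTuple, Optional

# The four namespaces named in the release notes.
NAMESPACES = ("carrier", "generic", "position", "doctype")

# Assumption: a name token starts with a lowercase letter and may contain
# lowercase letters, digits, underscores, and hyphens.
_TOKEN = r"[a-z][a-z0-9_-]*"

TAG_RE = re.compile(
    rf"(?P<namespace>{'|'.join(NAMESPACES)})"  # required namespace
    rf":(?P<type>{_TOKEN})"                    # required type
    rf"(?::(?P<subtype>{_TOKEN}))?"            # optional :subtype
    rf"(?:#(?P<index>\d+))?"                   # optional #N (assumed integer)
)


class ParsedTag(NamedTuple):
    """Hypothetical container for the parts of one tag string."""
    namespace: str
    type: str
    subtype: Optional[str]
    index: Optional[int]


def parse_tag(tag: str) -> ParsedTag:
    """Split a tag string into its parts, or raise ValueError if it does not fit the grammar."""
    match = TAG_RE.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a valid tag: {tag!r}")
    index = match.group("index")
    return ParsedTag(
        namespace=match.group("namespace"),
        type=match.group("type"),
        subtype=match.group("subtype"),
        index=int(index) if index is not None else None,
    )
```

Under these assumptions, a made-up tag such as position:margin:left#2 splits into namespace position, type margin, subtype left, and index 2, while a tag with an unknown namespace raises ValueError. The real vocabulary module may validate types and subtypes against a fixed list, which this sketch does not attempt.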
Breaking changes
ScanAnnotation.page_layout (a single string) and .lines (single label per line) are replaced by multi-valued tag fields (.tags, .pages, .lines, .regions) -- see the Vocabulary guide.
Install
pip install --upgrade archival-structures
Demo data
Unchanged since v0.1.0 -- archival-structures-demo-data.zip is the same asset. See the README for the full notebook list.
v0.1.0
First public release of archival-structures: tools for analysing PageXML/ATR transcriptions and scan images of archival documents -- detecting and splitting two-page book openings, clustering text lines and page layouts, mining cross-page document-element sequences, ink-colour and missing-transcription detection, and parsing EAD/METS archival finding-aid metadata.
Install
pip install archival-structures
Demo data
The demo notebooks in notebooks/demo/ need real PageXML/thumbnail data that isn't part of the package install. Download archival-structures-demo-data.zip from this release and extract it at the repository root:
unzip archival-structures-demo-data.zip -d .
See the README for the full list of notebooks and what each one demonstrates.
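Where the unzip command is not available, the same extraction can be done with Python's standard library. This is a minimal sketch; the helper name extract_demo_data is hypothetical and not part of the package, and the archive is assumed to have been downloaded already.

```python
import zipfile
from pathlib import Path


def extract_demo_data(zip_path, dest="."):
    """Extract every member of the archive into dest and return the member names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)  # create the target directory if needed
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Usage, run from the repository root after downloading the asset:
# extract_demo_data("archival-structures-demo-data.zip", ".")
```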