Releases: Data-Scopes/archival-structures

v0.2.0

@marijnkoolen released this 30 Jun 16:47

Highlights

  • Multi-level tag vocabulary for annotation (archival_structures.datasets.vocabulary): a namespace:type(:subtype)?(#N)? grammar across four namespaces (carrier, generic, position, doctype), informed by SegmOnto, codicology, and diplomatics. ScanAnnotation is rebuilt around it, with tags at the scan, page (verso/recto), and line/region level.
  • Bulk tagging by cluster (archival_structures.datasets.bulk_tagging): a paginated thumbnail grid with a structured namespace/type/subtype tag builder, writing straight into ScanAnnotation.tags. Demo: bulk-tag-annotation-demo.ipynb.
  • archival_structures.stream_analysis is now part of the published package: embeddings + UMAP/HDBSCAN clustering, layout features, optional VLM tagging via the Anthropic API, and active-learning ground-truth creation for a plain directory of document images. Demos: stream-analysis-overview-demo.ipynb, stream-analysis-groundtruth-demo.ipynb.
  • Reconciled the output of two other annotation tools (a bulk image tagger and an older region-drawing tool) into the canonical ScanAnnotation ground truth.
  • Fixed all outstanding Sphinx docstring-formatting warnings; the docs build now fails on any new warning.
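
The tag grammar from the first highlight, namespace:type(:subtype)?(#N)?, can be read as a regular expression. The sketch below is only an illustration and does not use the package's own parser: parse_tag, ParsedTag and TAG_PATTERN are hypothetical names, and it assumes that N is a non-negative integer and that type and subtype are lowercase identifiers. The example tags are made up and may not exist in the real vocabulary.

```python
import re
from typing import NamedTuple, Optional

# The four namespaces named in the release notes.
NAMESPACES = ("carrier", "generic", "position", "doctype")

# Assumption: type/subtype are lowercase identifiers, N is a non-negative integer.
_NAME = r"[a-z][a-z0-9_-]*"
TAG_PATTERN = re.compile(
    r"(?P<namespace>" + "|".join(NAMESPACES) + r")"
    r":(?P<tag_type>" + _NAME + r")"
    r"(?::(?P<subtype>" + _NAME + r"))?"
    r"(?:#(?P<index>\d+))?"
)


class ParsedTag(NamedTuple):
    namespace: str
    tag_type: str
    subtype: Optional[str]
    index: Optional[int]


def parse_tag(tag: str) -> ParsedTag:
    """Split a tag string into its parts, or raise ValueError if it does not fit."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a valid tag: {tag!r}")
    index = match.group("index")
    return ParsedTag(
        namespace=match.group("namespace"),
        tag_type=match.group("tag_type"),
        subtype=match.group("subtype"),
        index=int(index) if index is not None else None,
    )


# Hypothetical example tags, for illustration only.
print(parse_tag("position:margin:left#2"))
print(parse_tag("doctype:letter"))
```

The optional groups map to None when absent, so callers can tell a tag without a subtype or index apart from one that has them.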

Breaking changes

  • ScanAnnotation.page_layout (a single string) and .lines (single label per line) are replaced by multi-valued tag fields (.tags, .pages, .lines, .regions) -- see the Vocabulary guide.
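
To give a feel for the change from single labels to multi-valued tags, here is a rough migration sketch. It is an assumption-heavy illustration: plain dictionaries stand in for ScanAnnotation, whose real constructor and field types are not shown in these notes; migrate_legacy_annotation is a hypothetical helper; and placing every legacy label under a single namespace is a placeholder choice. The Vocabulary guide is the authority on the actual mapping.

```python
from typing import Any, Dict, List


def migrate_legacy_annotation(
    legacy: Dict[str, Any], namespace: str = "generic"
) -> Dict[str, Any]:
    """Turn single-label fields into lists of namespaced tags (hypothetical layout)."""

    def as_tag(label: str) -> str:
        # Placeholder mapping: prefix the old label with one namespace.
        return f"{namespace}:{label}"

    page_layout = legacy.get("page_layout")
    scan_tags: List[str] = [as_tag(page_layout)] if page_layout else []
    line_tags = {
        line_id: [as_tag(label)]
        for line_id, label in legacy.get("lines", {}).items()
    }
    return {"tags": scan_tags, "lines": line_tags}


# Made-up legacy record: one layout string, one label per line.
old = {"page_layout": "two_column", "lines": {"line-1": "header", "line-2": "body"}}
print(migrate_legacy_annotation(old))
```

Each former single value becomes a one-element list, so further tags can be appended per scan or per line without changing the shape again.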

Install

pip install --upgrade archival-structures

Demo data

Unchanged since v0.1.0 -- archival-structures-demo-data.zip is the same asset. See the README for the full notebook list.

v0.1.0

@marijnkoolen released this 30 Jun 09:15

First public release of archival-structures: tools for analysing PageXML/ATR transcriptions and scan images of archival documents. It covers detecting and splitting two-page book openings, clustering text lines and page layouts, mining cross-page document-element sequences, detecting ink colour and missing transcriptions, and parsing EAD/METS archival finding-aid metadata.

Install

pip install archival-structures

Demo data

The demo notebooks in notebooks/demo/ need real PageXML/thumbnail data that isn't part of the package install. Download archival-structures-demo-data.zip from this release and extract it at the repository root:

unzip archival-structures-demo-data.zip -d .

See the README for the full list of notebooks and what each one demonstrates.