# Coverage, Overlap, Trespass and Excess (COTe) score for Document Layout Analysis
Document Layout Analysis (DLA) is the process of parsing a page into meaningful elements, typically using machine learning models. Traditional evaluation metrics such as IoU, F1, and mAP were designed for 3D-to-2D image projections (e.g. photographs) and can give misleading results for natively 2D printed documents.
cotescore introduces two complementary ideas:
- Structural Semantic Units (SSUs) — a relational labelling approach that shifts focus from the physical bounding boxes of regions to the semantic structure of the content.
- COTe Score — a decomposable metric that breaks page-parsing quality into four interpretable components:
- Coverage — how well predictions cover ground-truth regions
- Overlap — redundant predictions within the same ground-truth region
- Trespass — predictions that cross semantic boundaries into a different SSU
- Excess — a support metric for predictions that fall on background or white-space
COTe is more informative than traditional metrics, reveals distinct model failure modes, and remains useful even when explicit SSU labels are unavailable.
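To build intuition for how the four components partition a page, here is a toy pixel-state sketch. This is not cotescore's implementation, and the exact definitions in the paper differ in detail; `toy_cote_states` and its per-prediction `pred_ssu_ids` argument (which SSU each prediction targets) are hypothetical names invented for illustration.

```python
import numpy as np

# Toy sketch only: NOT the cotescore implementation. It illustrates the idea
# of partitioning pixels into coverage / overlap / trespass / excess states.
# Assumptions (hypothetical): gt_ssu_map is a 2-D int array with 0 for
# background and k > 0 for SSU id k; preds is a list of boolean masks; each
# prediction comes with a pred_ssu_ids entry saying which SSU it targets.

def toy_cote_states(gt_ssu_map, preds, pred_ssu_ids):
    hits = np.zeros(gt_ssu_map.shape, dtype=int)      # predictions touching each pixel
    correct = np.zeros(gt_ssu_map.shape, dtype=bool)  # touched by a right-SSU prediction
    wrong = np.zeros(gt_ssu_map.shape, dtype=bool)    # touched by a wrong-SSU prediction
    for mask, ssu in zip(preds, pred_ssu_ids):
        hits += mask
        correct |= mask & (gt_ssu_map == ssu)
        wrong |= mask & (gt_ssu_map != ssu) & (gt_ssu_map > 0)
    gt = gt_ssu_map > 0
    covered = correct                  # GT pixels correctly covered
    overlapped = gt & (hits > 1)       # redundantly predicted GT pixels
    trespassed = wrong                 # GT pixels claimed across an SSU boundary
    excess = ~gt & (hits > 0)          # predictions spilling onto background
    return covered, overlapped, trespassed, excess

# Two-SSU page: SSU 1 on rows 0-1, SSU 2 on row 2, background on row 3.
gt = np.zeros((4, 4), dtype=int)
gt[0:2, :] = 1
gt[2, :] = 2
big = np.zeros((4, 4), dtype=bool); big[:, 0:3] = True    # spills into SSU 2 and background
dup = np.zeros((4, 4), dtype=bool); dup[0:2, 0:2] = True  # redundant second prediction
cov, ov, tr, ex = toy_cote_states(gt, [big, dup], [1, 1])
print(cov.sum(), ov.sum(), tr.sum(), ex.sum())  # → 6 4 3 3
```

The same pixels can contribute to more than one state (e.g. a redundantly predicted pixel is both covered and overlapped), which is why COTe reports the components separately rather than collapsing them into a single count.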
```shell
pip install cotescore
```

The benchmarking extras include either PyTorch or PaddlePaddle — do not install both in the same environment. These frameworks require different CUDA versions and will conflict:
| Extra | Use case | GPU framework |
|---|---|---|
| `cotescore[benchmarks]` | Torch-based DLA models (DocLayout-YOLO, Heron) | PyTorch |
| `cotescore[paddle-benchmark]` | PaddleOCR / PP-DocLayout | PaddlePaddle |
```shell
# Torch-based benchmarks
pip install "cotescore[benchmarks]"

# PaddlePaddle benchmarks (install PaddlePaddle first, separately)
# See: https://www.paddlepaddle.org.cn/install/quick
pip install "cotescore[paddle-benchmark]"
```

Note: PaddlePaddle must be installed before `cotescore[paddle-benchmark]` — it is not listed as a pip dependency because the correct wheel depends on your CUDA version. Follow the PaddlePaddle installation guide to get the right version.
The library ships with a bundled limerick case study that you can use to try it out immediately.
```python
from cotescore import cote_score, load_limerick_example, extract_ssu_boxes
from cotescore.adapters import boxes_to_gt_ssu_map, boxes_to_pred_masks, eval_shape

# Load the bundled example: ground-truth dict, document image, and example predictions
ground_truth, image, pred_boxes = load_limerick_example()
h, w = image.shape[:2]
_, _, scale = eval_shape(w, h)

# Build tagged SSU-level GT boxes and rasterize to a 2-D SSU id map
gt_boxes = extract_ssu_boxes(ground_truth)
gt_ssu_map = boxes_to_gt_ssu_map(gt_boxes, w, h, scale=scale)
preds = boxes_to_pred_masks(pred_boxes, w, h, scale=scale)

# Compute the COTe score
cote, C, O, T, E = cote_score(gt_ssu_map, preds)
print(f"COTe={cote:.3f} C={C:.3f} O={O:.3f} T={T:.3f} E={E:.3f}")
```

*COTe pixel-state visualisation: each pixel in the document is classified as Coverage (green), Overlap (yellow), Trespass (red), Trespass AND Overlap (purple), or Excess (blue). Produced by notebooks/limerick_analysis.py.*
See notebooks/limerick_analysis.py for a full worked example comparing COTe against F1 and mean IoU at different granularity levels.
| Function | Description |
|---|---|
| `cote_score(gt_ssu_map, preds)` | Returns `(cote, C, O, T, E)` — the full decomposition |
| `coverage(gt_ssu_map, preds)` | Fraction of GT area correctly covered, in [0, 1] |
| `overlap(gt_ssu_map, preds)` | Redundant prediction area within GT, in [0, ∞) |
| `trespass(gt_ssu_map, preds)` | GT area covered by wrong-SSU predictions, in [0, ∞) |
| `cote_class(gt_ssu_map, ssu_to_class, preds)` | Per-class interaction matrices (`ClassCOTeResult`) |
| `iou(box1, box2)` | Intersection over Union for two boxes |
| `mean_iou(preds, gt)` | Mean IoU across all GT boxes |
| `f1(preds, gt, threshold)` | F1 score at a given IoU threshold |
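For reference, the textbook IoU computation that box-level metrics like these build on can be sketched as follows. The `(x1, y1, x2, y2)` corner format and the `box_iou` name are assumptions made here for illustration; check cotescore's docstrings for the box format its `iou()` actually expects.

```python
def box_iou(box1, box2):
    """Textbook IoU for two axis-aligned boxes in (x1, y1, x2, y2) form."""
    # Intersection rectangle: the overlap of the two boxes (may be empty)
    ix1, iy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ix2, iy2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    # Union = sum of areas minus the double-counted intersection
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - inter
    return inter / union if union else 0.0

print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 px² intersection / 7 px² union ≈ 0.143
```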
All functions are importable directly from cotescore:

```python
from cotescore import cote_score, coverage, overlap, iou, mean_iou, cote_class
```

For an alternative visual overview of what the different elements of the COTe score mean, see notebooks/metrics_exploration.py.
```python
from cotescore import compute_cote_masks, visualize_cote_states
import matplotlib.pyplot as plt

masks = compute_cote_masks(gt_ssu_map, preds)
fig, ax = plt.subplots()
visualize_cote_states(image, masks, ax=ax)
plt.show()
```

The library includes loaders for three DLA datasets used in the paper:
```python
from cotescore.dataset import NCSEDataset, DocLayNetDataset, HNLA2013Dataset

# Bundled toy example (no download required)
from cotescore import load_limerick_example
ground_truth, image, pred_boxes = load_limerick_example()
```

The primary interactive example is the Marimo notebook at notebooks/limerick_analysis.py. It demonstrates:
- Visualising SSU, line, and character-level bounding boxes
- The granularity-mismatch problem and how COTe handles it
- Side-by-side comparison of COTe, F1, and mean IoU
- COTe pixel-state visualisation
To run the notebook (requires Marimo):

```shell
pip install marimo
marimo edit notebooks/limerick_analysis.py
```

If you have questions, find a bug, or want to request a feature, please open an issue on GitHub.
If you use cotescore in your research, please cite:
Bourne, Jonathan, Mwiza Simbeye, and Ishtar Govia. “The COTe Score: A Decomposable Framework for Evaluating Document Layout Analysis Models.” arXiv:2603.12718. Preprint, arXiv, March 13, 2026. https://doi.org/10.48550/arXiv.2603.12718.
```bibtex
@misc{bourne2026cote,
  title = {The {COTe} Score: A Decomposable Framework for Evaluating {Document Layout Analysis} Models},
  author = {Bourne, Jonathan and Simbeye, Mwiza and Govia, Ishtar},
  year = {2026},
  month = mar,
  publisher = {arXiv},
  doi = {10.48550/arXiv.2603.12718},
}
```