A benchmark for evaluating LLM extraction accuracy on insurance loss run documents.
LossBench contains 36 synthetic insurance loss run PDFs across 12 test sets (three per set). Each PDF combines one or more loss runs into a single file and is paired with a JSON file of verified correct extraction answers ("ground truth"). The documents vary in format complexity, from simple tables to multi-page horizontal layouts with multiple coverage types per claim. All claims are synthetic, with randomized details, and are designed to capture a wide variety of real-world edge cases and formats.
data/
├── lr-01/ ... lr-12/
│ ├── doc-{XXX}-{1,2,3}.pdf # Test documents
│ └── ground-{XXX}-{1,2,3}.json # Ground truth
results/
├── native.csv # Direct PDF extraction results
├── ocr.csv # OCR-based extraction results
└── hybrid.csv # Hybrid approach results
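As an illustration of the layout above, the following minimal sketch pairs each test PDF with its ground truth file. The function name `pair_documents` is hypothetical and not part of the repository. It assumes only the naming pattern shown in the tree, where a `doc-...` PDF and a `ground-...` JSON in the same folder share the rest of their file name.

```python
from pathlib import Path


def pair_documents(data_dir):
    """Return (pdf_path, ground_truth_path) pairs found under data_dir.

    Hypothetical helper: relies only on the doc-*/ground-* naming
    pattern shown in the directory tree.
    """
    pairs = []
    for pdf in sorted(Path(data_dir).glob("lr-*/doc-*.pdf")):
        # doc-<rest>.pdf -> ground-<rest>.json in the same test set folder
        rest = pdf.stem[len("doc-"):]
        ground = pdf.with_name(f"ground-{rest}.json")
        if ground.exists():
            pairs.append((pdf, ground))
    return pairs
```

Calling `pair_documents("data")` from the repository root should return one pair per test document; PDFs without a matching JSON file are skipped rather than reported.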
Each ground truth JSON contains an array of LossRunItem objects with 17 fields:
| Field | Description |
|---|---|
| `policy_no` | Policy number |
| `claim_id` | Claim identifier |
| `insurer` | Insurance carrier |
| `insured` | Policyholder name |
| `report_date` | Valuation/report date |
| `date_of_loss` | Date of incident |
| `date_reported` | Date claim was reported |
| `closed_date` | Date claim was closed |
| `loss_summary_description` | Description of the loss |
| `claim_status` | Open/Closed |
| `claimant` | Claimant name |
| `claim_coverage_type` | Coverage type (BI, PD, etc.) |
| `loss_reserved` | Outstanding loss reserve |
| `loss_paid` | Loss amount paid |
| `expense_reserve` | Outstanding expense reserve |
| `expense_paid` | Expense amount paid |
| `loss_total_recovered` | Subrogation/recovery |
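A quick way to sanity-check a ground truth file or a model's output against this schema is to compare each item's keys with the 17 field names. The sketch below is an assumption-laden illustration: `LOSS_RUN_FIELDS` and `check_items` are hypothetical names, and it assumes every item is a JSON object that carries all 17 keys (with empty values where a field does not apply), which the README does not state explicitly.

```python
# The 17 field names from the table above, in the same order.
LOSS_RUN_FIELDS = (
    "policy_no", "claim_id", "insurer", "insured",
    "report_date", "date_of_loss", "date_reported", "closed_date",
    "loss_summary_description", "claim_status", "claimant",
    "claim_coverage_type", "loss_reserved", "loss_paid",
    "expense_reserve", "expense_paid", "loss_total_recovered",
)


def check_items(items):
    """Return (index, missing_fields, unexpected_fields) for each bad item.

    Hypothetical helper: assumes each item is a dict keyed by field name.
    """
    expected = set(LOSS_RUN_FIELDS)
    problems = []
    for index, item in enumerate(items):
        keys = set(item)
        if keys != expected:
            problems.append(
                (index, sorted(expected - keys), sorted(keys - expected))
            )
    return problems
```

To use it, load a ground truth file with `json.load` and pass the resulting list to `check_items`; an empty return value means every item has exactly the expected keys.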
| Set | Challenge |
|---|---|
| lr-01, lr-02, lr-03 | Baseline formats |
| lr-04 | Multi-claimant blocks (34 claimants per claim) |
| lr-05 | Multi-coverage TPA format (7 coverage types per claim) |
| lr-06 | Workers' comp (3 coverage types per claim) |
| lr-07 - lr-12 | Various complex formats |
Results CSVs contain per-run metrics:
- `f1`, `precision`, `recall` - Cell-level accuracy
- `expected_rows`, `extracted_rows` - Row counts
- `model`, `provider` - Model tested
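The columns listed above can be read with the standard library alone. This is a minimal sketch, not the project's own tooling: `load_runs` is a hypothetical name, and it assumes the three accuracy columns parse as floats and the two row-count columns parse as integers. Any other columns in the file are ignored.

```python
import csv


def load_runs(csv_file):
    """Parse a results CSV (an open text file) into a list of dicts.

    Hypothetical helper: uses only the column names listed in the README.
    """
    runs = []
    for record in csv.DictReader(csv_file):
        expected = int(record["expected_rows"])
        extracted = int(record["extracted_rows"])
        runs.append({
            "model": record["model"],
            "provider": record["provider"],
            "f1": float(record["f1"]),
            "precision": float(record["precision"]),
            "recall": float(record["recall"]),
            "expected_rows": expected,
            "extracted_rows": extracted,
            # True when the run extracted exactly as many rows as expected
            "row_count_match": expected == extracted,
        })
    return runs
```

For example, `with open("results/native.csv", newline="") as f: runs = load_runs(f)` would load the direct PDF extraction results.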
Row matching uses a composite key: claim_id | claimant | claim_coverage_type
Pass threshold: 95% F1
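The sketch below shows one way composite-key matching and cell-level scoring could fit together. It is not the benchmark's actual scorer. The assumptions are: a cell counts as correct only when the extracted value equals the expected value exactly, precision divides correct cells by extracted cells, recall divides them by expected cells, and F1 is expressed as a fraction so the threshold is `0.95`. All function and constant names are hypothetical.

```python
KEY_FIELDS = ("claim_id", "claimant", "claim_coverage_type")
PASS_THRESHOLD = 0.95  # the 95% F1 threshold, as a fraction (assumption)


def row_key(row):
    """Build the composite key: claim_id | claimant | claim_coverage_type."""
    return " | ".join(str(row.get(field, "")).strip() for field in KEY_FIELDS)


def score(expected_rows, extracted_rows):
    """Cell-level precision, recall and F1 under the stated assumptions."""
    # If two rows share a key, the later one wins in the lookup table.
    extracted_by_key = {row_key(r): r for r in extracted_rows}
    correct = 0
    for exp in expected_rows:
        got = extracted_by_key.get(row_key(exp))
        if got is None:
            continue  # unmatched row: all of its cells count as missed
        correct += sum(1 for f, v in exp.items() if f in got and got[f] == v)
    expected_cells = sum(len(r) for r in expected_rows)
    extracted_cells = sum(len(r) for r in extracted_rows)
    precision = correct / extracted_cells if extracted_cells else 0.0
    recall = correct / expected_cells if expected_cells else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "passed": f1 >= PASS_THRESHOLD,
    }
```

Exact equality is the strictest choice; a real scorer may normalize dates, currency formatting, or whitespace before comparing, which would change the numbers.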
License: MIT