TypeScript evaluation toolkit for benchmarking table extraction accuracy using the RD-TableBench dataset and scoring methodology.
This repo contains:
- A faithful TypeScript port of Reducto's Needleman-Wunsch table similarity scorer
- A dataset loader that downloads the RD-TableBench dataset from HuggingFace
- A benchmark runner that scores extraction predictions against ground truth
- A test suite that validates the scorer produces correct results on known inputs
RD-TableBench is an open benchmark created by Reducto containing 1,000 complex table images manually annotated by PhD-level human labelers. It uses a hierarchical Needleman-Wunsch alignment algorithm to compute table similarity — scoring both structural fidelity and content accuracy.
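For intuition, here is a bare-bones (non-hierarchical) Needleman-Wunsch score over two strings. The `match`/`mismatch`/`gap` values are illustrative defaults, not the scorer's actual parameters, and this sketch omits the free end gaps the real scorer uses:

```typescript
// Minimal Needleman-Wunsch global alignment score for two strings.
// dp[i][j] holds the best score for aligning a[0..i) with b[0..j).
function nwScore(a: string, b: string, match = 1, mismatch = -1, gap = -1): number {
  const m = a.length, n = b.length;
  const dp: number[][] = Array.from({ length: m + 1 }, () => new Array(n + 1).fill(0));
  for (let i = 1; i <= m; i++) dp[i][0] = i * gap; // align prefix of a against gaps
  for (let j = 1; j <= n; j++) dp[0][j] = j * gap; // align prefix of b against gaps
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      dp[i][j] = Math.max(
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? match : mismatch), // align a[i-1] with b[j-1]
        dp[i - 1][j] + gap, // gap in b
        dp[i][j - 1] + gap, // gap in a
      );
    }
  }
  return dp[m][n];
}
```

The hierarchical version runs the same recurrence twice: once over cells within a row, then once over rows within a table, with the row-level substitution score derived from the cell-level alignment.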
We used this benchmark to evaluate DocLD's agentic table extraction. This repo is the evaluation code. See our blog post for the full results and methodology.
```sh
# Install dependencies
npm install

# Download the RD-TableBench dataset from HuggingFace
npm run download

# Score predictions against ground truth
npm run score -- --predictions ./my-predictions --output results.json

# Run scorer tests
npm test
```

The scorer implements the same algorithm as Reducto's grading.py, with identical parameters:
| Parameter | Value | Description |
|---|---|---|
| `S_ROW_MATCH` | 5 | Bonus for aligning two rows |
| `G_ROW` | -3 | Penalty for a row gap (insertion/deletion) |
| `S_CELL_MATCH` | 1 | Score for identical cells |
| `P_CELL_MISMATCH` | -1 | Penalty for completely different cells |
| `G_COL` | -1 | Penalty for a column gap |
- Cell-level: Levenshtein distance normalized to [0, 1], mapped to the range `[P_CELL_MISMATCH, S_CELL_MATCH]`. Partial credit for near-matches.
- Row-level: Needleman-Wunsch alignment using cell-level scores plus the `S_ROW_MATCH` bonus. Free end gaps, so cropped sub-tables aren't penalized.
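The cell-level mapping can be sketched as follows. The `levenshtein` helper and the empty-cell edge case here are illustrative assumptions, not the scorer's actual implementation:

```typescript
const S_CELL_MATCH = 1;
const P_CELL_MISMATCH = -1;

// Hypothetical Levenshtein edit-distance helper (single-row DP).
function levenshtein(a: string, b: string): number {
  const prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diag = prev[0];
    prev[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = prev[j];
      prev[j] = Math.min(
        prev[j] + 1,                              // deletion
        prev[j - 1] + 1,                          // insertion
        diag + (a[i - 1] === b[j - 1] ? 0 : 1),   // substitution
      );
      diag = tmp;
    }
  }
  return prev[b.length];
}

// Distance normalized to [0, 1], then mapped linearly onto
// [P_CELL_MISMATCH, S_CELL_MATCH]: identical cells score 1,
// completely different cells score -1, near-matches land in between.
function cellScore(a: string, b: string): number {
  const maxLen = Math.max(a.length, b.length);
  if (maxLen === 0) return S_CELL_MATCH; // assumption: two empty cells count as identical
  const similarity = 1 - levenshtein(a, b) / maxLen;
  return P_CELL_MISMATCH + similarity * (S_CELL_MATCH - P_CELL_MISMATCH);
}
```

A one-character typo like `Londn` vs. `London` thus scores well above zero rather than being treated as a full mismatch.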
Before comparison, all cell content is normalized: whitespace collapsed, newlines removed, hyphens stripped — identical to grading.py.
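In sketch form, that normalization might look like the following; the regexes and their ordering are assumptions for illustration, and grading.py's exact rules may differ:

```typescript
// Normalize cell text before comparison: strip hyphens, replace
// newlines with spaces, collapse whitespace runs, and trim.
function normalizeCell(text: string): string {
  return text
    .replace(/-/g, '')    // strip hyphens
    .replace(/\n/g, ' ')  // remove newlines
    .replace(/\s+/g, ' ') // collapse whitespace
    .trim();
}
```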
```typescript
import { tableScore } from './src/table-scorer';

const groundTruth = [
  ['Name', 'Age', 'City'],
  ['Alice', '30', 'New York'],
  ['Bob', '25', 'London'],
];

const prediction = [
  ['Name', 'Age', 'City'],
  ['Alice', '30', 'New York'],
  ['Bob', '25', 'Londn'], // minor typo — gets partial credit
];

const score = tableScore(groundTruth, prediction);
console.log(`Similarity: ${(score * 100).toFixed(1)}%`);
```

```typescript
import { tableScore, htmlTableToArray } from './src/table-scorer';

const gt = htmlTableToArray('<table><tr><td>A</td><td>B</td></tr></table>');
const pred = htmlTableToArray('<table><tr><td>A</td><td>B</td></tr></table>');
console.log(tableScore(gt, pred)); // 1.0
```

```sh
# Score pre-computed prediction HTML files against ground truth
npx tsx src/run-benchmark.ts \
  --data ./benchmark/data \
  --predictions ./my-predictions \
  --output results.json

# Limit to first 50 samples
npx tsx src/run-benchmark.ts --predictions ./preds --limit 50
```

The benchmark uses the RD-TableBench dataset created by Reducto:
- 1,000 complex tables from diverse, publicly available documents
- PhD-level human annotation — not programmatic labels
- Challenging scenarios: scanned tables, handwriting, merged cells, multi-level headers, 20+ languages
Download with:

```sh
npm run download
# or
npx tsx src/dataset-loader.ts --output ./benchmark/data
```

| Resource | Link |
|---|---|
| RD-TableBench announcement | Reducto blog |
| Dataset (labels + images) | HuggingFace |
| Original grading code (Python) | grading.py |
| Original conversion code | convert.py |
| Needleman-Wunsch algorithm | Wikipedia |
MIT — see LICENSE.
The RD-TableBench dataset is licensed under CC BY-NC-ND 4.0 by Reducto, Inc.