Japan OCR Mini Benchmark v0.2.0

@K10124 K10124 released this 14 Jun 00:07
· 14 commits to main since this release

This release adds the v0.2.0 synthetic Japanese receipt target run and publication payload.

Snapshot

  • Public snapshot ZIP: japan_ocr_mini_benchmark_public_v0.2.0_snapshot.zip
  • ZIP size: 8103466 bytes
  • ZIP sha256: ea74b0b9b591e5ee4e1b1401031e8d5f724ad7dfa0a704c9defd04b1b0339b9b
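
A downloaded copy of the snapshot can be checked against the size and digest listed above. The following is a minimal sketch using only the Python standard library; the helper names `file_sha256` and `verify_snapshot` are illustrative and not part of the release, and the local file path is an assumption.

```python
import hashlib
from pathlib import Path

# Values copied from the Snapshot section of these release notes.
EXPECTED_SIZE = 8103466
EXPECTED_SHA256 = "ea74b0b9b591e5ee4e1b1401031e8d5f724ad7dfa0a704c9defd04b1b0339b9b"


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_snapshot(path, expected_size=EXPECTED_SIZE, expected_sha256=EXPECTED_SHA256):
    """Return True only if both the byte size and the SHA-256 digest match."""
    path = Path(path)
    if path.stat().st_size != expected_size:
        return False
    return file_sha256(path) == expected_sha256.lower()


if __name__ == "__main__":
    # Hypothetical local path: adjust to wherever the ZIP was saved.
    zip_path = Path("japan_ocr_mini_benchmark_public_v0.2.0_snapshot.zip")
    if zip_path.exists():
        print("snapshot ok:", verify_snapshot(zip_path))
    else:
        print("snapshot not found:", zip_path)
```

Checking the size first is a cheap early exit; the digest comparison is the check that actually establishes integrity.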

Release Notes

Japan OCR Mini Benchmark v0.2.0 Release Notes

Status

  • Release status: Release Candidate
  • RC status: release_candidate_ready
  • Target run ID: v020_target_20260613_221713
  • Manual visual review: ok_by_user_step148
  • Created at: 2026-06-13T22:42:40

Highlights

  • Added a v0.2.0 synthetic Japanese receipt target run with 20 generated records.
  • Added both clean and noisy rendered receipt images.
  • Introduced hybrid item generation using a validated LLM-approved item pool plus deterministic item master data.
  • Strengthened noisy image rendering with resolution loss, local print fading, stroke-level kasure (faint, broken strokes), thermal banding, local blur patches, JPEG roundtrip compression, and safe shift-blend motion blur.
  • Added randomized fictional store locations across Japan, excluding nearby Osaka/Kita-ku-style local place names.
  • Changed parking receipts to omit the tax breakdown, which better matches receipts printed by payment machines.
  • Added a review gallery and shortlist review files for human visual inspection.

Generation Summary

  • Requested records: 20
  • Successful records: 20
  • Failed records: 0
  • Documents with LLM-approved items: 19
  • LLM-approved item count: 56
  • Item-master item count: 124
  • LLM item mix ratio: 0.3111
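
The reported mix ratio is consistent with the share of LLM-approved items among all generated items, 56 / (56 + 124). The sketch below reproduces that arithmetic; the formula is inferred from the counts above rather than taken from the generator's source, and the function name is hypothetical.

```python
# Counts copied from the Generation Summary.
LLM_APPROVED_ITEMS = 56
ITEM_MASTER_ITEMS = 124


def llm_item_mix_ratio(llm_count, master_count, digits=4):
    """Share of LLM-approved items among all items (assumed definition)."""
    total = llm_count + master_count
    if total == 0:
        raise ValueError("no items generated")
    return round(llm_count / total, digits)


ratio = llm_item_mix_ratio(LLM_APPROVED_ITEMS, ITEM_MASTER_ITEMS)
print(ratio)  # 0.3111
```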

Validation Summary

  • Validation status: warning
  • Record count: 20
  • Status counts: {'ok': 8, 'warning': 12}
  • Issue code top counts: {'clean_noisy_size_large_difference': 12}
  • Noisy profile counts: {'light': 3, 'hard': 8, 'medium': 9}
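
The counts above can be cross-checked against the record count. This is a small sketch that parses the dictionary literals exactly as printed and confirms that the per-status and per-profile counts each add up to 20 records; the `check_summary` helper is illustrative and not part of the benchmark tooling.

```python
import ast

# Values copied from the Validation Summary, kept in their printed form.
RECORD_COUNT = 20
STATUS_COUNTS = ast.literal_eval("{'ok': 8, 'warning': 12}")
NOISY_PROFILE_COUNTS = ast.literal_eval("{'light': 3, 'hard': 8, 'medium': 9}")


def check_summary(record_count, status_counts, profile_counts):
    """Return a list of inconsistencies; an empty list means the totals agree."""
    problems = []
    if sum(status_counts.values()) != record_count:
        problems.append("status counts do not sum to the record count")
    if sum(profile_counts.values()) != record_count:
        problems.append("noisy profile counts do not sum to the record count")
    return problems


problems = check_summary(RECORD_COUNT, STATUS_COUNTS, NOISY_PROFILE_COUNTS)
print(problems or "summary counts are consistent")
```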

Known Validation Warning

  • clean_noisy_size_large_difference is expected for this release candidate.
  • The warning appears because noisy images include stronger degradation, rotation, canvas margins, shadows, and camera-like framing.
  • Human visual review was completed and accepted before freezing the release candidate.

Files

  • Target run directory: release_v0.2.0
  • Release candidate summary: release_v0.2.0\release_candidate\v020_release_candidate_summary.json
  • Release candidate checklist: release_v0.2.0\release_candidate\v020_release_candidate_checklist.md
  • Release candidate file inventory: release_v0.2.0\release_candidate\v020_release_candidate_files.csv
  • Full review HTML: release_v0.2.0\review_audit\v020_review_gallery_step147.html
  • Shortlist review HTML: release_v0.2.0\review_audit\v020_review_shortlist_step147.html

Notes

  • This release candidate uses synthetic fictional receipt data.
  • Store names, branch names, addresses, product names, and transaction contents are artificial test data.
  • The dataset is intended for OCR/VLM evaluation and workflow testing, not for representing real transactions.