1.5 OCR, hOCR, and search

Randall Floyd edited this page Apr 29, 2019 · 5 revisions

Hyrax Reboot

BPL blacklight_iiif_search gem

Scratchpad

Search hit coordinates

The BPL gem expects JSON derived from ALTO. The point where that JSON is turned into coordinates and attached to the service URLs in search responses is at:

So we need our own versions of coordinates and coordinates_raw that use the word boundaries files we should be creating in ESSI (or at least used to create in Pumpkin). The pieces that helped translate those files into coordinates for service URLs are:
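
As a rough illustration of that kind of translation (not the Pumpkin or BPL code itself), the sketch below turns word bounding boxes into canvas URIs with an `#xywh=` fragment, which is the form IIIF Content Search hits point at. It assumes the word boundaries have already been parsed into a hash of lowercase word to a list of hOCR-style `x0, y0, x1, y1` boxes; that layout, the method names `bbox_to_xywh` and `hit_targets`, and the example canvas URI are all hypothetical.

```ruby
# Hypothetical helpers: the word-boundaries layout used here is an
# assumption, not the actual ESSI or Pumpkin file format.

# Convert an hOCR-style bounding box (x0, y0, x1, y1) into the
# x, y, width, height form used by #xywh media fragments.
def bbox_to_xywh(bbox)
  x0, y0, x1, y1 = bbox
  [x0, y0, x1 - x0, y1 - y0]
end

# Build one target URI per occurrence of the query term on a canvas.
# word_boundaries: { "word" => [[x0, y0, x1, y1], ...] } (assumed layout)
def hit_targets(word_boundaries, query, canvas_uri)
  boxes = word_boundaries.fetch(query.downcase, [])
  boxes.map do |bbox|
    "#{canvas_uri}#xywh=#{bbox_to_xywh(bbox).join(',')}"
  end
end

# Example data, made up for illustration.
boundaries = { "whale" => [[100, 200, 300, 250], [50, 900, 210, 940]] }
targets = hit_targets(boundaries, "Whale", "https://example.org/canvas/1")
puts targets
```

A term with no recorded boxes yields an empty list, so the caller can skip the canvas rather than emit a hit without coordinates.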

Relevant commits

Pumpkin commits related to OCR, hOCR, and word boundaries for search within:

Pumpkin commits:

Plum commits:

Git log

commit 37af9e2b63f7fe4fec8399b180891386f808a870
Author: Andy Smith <andjsmit@iupui.edu>
Date:   Thu Jun 8 08:44:24 2017 -0400

    added code from ocracoke for search api

commit 9a5bc487fc4420a8af3cf35b842572bd8cabb3ea
Merge: e2dff82 ed6367b
Author: Andy Smith <andjsmit@iupui.edu>
Date:   Thu Feb 16 09:53:54 2017 -0500

    Merge pull request #93 from IU-Libraries-Joint-Development/HPT-989_ocr_jp2
    
    HPT-989 Add OCR generation to JP2 derivatives creation code

commit db4363f64966c15c57f1b038c9cc8992418029d8
Merge: d218a84 1b2d8a3
Author: Trey Pendragon <tterrell@princeton.edu>
Date:   Tue Jun 7 11:34:46 2016 -0700

    Merge pull request #661 from pulibrary/repo-ocr
    
    Adding OCR to the repository

commit 03323b101b67219819babd0ed3452a01fdcedcf0
Author: Trey Pendragon <tterrell@princeton.edu>
Date:   Wed Apr 27 09:22:30 2016 -0700

    Don't display ocr_language in manifest. (#550)
    
    Closes #549.

commit 2c1c421d14ac3815ff164c95afdac85d5d7b0eb1
Merge: 817c2fe 3c86b11
Author: Trey Pendragon <tterrell@princeton.edu>
Date:   Tue Apr 26 08:32:32 2016 -0700

    Merge pull request #544 from pulibrary/ocr-language-filtering
    
    Only try to use languages supported by Tesseract

commit 3bcb7cf0a75ea60770d10abbd080468598022ae9
Merge: af94e09 389dac6
Author: Trey Pendragon <tterrell@princeton.edu>
Date:   Wed Mar 9 13:55:25 2016 -0800

    Merge pull request #480 from pulibrary/default_ocr_language
    
    Using resource language as default OCR language - closes #399

commit 14e9ffc410bc2bb850c31a0317edc5d1c731460a
Merge: 865680d ec19679
Author: Esmé Cowles <escowles@ticklefish.org>
Date:   Thu Mar 3 16:02:25 2016 -0500

    Merge pull request #462 from pulibrary/ocr_prioritize
    
    Prioritize OCR as lowest

commit 0a74ffa234686e8cfbaef3aaec537d5ba272cc92
Merge: f7a12fc c725c60
Author: Esmé Cowles <escowles@ticklefish.org>
Date:   Mon Feb 1 15:15:00 2016 -0500

    Merge pull request #398 from pulibrary/ocr_language
    
    Set OCR Language

commit 543fa4b752cfcdc9ec73e1e409a0ba92bca81d83
Merge: 5b859ee c710560
Author: Esmé Cowles <escowles@ticklefish.org>
Date:   Wed Jan 27 12:26:07 2016 -0500

    Merge pull request #387 from pulibrary/generate_hocr
    
    Generate hOCR for FileSets and make them searchable