Add different classes to hocr output depending on BlockType
These classes are taken from the hOCR specification, and seem
to map well onto the BlockType types. There are probably more that
could be added.
nickjwhite committed May 14, 2019
1 parent b9b74a6 commit 068eb4c
Showing 1 changed file with 15 additions and 2 deletions.
17 changes: 15 additions & 2 deletions src/api/hocrrenderer.cpp
@@ -209,8 +209,21 @@ char* TessBaseAPI::GetHOCRText(ETEXT_DESC* monitor, int page_number) {
         AddBoxTohOCR(res_it.get(), RIL_PARA, hocr_str);
       }
       if (res_it->IsAtBeginningOf(RIL_TEXTLINE)) {
-        hocr_str << "\n <span class='ocr_line'"
-                 << " id='"
+        hocr_str << "\n <span class='";
+        switch (res_it->BlockType()) {
+          case PT_HEADING_TEXT:
+            hocr_str << "ocr_header";
+            break;
+          case PT_PULLOUT_TEXT:
+            hocr_str << "ocr_textfloat";
+            break;
+          case PT_CAPTION_TEXT:
+            hocr_str << "ocr_caption";
+            break;
+          default:
+            hocr_str << "ocr_line";
+        }
+        hocr_str << "' id='"
                  << "line_" << page_id << "_" << lcnt << "'";
         AddBoxTohOCR(res_it.get(), RIL_TEXTLINE, hocr_str);
       }
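
The mapping this change introduces can be looked at in isolation. The following is a minimal, self-contained sketch and not the code in hocrrenderer.cpp: `BlockKind` is a hypothetical stand-in for Tesseract's block type enum (only the values named in the diff plus one generic text value), and `LineClassFor` and `LineSpanOpen` are illustrative helper names that do not exist in Tesseract.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for Tesseract's block type enum. Only the values
// used in the diff are listed, plus one generic value for ordinary text.
enum BlockKind {
  PT_FLOWING_TEXT,
  PT_HEADING_TEXT,
  PT_PULLOUT_TEXT,
  PT_CAPTION_TEXT
};

// Illustrative helper (not a Tesseract function): returns the hOCR class
// for a text line, using the same mapping as the switch in the diff.
inline std::string LineClassFor(BlockKind type) {
  switch (type) {
    case PT_HEADING_TEXT:
      return "ocr_header";
    case PT_PULLOUT_TEXT:
      return "ocr_textfloat";
    case PT_CAPTION_TEXT:
      return "ocr_caption";
    default:
      // Every other block type keeps the previous behaviour.
      return "ocr_line";
  }
}

// Illustrative helper (not a Tesseract function): builds the opening part
// of the line span up to the id attribute. The bounding box and closing
// '>' are left out, as they are appended elsewhere in the real renderer.
inline std::string LineSpanOpen(BlockKind type, int page_id, int lcnt) {
  std::ostringstream out;
  out << "<span class='" << LineClassFor(type) << "' id='line_" << page_id
      << "_" << lcnt << "'";
  return out.str();
}
```

Under these assumptions, a line in a heading block would get `class='ocr_header'` while its id keeps the `line_<page>_<n>` form, so a consumer that selects lines only by `class='ocr_line'` would no longer match heading, pull-out, or caption lines.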