## Parsing REPUBLIC hOCR Files

The scans of the printed RSG volumes have the following characteristics:

- all scans:
  - have two pages per scan
  - have up to 4 columns per scan, 2 per page 
  - full scan is around 4800 pixels wide, left page is up to pixel 2400, right page is from pixel 2400 (roughly)
- scans of index pages
  - have no page numbers
- scans of resolution pages
  - have page numbers (left-side page is even, right-side page is odd)
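
As a rough illustration of the layout above, the page side of a column can be derived from its horizontal offset in the full scan. This is a minimal sketch, not the parser's own logic: `PAGE_BOUNDARY` and `page_side` are hypothetical names, and 2400 is only the approximate midpoint mentioned above.

```python
# Approximate x-coordinate of the gutter between the left and right page.
# Assumption: taken from the rough figure of 2400 pixels; real scans vary.
PAGE_BOUNDARY = 2400


def page_side(column_left: int, boundary: int = PAGE_BOUNDARY) -> str:
    """Return 'even' for a column on the left page, 'odd' for one on the right page."""
    return "even" if column_left < boundary else "odd"


print(page_side(251))   # column starting near the left edge of the scan
print(page_side(2650))  # column starting to the right of the gutter
```

Because the boundary is only approximate, columns that start close to pixel 2400 may need a manual check.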
  
### Columns

The scans are normalized such that the columns are straight. The text width should be around 1000 pixels. Some columns are not cut out properly, resulting in columns that are either too narrow (some of the column text is missing) or too wide (the hOCR output contains partial text from two columns).
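
A simple width check can flag columns that were probably cut out incorrectly. The sketch below is illustrative only: `classify_column_width` and the `TOLERANCE` margin are assumptions, and only the expected width of roughly 1000 pixels comes from the description above.

```python
EXPECTED_WIDTH = 1000  # approximate text width in pixels, from the description above
TOLERANCE = 200        # hypothetical margin; tune against real columns


def classify_column_width(width: int) -> str:
    """Label a column as 'too_narrow', 'ok', or 'too_wide' based on its pixel width."""
    if width < EXPECTED_WIDTH - TOLERANCE:
        return "too_narrow"
    if width > EXPECTED_WIDTH + TOLERANCE:
        return "too_wide"
    return "ok"


for width in (650, 1020, 1700):
    print(width, classify_column_width(width))
```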

### Index pages

- start of entry: 
  - line starts left-aligned
- end of entry:
  - end of line possibly before end of text column. 
  - One or more page numbers


### Resolution pages

- header:
  - near the top of the page (less than 350 pixels from the top)
  - page has header with:
    - even numbered pages: date page_number year
    - odd numbered pages: year page_number date
  - columns have half of page header, e.g.:
    - even numbered pages: 
      - first column: date left aligned and part of page_number right aligned
      - second column: part of page_number left aligned and year right aligned
    - odd numbered pages: 
      - first column: year left aligned and part of page_number right aligned
      - second column: part of page_number left aligned and date right aligned
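
The header rules above can be written down as a small lookup. This is a sketch under assumptions: `is_header_line` and `expected_header_fields` are hypothetical helpers, and the only values taken from the text are the 350-pixel limit and the left-to-right order of the header elements.

```python
HEADER_MAX_TOP = 350  # header lines sit less than 350 pixels from the top


def is_header_line(line_top: int) -> bool:
    """Return True if a line is high enough on the page to be part of the header."""
    return line_top < HEADER_MAX_TOP


def expected_header_fields(page_num: int) -> dict:
    """Return the (left-aligned, right-aligned) header elements expected per column."""
    if page_num % 2 == 0:
        # even page: date page_number year
        return {
            "first_column": ("date", "page_number"),
            "second_column": ("page_number", "year"),
        }
    # odd page: year page_number date
    return {
        "first_column": ("year", "page_number"),
        "second_column": ("page_number", "date"),
    }


print(expected_header_fields(154))
print(expected_header_fields(155))
```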
      
### Viewer

- page viewer: https://images.huygens.knaw.nl/assets/argos/index.html
- list of page URLs: https://images.huygens.knaw.nl/api/argos


### National Archive site

- search in the archive: https://www.nationaalarchief.nl/onderzoeken/index/nt00444?searchTerm=
- search the index: https://www.nationaalarchief.nl/onderzoeken/zoekhulpen/voc-opvarenden
- example page: https://www.nationaalarchief.nl/onderzoeken/index/nt00444/d110980c-c864-11e6-9d8b-00505693001d


In [1]:
# This reload library is just used for developing the REPUBLIC hOCR parser 
# and can be removed once this module is stable.
%reload_ext autoreload
%autoreload 2


# This is needed to add the repo dir to the path so jupyter
# can load the republic modules directly from the notebooks
import os
import sys
repo_dir = os.path.split(os.getcwd())[0]
if repo_dir not in sys.path:
    sys.path.append(repo_dir)

In [2]:
import json
import os
import re
from collections import defaultdict
#from republic.parser.generic_hocr_parser import make_hocr_doc
import republic.parser.republic_page_parser as page_parser
import republic.parser.republic_paragraph_parser as paragraph_parser
import republic.parser.republic_file_parser as file_parser

from elasticsearch import Elasticsearch
import republic.elastic.republic_elasticsearch as rep_es

es = Elasticsearch()


# The hOCR file name contains relevant information for parsing. Here's an example:
# NL-HaNA_1.01.02_3780_0016.jpg-0-251-98--0.40.hocr

# NL-HaNA_1.01.02 is the name of the archive
# 3780_0016 identifies the inventory number (3780) and the scan number within it (0016)
# 0-251-98--0.40 identifies four aspects:
#   1. the number of the column (0)
#   2. the offset from the left (251)
#   3. the offset from the top (98)
#   4. and the slant (-0.40)
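
The file name format described above can be unpacked with a regular expression. The following is a minimal sketch and not the code in `republic_file_parser`: `parse_hocr_filename` and the group names are hypothetical, and reading `3780` as an inventory number and `0016` as a scan number is an assumption.

```python
import re

# Pattern for names like NL-HaNA_1.01.02_3780_0016.jpg-0-251-98--0.40.hocr
# Assumption: 3780 is the inventory number and 0016 the scan number.
HOCR_FILENAME = re.compile(
    r"^(?P<archive>NL-HaNA_[\d.]+)_(?P<inventory>\d+)_(?P<scan>\d+)\.jpg"
    r"-(?P<column>\d+)-(?P<left>\d+)-(?P<top>\d+)-(?P<slant>-?\d+\.\d+)\.hocr$"
)


def parse_hocr_filename(fname: str) -> dict:
    """Split an hOCR column file name into its archive, scan and column parts."""
    match = HOCR_FILENAME.match(fname)
    if match is None:
        raise ValueError(f"unexpected hOCR file name: {fname}")
    info = match.groupdict()
    # Numeric parts are converted; the archive name stays a string.
    for key in ("inventory", "scan", "column", "left", "top"):
        info[key] = int(info[key])
    info["slant"] = float(info["slant"])
    return info


print(parse_hocr_filename("NL-HaNA_1.01.02_3780_0016.jpg-0-251-98--0.40.hocr"))
```

Raising an error on names that do not match makes badly named files visible early instead of silently producing wrong offsets.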



### Reading column scans for a single volume

1. get scan file info
    - scan number, page number, page side, column number, slant, page
2. iterate over pages
    - create hocr_page
    - determine page type: index, resolution, other
    

In [3]:
#from republic.parser.generic_hocr_parser import make_hocr_doc
import republic.parser.republic_page_parser as page_parser
import republic.parser.republic_file_parser as file_parser
from republic.config.republic_config import base_config, set_config_year

import copy

year = 1725
data_dir = "/Users/marijnkoolen/Data/Projects/REPUBLIC/hocr"


def get_pages_info(config):
    scan_files = file_parser.get_files(config["data_dir"])
    print("Number of scan files:", len(scan_files))
    return file_parser.gather_page_columns(scan_files)

year_config = set_config_year(base_config, year, data_dir)
pages_info = get_pages_info(year_config)



Number of scan files: 2161


## Indexing Page Data in Elasticsearch

Index the resolution volumes at the page level.

Every scan contains two pages. Since index terms reference page numbers, we want to be able to access individual pages for later matching.
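
To make the scan-to-page relation concrete, the sketch below derives the two page records for a scan. It is illustrative only: `pages_for_scan` is a hypothetical helper, the identifier pattern follows the page ids printed in this notebook (e.g. `year-1725-scan-6-even`), and numbering the left-hand page as `scan_num * 2` is an assumption that matches the `special_pages` entries in this notebook (scan 77 holds pages 154 and 155).

```python
def pages_for_scan(year: int, scan_num: int) -> list:
    """Return one record for the even (left) and one for the odd (right) page of a scan."""
    pages = []
    for offset, side in enumerate(("even", "odd")):
        pages.append({
            "page_id": f"year-{year}-scan-{scan_num}-{side}",
            "scan_num": scan_num,
            # Assumption: page numbers run continuously at two pages per scan.
            "page_num": scan_num * 2 + offset,
            "page_side": side,
        })
    return pages


for page in pages_for_scan(1725, 77):
    print(page["page_id"], page["page_num"])
```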

### Determining Page Type

We want to parse index pages differently from resolution pages and filter out non-text pages and pages where the columns are not properly identified.

So a first step is to use the page layout and content to distinguish pages containing indices from pages containing resolution summaries. There are also title pages that indicate where a new part starts (e.g. indices, resolutions of the first half of the year, resolutions of the second half of the year).

For examples of title pages, see: https://www.nationaalarchief.nl/onderzoeken/archief/1.01.02/inventaris?inventarisnr=3780&scans-inventarispagina=43&activeTab=gahetnascans#tab-heading

In [4]:
special_pages = {
    154: {
        "scan_num": 77,
        "page_num": 154,
        "type_page_num": 64,
        "special_type": "table",
    },
    155: {
        "scan_num": 77,
        "page_num": 155,
        "type_page_num": 65,
        "special_type": "table",
    },
    156: {
        "scan_num": 78,
        "page_num": 156,
        "type_page_num": 66,
        "special_type": "table",
    },
}

In [5]:
# What page info do we get
import json

for page_id in pages_info:
    # Only inspect the pages of scan 6
    if pages_info[page_id]["scan_num"] != 6:
        continue
    print(page_id)
    #print(json.dumps(pages_info[page_id], indent=2))
    for column_info in pages_info[page_id]["columns"]:
        #print(json.dumps(column_info, indent=2))
        column_hocr = page_parser.get_column_hocr(column_info, year_config)
        #print(json.dumps(column_hocr, indent=2))
    break



year-1725-scan-6-even


In [14]:
from elasticsearch import Elasticsearch
import republic.elastic.republic_elasticsearch as rep_es

es = Elasticsearch()

rep_es.parse_pre_split_column_inventory(es, pages_info, year_config, delete_index=False)


### Adjusting Incorrect Page Type Assignments and Page Numbers

**Problem 1**: For some pages the page type may be incorrectly identified (e.g. an index page identified as a resolution page or vice versa). This mainly happens on pages with little text content or pages where the columns are misidentified. 

**Solution**: Using the title pages as part separators, and knowing that the indices precede the resolution pages, we can identify misclassified pages and correct their labels.
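
The idea can be sketched as a single pass over the pages in scan order. This is not the implementation of `page_checks.correct_page_types`: `relabel_by_part` is a hypothetical helper, the labels `title_page` and `index_page` are assumed names (only `resolution_page` appears in this notebook), and the sketch assumes the first title page opens the index part while every later title page opens a resolution part.

```python
def relabel_by_part(pages: list) -> list:
    """Return copies of the pages with index/resolution labels made consistent per part."""
    corrected = []
    title_pages_seen = 0
    for page in pages:
        page = dict(page)  # work on a copy, leave the input untouched
        if page["page_type"] == "title_page":
            title_pages_seen += 1
        elif page["page_type"] in ("index_page", "resolution_page") and title_pages_seen > 0:
            # Assumption: part 1 holds the indices, later parts hold resolutions.
            page["page_type"] = "index_page" if title_pages_seen == 1 else "resolution_page"
        corrected.append(page)
    return corrected


sample = [{"page_type": t} for t in (
    "title_page", "index_page", "resolution_page",  # resolution label inside the index part
    "title_page", "resolution_page", "index_page",  # index label inside a resolution part
)]
print([page["page_type"] for page in relabel_by_part(sample)])
```

Pages with any other label (for example empty pages) are left as they are, since the part they fall in says nothing about whether they contain text.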

**Problem 2**: Some pages are duplicates of the preceding scan. When the page-turning mechanism fails, subsequent scans are images of the same two pages. Duplicate pages should therefore come in pairs, that is, the even and odd sides of scan $n$ are duplicates of the even and odd sides of scan $n-1$. Shingling or straightforward text tiling won't work because of OCR variation: many words may be recognized slightly differently, and lines and words may not align.

**Solution**: Compare each pair of even+odd pages against the preceding pair of even+odd pages, using Levenshtein distance. This deals with slight character-level variations due to OCR. Most pairs will be very dissimilar, so a heuristic threshold can be used to decide whether pages are duplicates.
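
A minimal, dependency-free sketch of this comparison follows. It is not the code behind `page_checks.detect_duplicate_scans`: the function names are hypothetical, the similarity is normalized by the longer text, and the threshold of 0.8 is an assumed value that would need tuning on real pages.

```python
def levenshtein(a: str, b: str) -> int:
    """Compute the edit distance between two strings with a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Return 1.0 for identical texts, moving toward 0.0 as they diverge."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def is_duplicate_scan(current_pair, previous_pair, threshold: float = 0.8) -> bool:
    """Both the even and the odd page must resemble their counterpart in the previous scan."""
    return all(similarity(cur, prev) >= threshold
               for cur, prev in zip(current_pair, previous_pair))


print(is_duplicate_scan(("Veneris den 5. Januarii", "Extract uyt het Register"),
                        ("Veneris den 5. Januarij", "Extract uyt het Regifter")))
```

The pure-Python distance is quadratic in the text length, so for full pages an optimized edit-distance library is likely the more practical choice.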

**Problem 3**: Page numbers of numbered pages are reset per part, starting from page 1, but the title page separating the first and second halves of the year should not reset the page numbering.

**Solution**: Iterate over the pages, using a flag to keep track of whether we're in the indices part or a resolution part. If the title page is within the resolution part, update the page numbers by incrementing from the previous page.
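
The flag-based pass can be sketched as follows. This is an illustration rather than the logic of `page_checks.correct_page_numbers`: `renumber_pages` is a hypothetical helper, `title_page` is an assumed label, and the sketch assumes that every page after the first resolution page, including the title page between the two halves of the year, takes the next number.

```python
def renumber_pages(pages: list) -> list:
    """Return copies of the pages with continuous numbering across the resolution parts."""
    renumbered = []
    in_resolution_part = False
    previous_num = 0
    for page in pages:
        page = dict(page)  # work on a copy, leave the input untouched
        if page["page_type"] == "resolution_page":
            in_resolution_part = True
        if in_resolution_part:
            # Keep counting from the previous page, so a title page does not reset to 1.
            previous_num += 1
            page["type_page_num"] = previous_num
        renumbered.append(page)
    return renumbered


sample = [
    {"page_type": "index_page", "type_page_num": 1},
    {"page_type": "resolution_page", "type_page_num": 1},
    {"page_type": "resolution_page", "type_page_num": 2},
    {"page_type": "title_page", "type_page_num": 1},       # start of the second half
    {"page_type": "resolution_page", "type_page_num": 1},  # numbering was reset here
]
print([page["type_page_num"] for page in renumber_pages(sample)])
```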


In [None]:
import republic.elastic.republic_page_checks as page_checks

page_checks.correct_page_types(es, year_config)




In [None]:

page_checks.detect_duplicate_scans(es, year_config)


In [None]:

page_checks.correct_page_numbers(es, year_config)



### Extracting Resolutions From Pages

Identify:

- resolution dates
- resolution participant lists
- resolution text blocks

In [None]:
from republic.fuzzy.fuzzy_context_searcher import FuzzyContextSearcher
from republic.fuzzy.fuzzy_person_name_searcher import FuzzyPersonNameSearcher
from republic.model.republic_phrase_model import resolution_phrases, participant_list_phrases, spelling_variants

fuzzysearch_config = {
    "char_match_threshold": 0.8,
    "ngram_threshold": 0.6,
    "levenshtein_threshold": 0.8,
    "ignorecase": False,
    "ngram_size": 2,
    "skip_size": 2,
}


fuzzy_searcher = FuzzyContextSearcher(fuzzysearch_config)
fuzzy_person_searcher = FuzzyPersonNameSearcher(fuzzysearch_config)

fuzzy_searcher.index_keywords(resolution_phrases)
fuzzy_searcher.index_spelling_variants(spelling_variants)
#fuzzy_searcher.index_distractor_terms(distractor_terms)



In [None]:
from republic.fuzzy.fuzzy_context_searcher import FuzzyContextSearcher
import republic.elastic.republic_elasticsearch as rep_es

es = Elasticsearch()

fuzzysearch_config = {
    "char_match_threshold": 0.8,
    "ngram_threshold": 0.6,
    "levenshtein_threshold": 0.8,
    "ignorecase": False,
    "ngram_size": 2,
    "skip_size": 2,
    "paragraph_index": "republic_paragraphs",
    "paragraph_doc_type": "paragraph"
}

missing_dates = [
    {"date_string": "Veneris den 5. Januarii 1725.", "page_start": 11, "page_end": 14},
    {"date_string": "Mercuri den 10. Januarii 1725.", "page_start": 21, "page_end": 28},
]

fuzzy_date_searcher = FuzzyContextSearcher(fuzzysearch_config)

for missing_date in missing_dates:
    fuzzy_date_searcher.index_keywords([missing_date["date_string"]])
    for page_num in range(missing_date["page_start"], missing_date["page_end"] + 1):
        paragraphs = rep_es.retrieve_paragraph_by_type_page_number(es, page_num, year_config)
        for paragraph in paragraphs:
            if page_num == 25:
                print(paragraph["text"])
            matches = fuzzy_date_searcher.find_candidates(paragraph["text"])
            for match in matches:
                print(match)
                print("page: {}\tDate: {}\tText string: {}\n".format(page_num, match["match_keyword"], match["match_string"]))
                print(paragraph["text"])


### Indexing Paragraphs with Metadata

In [None]:
import datetime
from elasticsearch import Elasticsearch

import republic.parser.republic_paragraph_parser as para_parser
import republic.elastic.republic_elasticsearch as rep_es
from republic.model.republic_phrase_model import category_index
from republic.config.republic_config import base_config, set_config_year


page_index = "republic_hocr_pages"
page_doc_type = "page"

es = Elasticsearch()


year = 1725
data_dir = "../../../Data/Projects/REPUBLIC/hocr/"

year_config = set_config_year(base_config, year, data_dir)
pages_info = get_pages_info(year_config)

#rep_es.delete_es_index(year_config["paragraph_index"])

# start on January first

current_date = {
    "month_day": 1,
    "month_name": "Januarii",
    "month": 1,
    "week_day_name": None,
    "year": year
}
    
# Restrict processing to a range of scans
start_scan = 1
end_scan = 600

for page_id in pages_info:
    if not start_scan <= pages_info[page_id]["scan_num"] <= end_scan:
        continue
    page_doc = rep_es.retrieve_page_doc(es, page_id, year_config)
    #if pages_info[page_id]["page_type"] != "resolution_page":
    if page_doc["page_type"] != "resolution_page":
        continue
    paragraphs, header = para_parser.get_resolution_page_paragraphs(page_doc)
    print("page_id:", page_id, "\ttype:", page_doc["page_type"], "\tnum columns:", len(page_doc["columns"]))
    #print("num columns:", len(page_doc["columns"]), "\theader lines:", [line["line_text"] for line in header])
    for paragraph_order, paragraph in enumerate(paragraphs):
        paragraph_text = para_parser.merge_paragraph_lines(paragraph)
        paragraph["metadata"]["categories"] = set()
        paragraph["text"] = paragraph_text
        paragraph["metadata"]["paragraph_num_on_page"] = paragraph_order
        paragraph["metadata"]["paragraph_id"] = "{}-para-{}".format(page_id, paragraph_order)
        #print(paragraph_text, "\n\n")
        matches = fuzzy_searcher.find_candidates(paragraph_text, include_variants=True)
        if len(matches) == 0 and para_parser.paragraph_starts_with_centered_date(paragraph):
            print("DATE LINE:", paragraph_text)
            current_date = para_parser.extract_meeting_date(paragraph, year, current_date)
            #print("paragraph_text:", paragraph_text)
        if para_parser.matches_participant_list(matches):
            print("DAY START:", paragraph_text)
            #context_match = fuzzy_searcher.get_term_context(paragraph_text, match, context_size=200)
            #print(context_match)
            #person_matches = fuzzy_person_searcher.find_person_names_in_text(context_match["match_term_in_context"])
            if para_parser.paragraph_starts_with_centered_date(paragraph):
                current_date = para_parser.extract_meeting_date(paragraph, year, current_date)
                paragraph["metadata"]["categories"].add("meeting_date")
            paragraph["metadata"]["type"] = "participant_list"
            #print("\n\tCurrent date: {}\n".format(current_date))
            #person_matches = fuzzy_person_searcher.find_person_names_in_context(context_match)
            #for person_match in person_matches:
            #    print("\t", person_match)
        if para_parser.matches_resolution_phrase(matches):
            paragraph["metadata"]["type"] = "resolution"
        paragraph["metadata"]["meeting_date_info"] = current_date
        if current_date:
            paragraph["metadata"]["meeting_date"] = datetime.date(current_date["year"], current_date["month"], current_date["month_day"])
        paragraph["metadata"]["keyword_matches"] = matches
        for match in matches:
            #print("\t{}\t{}".format(match["match_keyword"], match["match_string"]))
            if match["match_keyword"] in category_index:
                category = category_index[match["match_keyword"]]
                match["match_category"] = category
                paragraph["metadata"]["categories"].add(category)
                
        #print(paragraph["metadata"]["categories"])
        #print("\n\n\n")
        paragraph["metadata"]["categories"] = list(paragraph["metadata"]["categories"])
        del paragraph["lines"]
        es.index(index=year_config["paragraph_index"], doc_type=year_config["paragraph_doc_type"], 
                 id=paragraph["metadata"]["paragraph_id"], body=paragraph)

