## Parsing REPUBLIC hOCR Files

The scans of the printed RSG volumes have the following characteristics:

- all scans:
  - have two pages per scan
  - have up to 4 columns per scan, 2 per page 
  - are around 4800 pixels wide: the left page runs up to roughly pixel 2400, the right page starts at roughly pixel 2400
- scans of index pages
  - have no page numbers
- scans of resolution pages
  - have page numbers (left-side page is even, right-side page is odd)
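
As a rough illustration of the layout described above, the page side can be derived from the horizontal offset of a column within the scan. This is a minimal sketch: the function names are hypothetical, the 2400-pixel boundary is only approximate, and the numbering rule (scan n holds pages 2n and 2n + 1) is an assumption that happens to match the page info printed later in this notebook.

```python
# Approximate x-coordinate of the gutter between the two pages (from the notes above).
PAGE_BOUNDARY_X = 2400


def page_side_from_offset(left_offset, boundary=PAGE_BOUNDARY_X):
    """Return 'even' for the left-hand page and 'odd' for the right-hand page."""
    return "even" if left_offset < boundary else "odd"


def page_number_from_scan(scan_num, page_side):
    """Assumed numbering: scan n holds pages 2n (even side) and 2n + 1 (odd side)."""
    return 2 * scan_num + (1 if page_side == "odd" else 0)


side = page_side_from_offset(657)
print(side, page_number_from_scan(6, side))
```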
  
### Columns

The scans are normalized such that the columns are straight. The text width should be around 1000 pixels. Some columns are not cut out properly, resulting in columns that are either too narrow (some of the column text is missing) or too wide (the hOCR output contains partial text from two columns).
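
A simple width check can flag columns that were probably cut out incorrectly. The sketch below is only an illustration: the function name and the 20% tolerance are hypothetical, and a sensible tolerance would have to be tuned on real column widths.

```python
EXPECTED_TEXT_WIDTH = 1000  # pixels, from the description above
TOLERANCE = 0.2             # hypothetical: accept widths within 20% of the expected value


def classify_column_width(width, expected=EXPECTED_TEXT_WIDTH, tolerance=TOLERANCE):
    """Label a column as 'too_narrow', 'too_wide' or 'ok' based on its pixel width."""
    if width < expected * (1 - tolerance):
        return "too_narrow"  # part of the column text is probably missing
    if width > expected * (1 + tolerance):
        return "too_wide"    # probably contains partial text from a neighbouring column
    return "ok"


for width in (700, 1000, 1500):
    print(width, classify_column_width(width))
```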

### Index pages

- start of entry: 
  - starts at the left margin of the column (left-aligned)
- end of entry:
  - the last line may end before the right edge of the text column
  - ends with one or more page numbers
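
The page numbers that close an entry could be picked up with a regular expression. This is a sketch under assumptions: the pattern presumes that the numbers are separated by dots or commas, the function name is hypothetical, and the entry texts in the usage lines are made up for illustration rather than taken from the scans.

```python
import re

# Hypothetical pattern: a run of numbers separated by dots or commas at the very end of the entry.
TRAILING_PAGE_NUMBERS = re.compile(r"(\d+(?:\s*[.,]\s*\d+)*)\s*\.?\s*$")


def extract_page_references(entry_text):
    """Return the page numbers at the end of an index entry, or [] if there are none."""
    match = TRAILING_PAGE_NUMBERS.search(entry_text)
    if match is None:
        return []
    return [int(number) for number in re.findall(r"\d+", match.group(1))]


# Made-up entry texts, for illustration only.
print(extract_page_references("Admiraliteyt, missive. 12. 45."))
print(extract_page_references("Zeelandt, zie Provincie"))
```

OCR errors in digits (for example `l` for `1`) are not handled here and would need a more forgiving pattern.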


### Resolution pages

- header:
  - near the top of the page (less than 350 pixels from the top)
  - page has header with:
    - even numbered pages: date page_number year
    - odd numbered pages: year page_number date
  - columns have half of page header, e.g.:
    - even numbered pages: 
      - first column: date left aligned and part of page_number right aligned
      - second column: part of page_number left aligned and year right aligned
    - odd numbered pages: 
      - first column: year left aligned and part of page_number right aligned
      - second column: part of page_number left aligned and date right aligned
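
The header layout above can be written down as a small lookup that tells a parser which header elements to expect in each column. This is a sketch only: the names `HEADER_FIELDS`, `is_header_line` and `expected_column_header` are hypothetical and do not come from the REPUBLIC parser modules.

```python
HEADER_MAX_TOP = 350  # pixels; header lines sit within this distance from the top of the page

# Field order of the full page header, per page side (from the list above).
HEADER_FIELDS = {
    "even": ("date", "page_number", "year"),
    "odd": ("year", "page_number", "date"),
}


def is_header_line(line_top, max_top=HEADER_MAX_TOP):
    """A line is a header candidate if it starts close enough to the top of the page."""
    return line_top < max_top


def expected_column_header(page_side, column_index):
    """Return (left-aligned element, right-aligned element) for column 0 or 1 of a page."""
    first, middle, last = HEADER_FIELDS[page_side]
    if column_index == 0:
        return (first, middle + "_part")
    return (middle + "_part", last)


print(expected_column_header("even", 0))
print(expected_column_header("odd", 1))
```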
      
### Viewer

- page viewer: https://images.huygens.knaw.nl/assets/argos/index.html
- list of page URLs: https://images.huygens.knaw.nl/api/argos


### National Archive site

- search in the archive: https://www.nationaalarchief.nl/onderzoeken/index/nt00444?searchTerm=
- search the index: https://www.nationaalarchief.nl/onderzoeken/zoekhulpen/voc-opvarenden
- example page: https://www.nationaalarchief.nl/onderzoeken/index/nt00444/d110980c-c864-11e6-9d8b-00505693001d


In [1]:
# This reload library is just used for developing the REPUBLIC hOCR parser 
# and can be removed once this module is stable.
%reload_ext autoreload
%autoreload 2

In [2]:
import json
import os
import re
from collections import defaultdict
from model.republic_hocr_model import make_hocr_page
import parser.republic_page_parser as page_parser
import parser.republic_paragraph_parser as paragraph_parser
import parser.republic_file_parser as file_parser

from elasticsearch import Elasticsearch
import elastic.republic_elasticsearch as rep_es

es = Elasticsearch()


# The hOCR file name contains relevant information for parsing. Here's an example:
# NL-HaNA_1.01.02_3780_0016.jpg-0-251-98--0.40.hocr

# NL-HaNA_1.01.02 is the identifier of the archive
# 3780_0016 identifies the inventory number (3780) and the scan number (0016)
# 0-251-98--0.40 identifies four aspects:
#   1. the number of the column (0)
#   2. the offset from the left (251)
#   3. the offset from the top (98)
#   4. and the slant (-0.40)
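
The naming scheme described in the comments above can be unpacked with a regular expression. The sketch below is illustrative only: `parse_hocr_filename` and its field names are hypothetical and not part of the REPUBLIC parser modules, and reading `3780_0016` as inventory number plus scan number is an assumption based on the page info printed further down in this notebook.

```python
import os
import re

# archive _ inventory _ scan .jpg - column - left offset - top offset - slant .hocr
HOCR_FILENAME_PATTERN = re.compile(
    r"^(?P<archive>.+)_(?P<inventory_num>\d+)_(?P<scan_num>\d+)\.jpg"
    r"-(?P<column>\d+)-(?P<left>\d+)-(?P<top>\d+)-(?P<slant>-?\d+\.\d+)\.hocr$"
)


def parse_hocr_filename(filepath):
    """Split an hOCR column file name into its components."""
    filename = os.path.basename(filepath)
    match = HOCR_FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError("Unexpected hOCR file name: {}".format(filename))
    info = match.groupdict()
    for field in ("inventory_num", "scan_num", "column", "left", "top"):
        info[field] = int(info[field])
    # The slant may be negative, which shows up as a double hyphen in the file name.
    info["slant"] = float(info["slant"])
    return info


print(parse_hocr_filename("NL-HaNA_1.01.02_3780_0016.jpg-0-251-98--0.40.hocr"))
```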



### Reading column scans for a single volume

1. get scan file info
    - scan number, page number, page side, column number, slant, page
2. iterate over pages
    - create hocr_page
    - determine page type: index, resolution, other
    

In [3]:
from model.republic_hocr_model import make_hocr_page
import parser.republic_page_parser as page_parser
import parser.republic_paragraph_parser as paragraph_parser
import parser.republic_file_parser as file_parser
from config.republic_config import base_config, set_config_year

import copy

year = 1725
data_dir = "/Users/marijnkoolen/Data/Projects/REPUBLIC/hocr"


def get_pages_info(config):
    scan_files = file_parser.get_files(config["data_dir"])
    print("Number of scan files:", len(scan_files))
    return file_parser.gather_page_columns(scan_files)

year_config = set_config_year(base_config, year, data_dir)
pages_info = get_pages_info(year_config)



Number of scan files: 2161


## Indexing Page Data in Elasticsearch

Index the resolution volumes at the page level.

Every scan contains two pages. Since index terms reference page numbers, we want to be able to access individual pages for later matching.

### Determining Page Type

We want to parse index pages differently from resolution pages and filter out non-text pages and pages where the columns are not properly identified.

So a first step is to use the page layout and content to distinguish pages containing indices from pages containing resolution summaries. There are also title pages that indicate where a new part starts (e.g. indices, resolutions of the first half of the year, resolutions of the second half of the year).

For examples of title pages, see: https://www.nationaalarchief.nl/onderzoeken/archief/1.01.02/inventaris?inventarisnr=3780&scans-inventarispagina=43&activeTab=gahetnascans#tab-heading

In [2]:
special_pages = {
    154: {
        "scan_num": 77,
        "page_num": 154,
        "type_page_num": 64,
        "special_type": "table",
    },
    155: {
        "scan_num": 77,
        "page_num": 155,
        "type_page_num": 65,
        "special_type": "table",
    },
    156: {
        "scan_num": 78,
        "page_num": 156,
        "type_page_num": 66,
        "special_type": "table",
    },
}

In [3]:
# Inspect the page info of a single scan (scan 6) to see what metadata is available
import json

for page_id in pages_info:
    if pages_info[page_id]["scan_num"] != 6:
        continue
    print(page_id)
    print(json.dumps(pages_info[page_id], indent=2))
    for column_info in pages_info[page_id]["columns"]:
        print(json.dumps(column_info, indent=2))
        column_hocr = page_parser.get_column_hocr(column_info, year_config)
        print(json.dumps(column_hocr, indent=2))



year-1725-scan-6-even
{
  "scan_num": 6,
  "inventory_num": 3780,
  "inventory_year": 1725,
  "inventory_period": [
    "1725-01-01",
    "1725-12-31"
  ],
  "page_id": "year-1725-scan-6-even",
  "page_num": 12,
  "page_side": "even",
  "columns": [
    {
      "scan_num": 6,
      "scan_column": 0,
      "scan_num_column_num": 6.0,
      "inventory_num": 3780,
      "inventory_year": 1725,
      "inventory_period": [
        "1725-01-01",
        "1725-12-31"
      ],
      "page_id": "year-1725-scan-6-even",
      "page_num": 12,
      "page_side": "even",
      "slant": -1.0,
      "column_id": "scan-6-even-0",
      "filepath": "/Users/marijnkoolen/Data/Projects/REPUBLIC/hocr/1725/NL-HaNA_1.01.02_3780_0006.jpg-0-657-131--1.10.hocr"
    },
    {
      "scan_num": 6,
      "scan_column": 1,
      "scan_num_column_num": 6.1,
      "inventory_num": 3780,
      "inventory_year": 1725,
      "inventory_period": [
        "1725-01-01",
        "1725-12-31"
      ],
      "page_id": "year-

In [3]:
from elasticsearch import Elasticsearch
import elastic.republic_elasticsearch as rep_es

es = Elasticsearch()

rep_es.do_page_indexing(es, pages_info, year_config, delete_index=False)


year-1725-scan-1-odd resolution_page 1
year-1725-scan-4-odd resolution_page 1
year-1725-scan-5-odd index_page 1
year-1725-scan-6-even index_page 2
year-1725-scan-6-odd index_page 3
year-1725-scan-7-even index_page 4
year-1725-scan-7-odd index_page 5
year-1725-scan-8-even index_page 6
year-1725-scan-8-odd index_page 7
year-1725-scan-9-even index_page 8
year-1725-scan-9-odd index_page 9
year-1725-scan-10-even index_page 10
year-1725-scan-10-odd index_page 11
year-1725-scan-11-even index_page 12
year-1725-scan-11-odd index_page 13
year-1725-scan-12-even index_page 14
year-1725-scan-12-odd index_page 15
year-1725-scan-13-even index_page 16
year-1725-scan-13-odd index_page 17
year-1725-scan-14-even index_page 18
year-1725-scan-14-odd index_page 19
year-1725-scan-15-even index_page 20
year-1725-scan-15-odd index_page 21
year-1725-scan-16-even index_page 22
year-1725-scan-16-odd index_page 23
year-1725-scan-17-even index_page 24
year-1725-scan-17-odd index_page 25
year-1725-scan-18-even index

year-1725-scan-108-even resolution_page 126
year-1725-scan-108-odd resolution_page 127
year-1725-scan-109-even resolution_page 128
year-1725-scan-109-odd resolution_page 129
year-1725-scan-110-even resolution_page 130
year-1725-scan-110-odd resolution_page 131
year-1725-scan-111-even resolution_page 132
year-1725-scan-111-odd resolution_page 133
year-1725-scan-112-even resolution_page 134
year-1725-scan-112-odd resolution_page 135
year-1725-scan-113-even resolution_page 136
year-1725-scan-113-odd resolution_page 137
year-1725-scan-114-even resolution_page 138
year-1725-scan-114-odd resolution_page 139
year-1725-scan-115-even resolution_page 140
year-1725-scan-115-odd resolution_page 141
year-1725-scan-116-even resolution_page 142
year-1725-scan-116-odd resolution_page 143
year-1725-scan-117-even resolution_page 144
year-1725-scan-117-odd resolution_page 145
year-1725-scan-118-even resolution_page 146
year-1725-scan-118-odd resolution_page 147
year-1725-scan-119-even resolution_page 148

year-1725-scan-203-even resolution_page 316
year-1725-scan-203-odd resolution_page 317
year-1725-scan-204-even resolution_page 318
year-1725-scan-204-odd resolution_page 319
year-1725-scan-205-even resolution_page 320
year-1725-scan-205-odd resolution_page 321
year-1725-scan-206-even resolution_page 322
year-1725-scan-206-odd resolution_page 323
year-1725-scan-207-even resolution_page 324
year-1725-scan-207-odd resolution_page 325
year-1725-scan-208-even resolution_page 326
year-1725-scan-208-odd resolution_page 327
year-1725-scan-209-even resolution_page 328
year-1725-scan-209-odd resolution_page 329
year-1725-scan-210-even resolution_page 330
year-1725-scan-210-odd resolution_page 331
year-1725-scan-211-even resolution_page 332
year-1725-scan-211-odd resolution_page 333
year-1725-scan-212-even resolution_page 334
year-1725-scan-212-odd resolution_page 335
year-1725-scan-213-even resolution_page 336
year-1725-scan-213-odd resolution_page 337
year-1725-scan-214-even resolution_page 338

year-1725-scan-299-even resolution_page 32
year-1725-scan-299-odd resolution_page 33
year-1725-scan-300-even resolution_page 34
year-1725-scan-300-odd resolution_page 35
year-1725-scan-301-even resolution_page 36
year-1725-scan-301-odd resolution_page 37
year-1725-scan-302-even resolution_page 38
year-1725-scan-302-odd resolution_page 39
year-1725-scan-303-even resolution_page 40
year-1725-scan-303-odd resolution_page 41
year-1725-scan-304-even resolution_page 42
year-1725-scan-304-odd resolution_page 43
year-1725-scan-305-even resolution_page 44
year-1725-scan-305-odd resolution_page 45
year-1725-scan-306-even resolution_page 46
year-1725-scan-306-odd resolution_page 47
year-1725-scan-307-even resolution_page 48
year-1725-scan-307-odd resolution_page 49
year-1725-scan-308-even resolution_page 50
year-1725-scan-308-odd resolution_page 51
year-1725-scan-309-even resolution_page 52
year-1725-scan-309-odd resolution_page 53
year-1725-scan-310-even resolution_page 54
year-1725-scan-310-odd

year-1725-scan-393-odd resolution_page 221
year-1725-scan-394-even resolution_page 222
year-1725-scan-394-odd resolution_page 223
year-1725-scan-395-even resolution_page 224
year-1725-scan-395-odd resolution_page 225
year-1725-scan-396-even resolution_page 226
year-1725-scan-396-odd resolution_page 227
year-1725-scan-397-even resolution_page 228
year-1725-scan-397-odd resolution_page 229
year-1725-scan-398-even resolution_page 230
year-1725-scan-398-odd resolution_page 231
year-1725-scan-399-even resolution_page 232
year-1725-scan-399-odd resolution_page 233
year-1725-scan-400-even resolution_page 234
year-1725-scan-400-odd resolution_page 235
year-1725-scan-401-even resolution_page 236
year-1725-scan-401-odd resolution_page 237
year-1725-scan-402-even resolution_page 238
year-1725-scan-402-odd resolution_page 239
year-1725-scan-403-even resolution_page 240
year-1725-scan-403-odd resolution_page 241
year-1725-scan-404-even resolution_page 242
year-1725-scan-404-odd resolution_page 243


year-1725-scan-488-odd resolution_page 411
year-1725-scan-489-even resolution_page 412
year-1725-scan-489-odd resolution_page 413
year-1725-scan-490-even resolution_page 414
year-1725-scan-490-odd resolution_page 415
year-1725-scan-491-even resolution_page 416
year-1725-scan-491-odd resolution_page 417
year-1725-scan-492-even resolution_page 418
year-1725-scan-492-odd index_page 419
year-1725-scan-493-even resolution_page 420
year-1725-scan-493-odd resolution_page 421
year-1725-scan-494-even resolution_page 422
year-1725-scan-494-odd resolution_page 423
year-1725-scan-495-even resolution_page 424
year-1725-scan-495-odd resolution_page 425
year-1725-scan-496-even resolution_page 426
year-1725-scan-496-odd resolution_page 427
year-1725-scan-497-even resolution_page 428
year-1725-scan-497-odd resolution_page 429
year-1725-scan-498-even resolution_page 430
year-1725-scan-498-odd resolution_page 431
year-1725-scan-499-even resolution_page 432
year-1725-scan-499-odd resolution_page 433
year-

### Adjusting Incorrect Page Type Assignments and Numbered Page Numbers

**Problem 1**: For some pages the page type may be incorrectly identified (e.g. an index page identified as a resolution page or vice versa). This mainly happens on pages with little text content or pages where the columns are misidentified. 

**Solution**: Using the title pages as part separators, and knowing that the indices precede the resolution pages, we can identify misclassified pages and correct their labels.
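
The idea can be sketched as a single pass over the pages in scan order. This is a simplified illustration, not the code in `page_checks.correct_page_types`: the function name, the `title_page` label and the rule that the first title page opens the index part while every later one opens a resolution part are assumptions.

```python
def relabel_by_part(pages):
    """Overwrite page types with the type of the part a page falls in.

    `pages` is a list of dicts in scan order, each with a "page_type" key.
    """
    current_part = None  # before the first title page we are in no part at all
    for page in pages:
        if page["page_type"] == "title_page":
            # Parts come in a fixed order: indices first, then resolutions.
            current_part = "index_page" if current_part is None else "resolution_page"
            continue
        if page["page_type"] != current_part:
            page["page_type"] = current_part
    return pages


types = ["resolution_page", "title_page", "resolution_page", "index_page",
         "title_page", "index_page", "resolution_page"]
pages = [{"page_type": page_type} for page_type in types]
print([page["page_type"] for page in relabel_by_part(pages)])
```

A real implementation would likely leave some page types alone, for example empty or non-text pages, instead of relabeling everything inside a part.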

**Problem 2**: Some pages are duplicates of the preceding scan. When the page-turning mechanism fails, subsequent scans are images of the same two pages. Duplicate pages should therefore come in pairs: the even and odd sides of scan $n$ are duplicates of the even and odd sides of scan $n-1$. Shingling or straightforward text tiling won't work because of OCR variation: many words may be recognized slightly differently, and lines and words may not align.

**Solution**: Compare each pair of even+odd pages against the preceding pair of even+odd pages, using Levenshtein distance. This deals with slight character-level variations due to OCR. Most pairs will be very dissimilar. Use a heuristic threshold to determine whether pages are duplicates.
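
A minimal sketch of this comparison, using a plain dynamic-programming Levenshtein distance normalized by the length of the longer text. The function names and the 0.8 threshold are hypothetical; the actual check in `page_checks.detect_duplicate_scans` may differ.

```python
def levenshtein_distance(a, b):
    """Number of single-character insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def text_similarity(text1, text2):
    """Similarity in [0, 1]: 1.0 for identical texts, 0.0 for completely different ones."""
    longest = max(len(text1), len(text2))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_distance(text1, text2) / longest


def is_duplicate_pair(curr_even, curr_odd, prev_even, prev_odd, threshold=0.8):
    """Both sides of the current scan must closely match both sides of the previous scan."""
    return (text_similarity(curr_even, prev_even) >= threshold
            and text_similarity(curr_odd, prev_odd) >= threshold)


print(text_similarity("Veneris den 5. Januarii", "Veucris den 5. Januaris"))
```

Full-page texts run to thousands of characters, so the quadratic cost per comparison adds up; since only adjacent scans are compared, the total number of comparisons stays linear in the number of scans.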

**Problem 3**: Page numbers of numbered pages are reset per part, starting from page 1, but the title page separating the first and second halves of the year should not reset the page numbering. 

**Solution**: Iterate over the pages, using a flag to keep track of whether we're in the indices part or a resolution part. If the title page is within the resolution part, update the page numbers by incrementing from the previous page.
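
The renumbering can be sketched as follows. This is a simplified illustration rather than the code in `page_checks.correct_page_numbers`: the function name is hypothetical, every page before the first resolution page is treated as belonging to the index part, and duplicate scans are ignored.

```python
def renumber_pages(pages):
    """Assign continuous page numbers within the index part and within the resolution part.

    `pages` is a list of dicts in scan order, each with a "page_type" key.
    """
    in_resolution_part = False
    next_number = 1
    for page in pages:
        if page["page_type"] == "resolution_page" and not in_resolution_part:
            # First resolution page: switch parts and restart the numbering once.
            in_resolution_part = True
            next_number = 1
        # A title page inside the resolution part does not reset the counter;
        # it simply continues from the previous page.
        page["type_page_num"] = next_number
        next_number += 1
    return pages


types = ["index_page", "index_page", "title_page",
         "resolution_page", "resolution_page", "title_page", "resolution_page"]
pages = [{"page_type": page_type} for page_type in types]
print([page["type_page_num"] for page in renumber_pages(pages)])
```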


In [5]:
import elastic.republic_page_checks as page_checks

page_checks.correct_page_types(es, year_config)




correcting: year-1725-scan-1-odd from type resolution_page to type None
Switching to part index_page
correcting: year-1725-scan-4-odd from type resolution_page to type index_page
Switching to part resolution_page
Switching to part resolution_page
correcting: year-1725-scan-332-odd from type index_page to type resolution_page
correcting: year-1725-scan-333-even from type unknown_page_type to type resolution_page
correcting: year-1725-scan-333-odd from type unknown_page_type to type resolution_page
correcting: year-1725-scan-334-even from type index_page to type resolution_page
correcting: year-1725-scan-334-odd from type index_page to type resolution_page
correcting: year-1725-scan-335-even from type index_page to type resolution_page
correcting: year-1725-scan-335-odd from type unknown_page_type to type resolution_page
correcting: year-1725-scan-336-even from type unknown_page_type to type resolution_page
correcting: year-1725-scan-336-odd from type unknown_page_type to type resolution

In [8]:

page_checks.detect_duplicate_scans(es, year_config)


Page year-1725-scan-139-even is duplicate of page year-1725-scan-138-even
Page year-1725-scan-139-odd is duplicate of page year-1725-scan-138-odd
Page year-1725-scan-279-even is duplicate of page year-1725-scan-278-even
Page year-1725-scan-279-odd is duplicate of page year-1725-scan-278-odd
Page year-1725-scan-286-even is duplicate of page year-1725-scan-285-even
Page year-1725-scan-286-odd is duplicate of page year-1725-scan-285-odd
Page year-1725-scan-492-odd is duplicate of page year-1725-scan-491-odd

Done!


In [10]:

page_checks.correct_page_numbers(es, year_config)



CORRECTING FOR DUPLICATE SCAN: year-1725-scan-139-even 188 186
CORRECTING FOR DUPLICATE SCAN: year-1725-scan-139-odd 189 187
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-140-even FROM 190 TO 188:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-140-odd FROM 191 TO 189:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-141-even FROM 192 TO 190:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-141-odd FROM 193 TO 191:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-142-even FROM 194 TO 192:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-142-odd FROM 195 TO 193:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-143-even FROM 196 TO 194:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-143-odd FROM 197 TO 195:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-144-even FROM 198 TO 196:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-144-odd FROM 199 TO 197:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-145-even FROM 200 TO 198:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-145-odd FROM 201 TO 199:
CORRECTING PAGE N

CORRECTING PAGE NUMBER OF PAGE year-1725-scan-199-even FROM 308 TO 306:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-199-odd FROM 309 TO 307:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-200-even FROM 310 TO 308:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-200-odd FROM 311 TO 309:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-201-even FROM 312 TO 310:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-201-odd FROM 313 TO 311:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-202-even FROM 314 TO 312:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-202-odd FROM 315 TO 313:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-203-even FROM 316 TO 314:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-203-odd FROM 317 TO 315:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-204-even FROM 318 TO 316:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-204-odd FROM 319 TO 317:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-205-even FROM 320 TO 318:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-205-odd FROM 321 TO 319:

CORRECTING PAGE NUMBER OF PAGE year-1725-scan-257-odd FROM 425 TO 423:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-258-even FROM 426 TO 424:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-258-odd FROM 427 TO 425:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-259-even FROM 428 TO 426:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-259-odd FROM 429 TO 427:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-260-even FROM 430 TO 428:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-260-odd FROM 431 TO 429:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-261-even FROM 432 TO 430:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-261-odd FROM 433 TO 431:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-262-even FROM 434 TO 432:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-262-odd FROM 435 TO 433:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-263-even FROM 436 TO 434:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-263-odd FROM 437 TO 435:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-264-even FROM 438 TO 436:

CORRECTING PAGE NUMBER OF PAGE year-1725-scan-317-even FROM 68 TO 536:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-317-odd FROM 69 TO 537:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-318-even FROM 70 TO 538:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-318-odd FROM 71 TO 539:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-319-even FROM 72 TO 540:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-319-odd FROM 73 TO 541:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-320-even FROM 74 TO 542:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-320-odd FROM 75 TO 543:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-321-even FROM 76 TO 544:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-321-odd FROM 77 TO 545:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-322-even FROM 78 TO 546:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-322-odd FROM 79 TO 547:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-323-even FROM 80 TO 548:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-323-odd FROM 81 TO 549:
CORRECTING PA

CORRECTING PAGE NUMBER OF PAGE year-1725-scan-375-odd FROM 185 TO 653:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-376-even FROM 186 TO 654:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-376-odd FROM 187 TO 655:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-377-even FROM 188 TO 656:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-377-odd FROM 189 TO 657:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-378-even FROM 190 TO 658:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-378-odd FROM 191 TO 659:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-379-even FROM 192 TO 660:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-379-odd FROM 193 TO 661:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-380-even FROM 194 TO 662:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-380-odd FROM 195 TO 663:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-381-even FROM 196 TO 664:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-381-odd FROM 197 TO 665:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-382-even FROM 198 TO 666:

CORRECTING PAGE NUMBER OF PAGE year-1725-scan-433-odd FROM 301 TO 769:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-434-even FROM 302 TO 770:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-434-odd FROM 303 TO 771:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-435-even FROM 304 TO 772:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-435-odd FROM 305 TO 773:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-436-even FROM 306 TO 774:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-436-odd FROM 307 TO 775:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-437-even FROM 308 TO 776:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-437-odd FROM 309 TO 777:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-438-even FROM 310 TO 778:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-438-odd FROM 311 TO 779:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-439-even FROM 312 TO 780:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-439-odd FROM 313 TO 781:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-440-even FROM 314 TO 782:

CORRECTING FOR DUPLICATE SCAN: year-1725-scan-492-odd 419 885
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-493-even FROM 420 TO 886:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-493-odd FROM 421 TO 887:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-494-even FROM 422 TO 888:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-494-odd FROM 423 TO 889:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-495-even FROM 424 TO 890:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-495-odd FROM 425 TO 891:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-496-even FROM 426 TO 892:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-496-odd FROM 427 TO 893:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-497-even FROM 428 TO 894:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-497-odd FROM 429 TO 895:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-498-even FROM 430 TO 896:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-498-odd FROM 431 TO 897:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-499-even FROM 432 TO 898:
CORRECTI

### Extracting Resolutions From Pages

Identify:

- resolution dates
- resolution participant lists
- resolution text blocks

In [5]:
from fuzzy.fuzzy_context_searcher import FuzzyContextSearcher
from fuzzy.fuzzy_person_name_searcher import FuzzyPersonNameSearcher
from model.republic_phrase_model import resolution_phrases, participant_list_phrases, spelling_variants

fuzzysearch_config = {
    "char_match_threshold": 0.8,
    "ngram_threshold": 0.6,
    "levenshtein_threshold": 0.8,
    "ignorecase": False,
    "ngram_size": 2,
    "skip_size": 2,
}


fuzzy_searcher = FuzzyContextSearcher(fuzzysearch_config)
fuzzy_person_searcher = FuzzyPersonNameSearcher(fuzzysearch_config)

fuzzy_searcher.index_keywords(resolution_phrases)
fuzzy_searcher.index_spelling_variants(spelling_variants)
#fuzzy_searcher.index_distractor_terms(distractor_terms)



In [14]:
from fuzzy.fuzzy_context_searcher import FuzzyContextSearcher
import elastic.republic_elasticsearch as rep_es

es = Elasticsearch()

fuzzysearch_config = {
    "char_match_threshold": 0.8,
    "ngram_threshold": 0.6,
    "levenshtein_threshold": 0.8,
    "ignorecase": False,
    "ngram_size": 2,
    "skip_size": 2,
    "paragraph_index": "republic_paragraphs",
    "paragraph_doc_type": "paragraph"
}

missing_dates = [
    {"date_string": "Veneris den 5. Januarii 1725.", "page_start": 11, "page_end": 14},
    {"date_string": "Mercuri den 10. Januarii 1725.", "page_start": 21, "page_end": 28},
]

fuzzy_date_searcher = FuzzyContextSearcher(fuzzysearch_config)

for missing_date in missing_dates:
    fuzzy_date_searcher.index_keywords([missing_date["date_string"]])
    for page_num in range(missing_date["page_start"], missing_date["page_end"] + 1):
        paragraphs = rep_es.retrieve_paragraph_by_type_page_number(es, page_num, year_config)
        for paragraph in paragraphs:
            matches = fuzzy_date_searcher.find_candidates(paragraph["text"])
            for match in matches:
                print(match)
                print("page: {}\tDate: {}\tText string: {}\n".format(page_num, match["match_keyword"], match["match_string"]))
                print(paragraph["text"])


{'match_keyword': 'Veneris den 5. Januarii 1725.', 'match_term': 'Veneris den 5. Januarii 1725.', 'match_string': 'Veucris den 5. Januaris 1725', 'match_offset': 9, 'char_match': 0.8620689655172413, 'ngram_match': 0.7666666666666667, 'levenshtein_distance': 0.8620689655172413}
page: 13	Date: Veneris den 5. Januarii 1725.	Text string: Veucris den 5. Januaris 1725

1735. ie Veucris den 5. Januaris 1725. PR&ASIDE, Den Heere Bentinck. PRASENTIEBUS, De Heeren Jan Welderen , van Dam, Torck , met een extraordinaris Gedeputeerde uyt de Provincie van Gelderlandt. Van Maasdam , vanden Boeizelaar , Raadtpenfionaris van Hoornbeeck , met een extraordinaris Gedeputeerde uyt de Provincie van Hollandt ende Welt-Vrieslandt. Velters, Ockere , Noey; van Hoorn , met een extraordinaris Gedeputeerde uyt de Provincie van Zeelandt. Van Renswoude , van Voor{t. Van Schwartzenbergh, vander Waayen, Vegilin Van I{elmuden. Van Iddekinge ‚ van Tamminga. 


### Indexing Paragraphs with Metadata

In [15]:
import datetime
from elasticsearch import Elasticsearch

import parser.republic_paragraph_parser as para_parser
from model.republic_phrase_model import category_index
import elastic.republic_elasticsearch as rep_es
from config.republic_config import base_config, set_config_year


page_index = "republic_hocr_pages"
page_doc_type = "page"

es = Elasticsearch()


year = 1725
data_dir = "../../../Data/Projects/REPUBLIC/hocr/"

year_config = set_config_year(base_config, year, data_dir)
pages_info = get_pages_info(year_config)

#rep_es.delete_es_index(year_config["paragraph_index"])

# start on January first

current_date = {
    "month_day": 1,
    "month_name": "Januarii",
    "month": 1,
    "week_day_name": None,
    "year": year
}
    
start_scan = 1
end_scan = 600

for page_id in pages_info:
    # Only process pages from scans within the selected range
    if not start_scan <= pages_info[page_id]["scan_num"] <= end_scan:
        continue
    page_doc = rep_es.retrieve_page_doc(es, page_id, year_config)
    #if pages_info[page_id]["page_type"] != "resolution_page":
    if page_doc["page_type"] != "resolution_page":
        continue
    paragraphs, header = para_parser.get_resolution_page_paragraphs(page_doc)
    print("page_id:", page_id, "\ttype:", page_doc["page_type"], "\tnum columns:", len(page_doc["columns"]))
    #print("num columns:", len(page_doc["columns"]), "\theader lines:", [line["line_text"] for line in header])
    for paragraph_order, paragraph in enumerate(paragraphs):
        paragraph_text = para_parser.merge_paragraph_lines(paragraph)
        paragraph["metadata"]["categories"] = set()
        paragraph["text"] = paragraph_text
        paragraph["metadata"]["paragraph_num_on_page"] = paragraph_order
        paragraph["metadata"]["paragraph_id"] = "{}-para-{}".format(page_id, paragraph_order)
        #print(paragraph_text, "\n\n")
        matches = fuzzy_searcher.find_candidates(paragraph_text, include_variants=True)
        if len(matches) == 0 and para_parser.paragraph_starts_with_centered_date(paragraph):
            print("DATE LINE:", paragraph_text)
            current_date = para_parser.extract_meeting_date(paragraph, year, current_date)
            #print("paragraph_text:", paragraph_text)
        if para_parser.matches_participant_list(matches):
            print("DAY START:", paragraph_text)
            #context_match = fuzzy_searcher.get_term_context(paragraph_text, match, context_size=200)
            #print(context_match)
            #person_matches = fuzzy_person_searcher.find_person_names_in_text(context_match["match_term_in_context"])
            if para_parser.paragraph_starts_with_centered_date(paragraph):
                current_date = para_parser.extract_meeting_date(paragraph, year, current_date)
                paragraph["metadata"]["categories"].add("meeting_date")
            paragraph["metadata"]["type"] = "participant_list"
            #print("\n\tCurrent date: {}\n".format(current_date))
            #person_matches = fuzzy_person_searcher.find_person_names_in_context(context_match)
            #for person_match in person_matches:
            #    print("\t", person_match)
        if para_parser.matches_resolution_phrase(matches):
            paragraph["metadata"]["type"] = "resolution"
        paragraph["metadata"]["meeting_date_info"] = current_date
        if current_date:
            paragraph["metadata"]["meeting_date"] = datetime.date(current_date["year"], current_date["month"], current_date["month_day"])
        paragraph["metadata"]["keyword_matches"] = matches
        for match in matches:
            #print("\t{}\t{}".format(match["match_keyword"], match["match_string"]))
            if match["match_keyword"] in category_index:
                category = category_index[match["match_keyword"]]
                match["match_category"] = category
                paragraph["metadata"]["categories"].add(category)
                
        #print(paragraph["metadata"]["categories"])
        #print("\n\n\n")
        paragraph["metadata"]["categories"] = list(paragraph["metadata"]["categories"])
        del paragraph["lines"]
        es.index(index=year_config["paragraph_index"], doc_type=year_config["paragraph_doc_type"], 
                 id=paragraph["metadata"]["paragraph_id"], body=paragraph)



Number of scan files: 2161
page_id: year-1725-scan-45-odd 	type: resolution_page 	num columns: 2
DAY START: Martis den 2. Jannarii 1725. PRASIDE, Den Heere Bentinck. PRASENTIBUS, De Heeren Van Dam , Torck, Ham, met een extraordinarts Gedeputeerde uyt de Provincie van Gelderlandt. Van Maasdam, vanden Boetzelaar , Boon, Raadtpenfionaris van Hoornbeeck , met een extraordinaris Gedeputeerde uyt de Provincie van Hollandt ende Weû-Fries= landt. Velters, Ockerfe 5 Noey , van Hoorn, met een extraordinaris Gedeputeerde uyt de Provincie van Zeelandt. Van Renswoude , van Voork. Van Schwartzenbergh 5 vander Waayen , Vegilin. Van Ifelmuden, Van Iddekinge, van Tamminga. 
page_id: year-1725-scan-46-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-46-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-47-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-47-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-48-even 	type: resolution_pa

KeyboardInterrupt: 