## Parsing REPUBLIC hOCR Files

The scans of the printed RSG volumes have the following characteristics:

- all scans:
  - have two pages per scan
  - have up to 4 columns per scan, 2 per page
  - are around 4800 pixels wide; the left page runs up to roughly pixel 2400 and the right page starts from there
- scans of index pages
  - have no page numbers
- scans of resolution pages
  - have page numbers (left-side page is even, right-side page is odd)
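
A minimal sketch of how a column's horizontal offset could be mapped to a page side, assuming the page boundary lies at roughly pixel 2400 as listed above. The function name `page_side` and the `page_split` parameter are illustrative and not part of the REPUBLIC modules.

```python
def page_side(left_offset: int, page_split: int = 2400) -> str:
    """Return 'even' for the left-hand page and 'odd' for the right-hand page.

    left_offset is the column's distance in pixels from the left edge of the
    full scan; page_split is the approximate boundary between the two pages
    (an assumed default based on the rough figures above).
    """
    return "even" if left_offset < page_split else "odd"


print(page_side(251))   # column on the left-hand page
print(page_side(2650))  # column on the right-hand page
```

Because the boundary is only approximate, columns whose offset falls close to `page_split` would need a manual check.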
  
### Columns

The scans are normalized so that the columns are straight. The text width should be around 1000 pixels. Some columns are not cut out properly, resulting in columns that are either too small (some of the column text is missing) or too wide (the hOCR output contains partial text from two columns).
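
The width check described here could look roughly like the sketch below. The 20% tolerance and the name `column_width_status` are assumptions made for illustration; the actual parser may use different limits.

```python
def column_width_status(width: int, expected: int = 1000, tolerance: float = 0.2) -> str:
    """Classify a column by comparing its pixel width to the expected text width.

    tolerance is a hypothetical fraction of the expected width that is still
    accepted as a properly cut column.
    """
    if width < expected * (1 - tolerance):
        return "too_narrow"  # some of the column text is probably missing
    if width > expected * (1 + tolerance):
        return "too_wide"    # probably contains text from two columns
    return "ok"


for width in (700, 1000, 1500):
    print(width, column_width_status(width))
```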

### Index pages

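- start of entry:
  - the first line starts at the left edge of the column (continuation lines are indented)
- end of entry:
  - the last line may end before the right edge of the text column
  - the entry closes with one or more page numbers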
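
The two cues above (left alignment and trailing page numbers) can be sketched as follows. This is an illustrative approximation, not the logic of `republic_index_page_parser`: the helper names, the `indent_margin` of 20 pixels and the rule of stopping at the first non-numeric token are all assumptions.

```python
import re

# A page reference is a number optionally followed by OCR punctuation.
PAGE_REF = re.compile(r"(\d+)[.:,]?")


def extract_page_refs(line_text: str) -> list:
    """Collect the page numbers at the end of a line, reading right to left."""
    refs = []
    for token in reversed(line_text.split()):
        match = PAGE_REF.fullmatch(token)
        if match is None:
            break  # stop at the first token that is not a page number
        refs.append(int(match.group(1)))
    return refs[::-1]


def classify_line(left: int, avg_left: float, line_text: str,
                  indent_margin: int = 20) -> str:
    """Label a line as start/continue, with a _stop suffix if it ends in page numbers."""
    position = "start" if left < avg_left - indent_margin else "continue"
    return position + "_stop" if extract_page_refs(line_text) else position


print(extract_page_refs("tugaal.   295. 764."))
print(classify_line(152, 131, "936."))
```

A stricter or more forgiving treatment of mis-recognized numbers (e.g. `z09.`) is a design choice; this sketch simply stops at them.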


### Resolution pages

- header:
  - near the top of the page (less than 350 pixels from the top)
  - page has header with:
    - even numbered pages: date page_number year
    - odd numbered pages: year page_number date
  - each column contains half of the page header:
    - even numbered pages: 
      - first column: date left aligned and part of page_number right aligned
      - second column: part of page_number left aligned and year right aligned
    - odd numbered pages: 
      - first column: year left aligned and part of page_number right aligned
      - second column: part of page_number left aligned and date right aligned
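
The header layout above can be written down as a small lookup, which is handy when validating what the OCR found in a column header. The function names are hypothetical; only the 350-pixel limit and the field order come from the description above.

```python
def expected_header_order(page_num: int) -> tuple:
    """Field order of the full page header, depending on page number parity."""
    if page_num % 2 == 0:
        return ("date", "page_number", "year")
    return ("year", "page_number", "date")


def expected_column_header(page_num: int, column_index: int) -> tuple:
    """Fields expected in the header of the first (0) or second (1) column."""
    first, middle, last = expected_header_order(page_num)
    return (first, middle) if column_index == 0 else (middle, last)


def is_header_line(top: int, max_header_top: int = 350) -> bool:
    """A line counts as a header candidate if it is close to the top of the page."""
    return top < max_header_top


print(expected_column_header(128, 0))
print(expected_column_header(129, 1))
```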
      
### Viewer

- page viewer: https://images.huygens.knaw.nl/assets/argos/index.html
- list of page URLs: https://images.huygens.knaw.nl/api/argos


### National Archive site

- search in the archive: https://www.nationaalarchief.nl/onderzoeken/index/nt00444?searchTerm=
- search the index: https://www.nationaalarchief.nl/onderzoeken/zoekhulpen/voc-opvarenden
- example page: https://www.nationaalarchief.nl/onderzoeken/index/nt00444/d110980c-c864-11e6-9d8b-00505693001d


In [19]:
# This reload library is just used for developing the REPUBLIC hOCR parser 
# and can be removed once this module is stable.
%reload_ext autoreload
%autoreload 2

In [20]:
import json
import os
import re
from collections import defaultdict
from parse_hocr_files import make_hocr_page
#from parse_republic_hocr_files import get_files, get_page_types, count_page_ref_lines, get_index_entry_lines, gather_page_columns
#from elasticsearch import Elasticsearch
import republic_page_parser as page_parser
import republic_paragraph_parser as paragraph_parser
import republic_file_parser as file_parser

# The hOCR file name contains relevant information for parsing. Here's an example:
# NL-HaNA_1.01.02_3780_0016.jpg-0-251-98--0.40.hocr

# NL-HaNA_1.01.02 is the name of the archive
# 3780_0016 identifies the specific scan: inventory number 3780, scan 0016
# 0-251-98--0.40 identifies four aspects:
#   1. the number of the column (0)
#   2. the offset from the left (251)
#   3. the offset from the top (98)
#   4. and the slant (-0.40)
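
The file name structure described in the comments above can be unpacked with a regular expression. The sketch below is an assumption about how this might be done and is not the code in `republic_file_parser`; in particular, the group names `inventory_num` and `scan_num` are my reading of the `3780_0016` part.

```python
import re

FILENAME_PATTERN = re.compile(
    r"^(?P<archive>NL-HaNA_[\d.]+)_(?P<inventory_num>\d+)_(?P<scan_num>\d+)\.jpg"
    r"-(?P<column_num>\d+)-(?P<left>\d+)-(?P<top>\d+)-(?P<slant>-?\d+\.\d+)\.hocr$"
)


def parse_hocr_filename(filename: str) -> dict:
    """Split an hOCR column file name into its components."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError("unexpected hOCR file name: {}".format(filename))
    info = match.groupdict()
    # Numeric parts are converted so they can be compared and sorted.
    for key in ("inventory_num", "scan_num", "column_num", "left", "top"):
        info[key] = int(info[key])
    info["slant"] = float(info["slant"])
    return info


print(parse_hocr_filename("NL-HaNA_1.01.02_3780_0016.jpg-0-251-98--0.40.hocr"))
```

Note the double hyphen before the slant: the first hyphen is the separator, the second is the minus sign of a negative slant.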



### Reading column scans for a single volume

1. get scan file info
    - scan number, page number, page side, column number, slant, page
2. iterate over pages
    - create hocr_page
    - determine page type: index, resolution, other
    

In [2]:
from parse_hocr_files import make_hocr_page
#from parse_republic_hocr_files import get_files, get_page_types, count_page_ref_lines, get_index_entry_lines, gather_page_columns
#from elasticsearch import Elasticsearch
import republic_page_parser as page_parser
import republic_paragraph_parser as paragraph_parser
import republic_file_parser as file_parser

import copy

year = 1725
base_config = {
    "year": year,
    "base_dir": "../../../Data/Projects/REPUBLIC/hocr/",
    "page_index": "republic_hocr_pages",
    "page_doc_type": "page",
    "tiny_word_width": 15, # pixel width
    "avg_char_width": 20,
    "remove_tiny_words": True,
    "remove_line_numbers": False,
}


def get_pages_info(config):
    scan_files = file_parser.get_files(config["data_dir"])
    print("Number of scan files:", len(scan_files))
    return file_parser.gather_page_columns(scan_files)

def set_config_year(base_config, year):
    config = copy.deepcopy(base_config)
    config["year"] = year
    config["data_dir"] = config["base_dir"] + "{}/".format(year)
    return config

year_config = set_config_year(base_config, year)
pages_info = get_pages_info(year_config)



Number of scan files: 2161


## Indexing Page Data in Elasticsearch

Index the resolution volumes at the page level.

Every scan contains two pages. Since index terms reference page numbers, we want to be able to access individual pages for later matching.

### Determining Page Type

We want to parse index pages differently from resolution pages and filter out non-text pages and pages where the columns are not properly identified.

So a first step is to use the page layout and content to distinguish pages containing indices from pages containing resolution summaries. There are also title pages that indicate where a new part starts (e.g. indices, resolutions of the first half of the year, resolutions of the second half of the year).

For examples of title pages, see: https://www.nationaalarchief.nl/onderzoeken/archief/1.01.02/inventaris?inventarisnr=3780&scans-inventarispagina=43&activeTab=gahetnascans#tab-heading

In [4]:
special_pages = {
    154: {
        "scan_num": 77,
        "page_num": 154,
        "type_page_num": 64,
        "special_type": "table",
    },
    155: {
        "scan_num": 77,
        "page_num": 155,
        "type_page_num": 65,
        "special_type": "table",
    },
    156: {
        "scan_num": 78,
        "page_num": 156,
        "type_page_num": 66,
        "special_type": "table",
    },
}

In [19]:
for page_id in pages_info:
    if pages_info[page_id]["scan_num"] != 6:
        continue
    print(page_id)
    for column_info in pages_info[page_id]["columns"]:
        column_hocr = page_parser.get_column_hocr(column_info, year_config)
    print(pages_info[page_id])

year-1725-scan-6-even

tiny word: {'bbox': [187, 503, 189, 505], 'width': 2, 'height': 2, 'left': 187, 'right': 189, 'top': 503, 'bottom': 505, 'word_text': '.', 'word_conf': 30} 


tiny word: {'bbox': [0, 3658, 15, 3659], 'width': 15, 'height': 1, 'left': 0, 'right': 15, 'top': 3658, 'bottom': 3659, 'word_text': '5', 'word_conf': 22} 


tiny word: {'bbox': [619, 617, 622, 621], 'width': 3, 'height': 4, 'left': 619, 'right': 622, 'top': 617, 'bottom': 621, 'word_text': 'ì', 'word_conf': 22} 


tiny word: {'bbox': [901, 619, 904, 622], 'width': 3, 'height': 3, 'left': 901, 'right': 904, 'top': 619, 'bottom': 622, 'word_text': ':', 'word_conf': 0} 


tiny word: {'bbox': [449, 1182, 451, 1185], 'width': 2, 'height': 3, 'left': 449, 'right': 451, 'top': 1182, 'bottom': 1185, 'word_text': '|', 'word_conf': 39} 


tiny word: {'bbox': [859, 1486, 862, 1492], 'width': 3, 'height': 6, 'left': 859, 'right': 862, 'top': 1486, 'bottom': 1492, 'word_text': ':', 'word_conf': 26} 

{'scan_num': 6, 'i

In [14]:
import republic_page_parser as page_parser

page_parser.do_page_indexing(pages_info, year_config, delete_index=True)


exists, deleting
year-1725-scan-1-odd bad_page 1
year-1725-scan-4-odd bad_page 1
year-1725-scan-5-odd index_page 1
year-1725-scan-6-even bad_page 2
year-1725-scan-6-odd index_page 3
year-1725-scan-7-even index_page 4
year-1725-scan-7-odd index_page 5
year-1725-scan-8-even index_page 6
year-1725-scan-8-odd index_page 7
year-1725-scan-9-even bad_page 8
year-1725-scan-9-odd index_page 9
year-1725-scan-10-even index_page 10
year-1725-scan-10-odd index_page 11
year-1725-scan-11-even bad_page 12
year-1725-scan-11-odd bad_page 13
year-1725-scan-12-even index_page 14
year-1725-scan-12-odd index_page 15
year-1725-scan-13-even index_page 16
year-1725-scan-13-odd index_page 17
year-1725-scan-14-even index_page 18
year-1725-scan-14-odd bad_page 19
year-1725-scan-15-even index_page 20
year-1725-scan-15-odd bad_page 21
year-1725-scan-16-even index_page 22
year-1725-scan-16-odd index_page 23
year-1725-scan-17-even index_page 24
year-1725-scan-17-odd index_page 25
year-1725-scan-18-even index_page 26


year-1725-scan-109-even resolution_page 128
year-1725-scan-109-odd resolution_page 129
year-1725-scan-110-even resolution_page 130
year-1725-scan-110-odd resolution_page 131
year-1725-scan-111-even resolution_page 132
year-1725-scan-111-odd resolution_page 133
year-1725-scan-112-even resolution_page 134
year-1725-scan-112-odd resolution_page 135
year-1725-scan-113-even resolution_page 136
year-1725-scan-113-odd resolution_page 137
year-1725-scan-114-even resolution_page 138
year-1725-scan-114-odd resolution_page 139
year-1725-scan-115-even resolution_page 140
year-1725-scan-115-odd resolution_page 141
year-1725-scan-116-even resolution_page 142
year-1725-scan-116-odd resolution_page 143
year-1725-scan-117-even resolution_page 144
year-1725-scan-117-odd resolution_page 145
year-1725-scan-118-even resolution_page 146
year-1725-scan-118-odd resolution_page 147
year-1725-scan-119-even resolution_page 148
year-1725-scan-119-odd resolution_page 149
year-1725-scan-120-even resolution_page 150

year-1725-scan-204-even resolution_page 318
year-1725-scan-204-odd resolution_page 319
year-1725-scan-205-even resolution_page 320
year-1725-scan-205-odd resolution_page 321
year-1725-scan-206-even resolution_page 322
year-1725-scan-206-odd resolution_page 323
year-1725-scan-207-even resolution_page 324
year-1725-scan-207-odd resolution_page 325
year-1725-scan-208-even resolution_page 326
year-1725-scan-208-odd resolution_page 327
year-1725-scan-209-even resolution_page 328
year-1725-scan-209-odd resolution_page 329
year-1725-scan-210-even resolution_page 330
year-1725-scan-210-odd resolution_page 331
year-1725-scan-211-even resolution_page 332
year-1725-scan-211-odd resolution_page 333
year-1725-scan-212-even resolution_page 334
year-1725-scan-212-odd resolution_page 335
year-1725-scan-213-even resolution_page 336
year-1725-scan-213-odd resolution_page 337
year-1725-scan-214-even resolution_page 338
year-1725-scan-214-odd resolution_page 339
year-1725-scan-215-even resolution_page 340

year-1725-scan-300-even resolution_page 34
year-1725-scan-300-odd resolution_page 35
year-1725-scan-301-even resolution_page 36
year-1725-scan-301-odd resolution_page 37
year-1725-scan-302-even resolution_page 38
year-1725-scan-302-odd resolution_page 39
year-1725-scan-303-even resolution_page 40
year-1725-scan-303-odd resolution_page 41
year-1725-scan-304-even resolution_page 42
year-1725-scan-304-odd resolution_page 43
year-1725-scan-305-even resolution_page 44
year-1725-scan-305-odd resolution_page 45
year-1725-scan-306-even resolution_page 46
year-1725-scan-306-odd resolution_page 47
year-1725-scan-307-even resolution_page 48
year-1725-scan-307-odd resolution_page 49
year-1725-scan-308-even resolution_page 50
year-1725-scan-308-odd resolution_page 51
year-1725-scan-309-even resolution_page 52
year-1725-scan-309-odd resolution_page 53
year-1725-scan-310-even resolution_page 54
year-1725-scan-310-odd resolution_page 55
year-1725-scan-311-even resolution_page 56
year-1725-scan-311-odd

year-1725-scan-395-even resolution_page 224
year-1725-scan-395-odd resolution_page 225
year-1725-scan-396-even resolution_page 226
year-1725-scan-396-odd resolution_page 227
year-1725-scan-397-even resolution_page 228
year-1725-scan-397-odd resolution_page 229
year-1725-scan-398-even resolution_page 230
year-1725-scan-398-odd resolution_page 231
year-1725-scan-399-even resolution_page 232
year-1725-scan-399-odd resolution_page 233
year-1725-scan-400-even resolution_page 234
year-1725-scan-400-odd resolution_page 235
year-1725-scan-401-even resolution_page 236
year-1725-scan-401-odd resolution_page 237
year-1725-scan-402-even resolution_page 238
year-1725-scan-402-odd resolution_page 239
year-1725-scan-403-even resolution_page 240
year-1725-scan-403-odd resolution_page 241
year-1725-scan-404-even resolution_page 242
year-1725-scan-404-odd resolution_page 243
year-1725-scan-405-even resolution_page 244
year-1725-scan-405-odd resolution_page 245
year-1725-scan-406-even resolution_page 246

year-1725-scan-490-odd bad_page 415
year-1725-scan-491-even bad_page 416
year-1725-scan-491-odd bad_page 417
year-1725-scan-492-even bad_page 418
year-1725-scan-492-odd bad_page 419
year-1725-scan-493-even bad_page 420
year-1725-scan-493-odd bad_page 421
year-1725-scan-494-even bad_page 422
year-1725-scan-494-odd resolution_page 423
year-1725-scan-495-even resolution_page 424
year-1725-scan-495-odd resolution_page 425
year-1725-scan-496-even resolution_page 426
year-1725-scan-496-odd resolution_page 427
year-1725-scan-497-even resolution_page 428
year-1725-scan-497-odd resolution_page 429
year-1725-scan-498-even resolution_page 430
year-1725-scan-498-odd resolution_page 431
year-1725-scan-499-even resolution_page 432
year-1725-scan-499-odd resolution_page 433
year-1725-scan-500-even resolution_page 434
year-1725-scan-500-odd resolution_page 435
year-1725-scan-501-even resolution_page 436
year-1725-scan-501-odd resolution_page 437
year-1725-scan-502-even resolution_page 438
year-1725-sc

### Adjusting Incorrect Page Type Assignments and Numbered Page Numbers

**Problem 1**: For some pages the page type may be incorrectly identified (e.g. an index page identified as a resolution page or vice versa). This mainly happens on pages with little text content or pages where the columns are misidentified. 

**Solution**: Using the title pages as part separators, and knowing that the indices precede the resolution pages, we can identify misclassified pages and correct their labels.
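
As a rough illustration of this idea, the sketch below relabels pages according to the part they fall in. It assumes that a `title_page` label exists, that the first title page opens the index part and that every later title page opens (or continues) a resolution part. None of this is taken from `republic_page_checks`, whose actual rules may differ.

```python
def correct_page_types(page_types: list) -> list:
    """Relabel pages so that each page carries the type of the part it belongs to."""
    parts = ["index_page", "resolution_page"]  # indices precede resolutions
    current_part = None
    titles_seen = 0
    corrected = []
    for page_type in page_types:
        if page_type == "title_page":
            # Each title page moves on to the next part; later ones stay in resolutions.
            current_part = parts[min(titles_seen, len(parts) - 1)]
            titles_seen += 1
            corrected.append(page_type)
        elif current_part is None:
            corrected.append(page_type)  # front matter before the first title page
        else:
            corrected.append(current_part)
    return corrected


labels = ["bad_page", "title_page", "index_page", "bad_page", "resolution_page",
          "title_page", "resolution_page", "index_page"]
print(correct_page_types(labels))
```

Relabeling every page inside a part is deliberately blunt; genuinely empty pages would need a separate check before they are treated as index or resolution pages.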

**Problem 2**: Some pages are duplicates of the preceding scan. When the page-turning mechanism fails, subsequent scans are images of the same two pages. Duplicate pages should therefore come in pairs, that is, the even and odd sides of scan $n$ are duplicates of the even and odd sides of scan $n-1$. Shingling or straightforward text tiling won't work because of OCR variation: many words may be recognized slightly differently, and lines and words may not align.

**Solution**: Compare each pair of even+odd pages against the preceding pair of even+odd pages, using Levenshtein distance. This deals with slight character-level variations due to OCR. Most pairs will be very dissimilar. Use a heuristic threshold to determine whether pages are duplicates.
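
A self-contained sketch of this comparison is shown below. The normalization (one minus the distance divided by the longer length) and the threshold of 0.8 are assumptions chosen for illustration; `republic_page_checks.detect_duplicate_scans` may use a different measure or cut-off.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, computed row by row."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical texts."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def is_duplicate_scan(prev_pair: tuple, curr_pair: tuple, threshold: float = 0.8) -> bool:
    """Both sides (even, odd) must be near-identical to the preceding scan."""
    return all(similarity(p, c) >= threshold for p, c in zip(prev_pair, curr_pair))


prev_scan = ("Veneris den 5. Januarii 1725.", "PRAESIDE, Den Heere Bentinck.")
curr_scan = ("Veucris den 5. Januaris 1725", "PRASIDE, Den Heere Bentinck")
print(is_duplicate_scan(prev_scan, curr_scan))
```

Full pages contain several thousand characters, so the quadratic cost of the distance computation is noticeable; comparing only consecutive scans keeps the number of comparisons linear in the number of scans.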

**Problem 3**: Page numbers of numbered pages are reset per part, starting from page 1, but the title page separating the first and second halves of the year should not reset the page numbering.

**Solution**: Iterate over the pages, using a flag to keep track of whether we're in the indices part or a resolution part. If the title page is within the resolution part, update the page numbers by incrementing from the previous page.
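
The flag-based iteration could be sketched as below. The dictionary keys `page_type` and `type_page_num` mirror names used elsewhere in this notebook, but the `title_page` label and the choice to let the title page itself take the next number are assumptions, not the behaviour of `republic_page_checks.correct_page_numbers`.

```python
def continue_page_numbers(pages: list) -> list:
    """Keep numbering continuous once the resolution part has started."""
    in_resolution_part = False
    prev_num = None
    corrected = []
    for page in pages:
        page = dict(page)  # work on a copy, leave the input untouched
        if page["page_type"] == "resolution_page":
            in_resolution_part = True
        if in_resolution_part:
            if prev_num is not None:
                # Ignore the reset and continue from the previous page.
                page["type_page_num"] = prev_num + 1
            prev_num = page["type_page_num"]
        corrected.append(page)
    return corrected


pages = [
    {"page_type": "index_page", "type_page_num": 1},
    {"page_type": "title_page", "type_page_num": 1},
    {"page_type": "resolution_page", "type_page_num": 1},
    {"page_type": "resolution_page", "type_page_num": 2},
    {"page_type": "title_page", "type_page_num": 1},
    {"page_type": "resolution_page", "type_page_num": 1},
]
for page in continue_page_numbers(pages):
    print(page["page_type"], page["type_page_num"])
```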


In [15]:
import republic_page_checks

republic_page_checks.correct_page_types(year_config)




correcting: year-1725-scan-1-odd None bad_page False
Switching to part index_page
correcting: year-1725-scan-4-odd index_page bad_page False
correcting: year-1725-scan-6-even index_page bad_page False
correcting: year-1725-scan-9-even index_page bad_page False
correcting: year-1725-scan-11-even index_page bad_page False
correcting: year-1725-scan-11-odd index_page bad_page False
correcting: year-1725-scan-14-odd index_page bad_page False
correcting: year-1725-scan-15-odd index_page bad_page False
correcting: year-1725-scan-18-odd index_page bad_page False
correcting: year-1725-scan-19-odd index_page bad_page False
correcting: year-1725-scan-20-odd index_page bad_page False
correcting: year-1725-scan-21-odd index_page bad_page False
correcting: year-1725-scan-23-odd index_page bad_page False
correcting: year-1725-scan-24-odd index_page bad_page False
correcting: year-1725-scan-29-odd index_page bad_page False
correcting: year-1725-scan-31-odd index_page bad_page False
correcting: year-1

In [16]:

republic_page_checks.detect_duplicate_scans(year_config)


Page year-1725-scan-139-even is duplicate of page year-1725-scan-138-even
Page year-1725-scan-139-odd is duplicate of page year-1725-scan-138-odd
Page year-1725-scan-279-even is duplicate of page year-1725-scan-278-even
Page year-1725-scan-279-odd is duplicate of page year-1725-scan-278-odd
Page year-1725-scan-286-even is duplicate of page year-1725-scan-285-even
Page year-1725-scan-286-odd is duplicate of page year-1725-scan-285-odd
Page year-1725-scan-492-even is duplicate of page year-1725-scan-491-even
Page year-1725-scan-492-odd is duplicate of page year-1725-scan-491-odd

Done!


In [17]:

republic_page_checks.correct_page_numbers(year_config)



CORRECTING FOR DUPLICATE SCAN: year-1725-scan-139-even 188 186
CORRECTING FOR DUPLICATE SCAN: year-1725-scan-139-odd 189 187
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-140-even FROM 190 TO 188:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-140-odd FROM 191 TO 189:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-141-even FROM 192 TO 190:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-141-odd FROM 193 TO 191:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-142-even FROM 194 TO 192:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-142-odd FROM 195 TO 193:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-143-even FROM 196 TO 194:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-143-odd FROM 197 TO 195:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-144-even FROM 198 TO 196:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-144-odd FROM 199 TO 197:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-145-even FROM 200 TO 198:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-145-odd FROM 201 TO 199:
CORRECTING PAGE N

CORRECTING PAGE NUMBER OF PAGE year-1725-scan-199-even FROM 308 TO 306:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-199-odd FROM 309 TO 307:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-200-even FROM 310 TO 308:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-200-odd FROM 311 TO 309:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-201-even FROM 312 TO 310:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-201-odd FROM 313 TO 311:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-202-even FROM 314 TO 312:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-202-odd FROM 315 TO 313:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-203-even FROM 316 TO 314:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-203-odd FROM 317 TO 315:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-204-even FROM 318 TO 316:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-204-odd FROM 319 TO 317:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-205-even FROM 320 TO 318:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-205-odd FROM 321 TO 319:

CORRECTING PAGE NUMBER OF PAGE year-1725-scan-257-odd FROM 425 TO 423:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-258-even FROM 426 TO 424:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-258-odd FROM 427 TO 425:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-259-even FROM 428 TO 426:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-259-odd FROM 429 TO 427:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-260-even FROM 430 TO 428:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-260-odd FROM 431 TO 429:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-261-even FROM 432 TO 430:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-261-odd FROM 433 TO 431:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-262-even FROM 434 TO 432:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-262-odd FROM 435 TO 433:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-263-even FROM 436 TO 434:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-263-odd FROM 437 TO 435:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-264-even FROM 438 TO 436:

CORRECTING PAGE NUMBER OF PAGE year-1725-scan-318-odd FROM 71 TO 539:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-319-even FROM 72 TO 540:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-319-odd FROM 73 TO 541:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-320-even FROM 74 TO 542:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-320-odd FROM 75 TO 543:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-321-even FROM 76 TO 544:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-321-odd FROM 77 TO 545:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-322-even FROM 78 TO 546:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-322-odd FROM 79 TO 547:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-323-even FROM 80 TO 548:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-323-odd FROM 81 TO 549:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-324-even FROM 82 TO 550:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-324-odd FROM 83 TO 551:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-325-even FROM 84 TO 552:
CORRECTING PA

CORRECTING PAGE NUMBER OF PAGE year-1725-scan-379-even FROM 192 TO 660:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-379-odd FROM 193 TO 661:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-380-even FROM 194 TO 662:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-380-odd FROM 195 TO 663:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-381-even FROM 196 TO 664:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-381-odd FROM 197 TO 665:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-382-even FROM 198 TO 666:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-382-odd FROM 199 TO 667:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-383-even FROM 200 TO 668:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-383-odd FROM 201 TO 669:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-384-even FROM 202 TO 670:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-384-odd FROM 203 TO 671:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-385-even FROM 204 TO 672:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-385-odd FROM 205 TO 673:

CORRECTING PAGE NUMBER OF PAGE year-1725-scan-437-even FROM 308 TO 776:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-437-odd FROM 309 TO 777:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-438-even FROM 310 TO 778:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-438-odd FROM 311 TO 779:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-439-even FROM 312 TO 780:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-439-odd FROM 313 TO 781:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-440-even FROM 314 TO 782:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-440-odd FROM 315 TO 783:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-441-even FROM 316 TO 784:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-441-odd FROM 317 TO 785:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-442-even FROM 318 TO 786:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-442-odd FROM 319 TO 787:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-443-even FROM 320 TO 788:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-443-odd FROM 321 TO 789:

CORRECTING PAGE NUMBER OF PAGE year-1725-scan-495-even FROM 424 TO 890:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-495-odd FROM 425 TO 891:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-496-even FROM 426 TO 892:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-496-odd FROM 427 TO 893:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-497-even FROM 428 TO 894:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-497-odd FROM 429 TO 895:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-498-even FROM 430 TO 896:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-498-odd FROM 431 TO 897:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-499-even FROM 432 TO 898:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-499-odd FROM 433 TO 899:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-500-even FROM 434 TO 900:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-500-odd FROM 435 TO 901:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-501-even FROM 436 TO 902:
CORRECTING PAGE NUMBER OF PAGE year-1725-scan-501-odd FROM 437 TO 903:

### Parsing and Preprocessing Index Pages

- filter tiny and huge text elements (i.e. deviating from average character/word width and height)
- extract page lines that are part of the main text body containing index entries
- insert and clean up repetition symbols in index entries
    - determine length of repetition symbol
    - identify and replace mis-recognized repetition symbols
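
The first step, filtering tiny words, can be sketched with the word dictionaries that the parser prints (keys such as `width` and `word_text`). The inclusive threshold is an assumption based on the fact that a word of width 15 is reported as tiny when `tiny_word_width` is 15; filtering huge elements would work the same way with an upper limit.

```python
def remove_tiny_words(words: list, tiny_word_width: int = 15) -> list:
    """Drop words whose bounding box is at most tiny_word_width pixels wide."""
    return [word for word in words if word["width"] > tiny_word_width]


words = [
    {"word_text": ".", "width": 2, "height": 2},
    {"word_text": "5", "width": 15, "height": 1},
    {"word_text": "Admiraliteyt", "width": 240, "height": 38},
]
print([word["word_text"] for word in remove_tiny_words(words)])
```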


In [9]:
#from republic_index_page_parser import index_lemmata
from collections import defaultdict
import republic_index_page_parser as index_parser
import republic_elasticsearch as rep_es

avg_left = 0
lemma_index = defaultdict(list)
curr_lemma = None
    

for page_id in pages_info:
    page_doc = rep_es.retrieve_page_doc(page_id, year_config)
    print("\n\n", page_id)
    if page_doc["page_type"] != "index_page":
        print("skipping non-index page")
        continue
    page_doc["num_page_ref_lines"] = index_parser.count_page_ref_lines(page_doc)
    for column_info in page_doc["columns"]:
        print("\n\n", column_info["column_id"])
        column_hocr = column_info["column_hocr"]
        lines = index_parser.get_index_entry_lines(column_hocr)
        curr_lemma = index_parser.index_lemmata(column_info["column_id"], lines, lemma_index, curr_lemma)
        print("returned lemma:", curr_lemma)






 year-1725-scan-1-odd
skipping non-index page


 year-1725-scan-4-odd


 scan-4-odd-0
returned lemma: None


 scan-4-odd-1
returned lemma: None


 year-1725-scan-5-odd


 scan-5-odd-0
possibly misrecognised repeat symbol: 132 SNE
possibly misrecognised repeat symbol: 356 OE
possibly misrecognised repeat symbol: 137 mm
DEVIATING LINE: 213 [99, 213, 229, 215, 261, 218] AM Uae re aan Conrmijari/-
avg_repeat_symbol_length: 208
DEVIATING LINE: 229 [99, 213, 229, 215, 261, 218, 113] NN A YA fen Inftructeurs áf-
avg_repeat_symbol_length: 208
DEVIATING LINE: 215 [99, 213, 229, 215, 261, 218, 113, 272] va EON  rewecen. 361. 4T9.
avg_repeat_symbol_length: 208
DEVIATING LINE: 261 [99, 213, 229, 215, 261, 218, 113, 272, 146] | VN WW, Aannemers van Fortif-
avg_repeat_symbol_length: 208
DEVIATING LINE: 218 [213, 229, 215, 261, 218, 113, 272, 146, 282] —— Ef catien, Temet om [ub-
avg_repeat_symbol_length: 208
DEVIATING LINE: 272 [215, 261, 218, 113, 272, 146, 282, 169, 85] om betilinge van achterft

19 149 6 continue 	      ven   te  examineeren   en  Naaldens  ende
20 150 5 continue 	      Meen onder gemaakt Yerwerck begreepen.
	PAGE_REFS: [919] 	CURR LEMMA: Lijfte
21 150 11 continue_stop 	      919.
22 105 -34 start 	     —— geauthorifeert  boven [eekere premie,
23 149 17 continue 	      impuniteyt aan de Medeplightige te beloven.
	PAGE_REFS: [936] 	CURR LEMMA: Lijfte
24 152 21 continue_stop 	      936.
HAS LEMMA: Admiraliteyt tot Amfterdam Oatfanger
setting lemma: Admiraliteyt
25 102 -29 start 	    Admiraliteyt  tot  Amfterdam     Oatfanger
26 150 25 continue 	      Groeninx gelaft Jcekere penningen te betalen.
27 81 -48 start 	   —— te berichten op het ver(oek van Schry-
28 147 18 continue 	      ver wegens premie van veroverde Turck[che
29 146 19 continue 	      Rovers.   $9.
30 100 -32 start 	    —— gelaft het grootfte Schip tot de Equi-
31 139 8 continue 	      pagie, te  verftercken  met  vijftigh  koppen.
	PAGE_REFS: [75] 	CURR LEMMA: Admiraliteyt
32 146 8 continue_stop 	

returned lemma: Baftie


 year-1725-scan-8-even


 scan-8-even-0
possibly misrecognised repeat symbol: 149 ne
possibly misrecognised repeat symbol: 198 o_O
possibly misrecognised repeat symbol: 134 in
DEVIATING LINE: 234 [156, 155, 107, 147, 234, 150, 104, 150, 277] Ei en Zoelmont midts vier duy[ent
avg_repeat_symbol_length: 156
DEVIATING LINE: 150 [147, 234, 150, 104, 150, 277, 105, 141, 49] landt. 31.
avg_repeat_symbol_length: 156
DEVIATING LINE: 277 [234, 150, 104, 150, 277, 105, 141, 49, 153] creditif en aangenaam. 96.
avg_repeat_symbol_length: 156
DEVIATING LINE: 153 [277, 105, 141, 49, 153, 98, 98, 99, 98] 137.
avg_repeat_symbol_length: 156
DEVIATING LINE: 266 [147, 97, 144, 141, 266, 143, 95, 141, 139] rapport, de Raadt van Stats te ad-
avg_repeat_symbol_length: 156
DEVIATING LINE: 261 [143, 95, 141, 139, 261, 138, 88, 136, 86] tiem voor het Regiment van d’ Aba-
avg_repeat_symbol_length: 156
DEVIATING LINE: 239 [105, 111, 114, 103, 239, 108, 78, 105, 108] Dro(ará te berichten op

HAS LEMMA: Bilderbeeck , advertentie. 1. 19: 31. 49. 46.
LEMMA: Bilderbeeck 
LEMMA: Bilderbeeck 
setting lemma: Bilderbeeck
14 61 -33 start_stop 	    Bilderbeeck, advertentie.  1. 19:  31. 49. 46.
	PAGE_REFS: [77, 84, 100, 110, 134, 153, 170, 185] 	CURR LEMMA: Bilderbeeck
15 112 16 continue_stop 	       77. 84:  100.  110.  134.  153.  170.  185.
	PAGE_REFS: [203, 214, 232, 235, 318, 339] 	CURR LEMMA: Bilderbeeck
16 116 20 continue_stop 	       203. 214. 232. 235.  261  z09. 318. 339.
	PAGE_REFS: [350, 356, 369, 376, 382, 385, 396, 430] 	CURR LEMMA: Bilderbeeck
17 115 13 continue_stop 	       350,  356. 369. 376. 382. 385. 396. 430.
	PAGE_REFS: [435, 466, 473, 487, 519] 	CURR LEMMA: Bilderbeeck
18 111 3 continue_stop 	       435: 495 466. 473. 487. 499.507. 519:
	PAGE_REFS: [531, 535, 631, 642, 655, 659] 	CURR LEMMA: Bilderbeeck
19 115 6 continue_stop 	       531. 535. Got. GIS. 631. 642. 655. 659:
	PAGE_REFS: [669, 678, 744, 755, 763, 774] 	CURR LEMMA: Bilderbeeck
20 113 8 continue_st

	PAGE_REFS: [295, 764] 	CURR LEMMA: Cromvoirt
60 106 11 continue_stop 	   tugaal.   295. 764.
returned lemma: Cromvoirt


 scan-10-even-1


ZeroDivisionError: division by zero

In [None]:
for lemma in lemma_index:
    print("\nTrefwoord:", lemma)
    #print(lemma_index[lemma])
    for entry in lemma_index[lemma]:
        pages = ", ".join([str(page_ref) for page_ref in entry["page_refs"]])
        description = entry["description"][:70]
        print("\tPagina:", pages, "\tBeschrijving:", description)

In [3]:
# scan 45 (odd side) is the first resolution page
# page num: 91

from fuzzy_context_searcher import FuzzyContextSearcher
import pandas as pd

config = {
    "char_match_threshold": 0.8,
    "ngram_threshold": 0.6,
    "levenshtein_threshold": 0.8,
    "ignorecase": False,
    "ngram_size": 3,
    "skip_size": 0,
}

fuzzy_searcher = FuzzyContextSearcher(config)

keywords = [
    "Admiraliteyt tot Amfterdam", 
    "Admiraliteyt in het Noorder Quartier", 
    "Admiraliteyt in Vrieslandt", 
    "Admiralteyt in Zeelandt",
    "Varckens"
]

distractor_terms = {
    "Admiraliteyt tot Amfterdam": {
        "Admiraliteyt in het Noorder Quartier", "Admiraliteyt in Vrieslandt", "Admiralteyt in Zeelandt"
    },
    "Admiraliteyt in het Noorder Quartier": {
        "Admiraliteyt tot Amfterdam", "Admiraliteyt in Vrieslandt", "Admiralteyt in Zeelandt"
    },
    "Admiraliteyt in Vrieslandt": {
        "Admiraliteyt tot Amfterdam", "Admiraliteyt in het Noorder Quartier", "Admiralteyt in Zeelandt"
    },
    "Admiralteyt in Zeelandt": {
        "Admiraliteyt tot Amfterdam", "Admiraliteyt in het Noorder Quartier", "Admiraliteyt in Vrieslandt"
    },
}
fuzzy_searcher.index_keywords(keywords)
fuzzy_searcher.index_distractor_terms(distractor_terms)

hocr_resolution_pages = []



### Extracting Resolutions From Pages

Identify:

- resolution dates
- resolution participant lists
- resolution text blocks
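
Because of OCR errors, resolution dates rarely appear exactly as expected, so they have to be located approximately. The sketch below uses a sliding window with `difflib.SequenceMatcher` from the standard library as a stand-in; it is not how `FuzzyContextSearcher` works internally, and the function name and the threshold of 0.8 are assumptions.

```python
from difflib import SequenceMatcher


def find_fuzzy(phrase: str, text: str, threshold: float = 0.8):
    """Return (offset, matched_string, score) for the best window, or None."""
    window_size = len(phrase)
    best = None
    for offset in range(max(1, len(text) - window_size + 1)):
        window = text[offset:offset + window_size]
        score = SequenceMatcher(None, phrase, window, autojunk=False).ratio()
        if best is None or score > best[2]:
            best = (offset, window, score)
    if best is not None and best[2] >= threshold:
        return best
    return None


text = "1735. ie Veucris den 5. Januaris 1725. PR&ASIDE, Den Heere Bentinck."
print(find_fuzzy("Veneris den 5. Januarii 1725.", text))
print(find_fuzzy("Mercuri den 10. Januarii 1725.", "Varckens"))
```

Scanning every offset is simple but slow on long paragraphs; an n-gram index, as suggested by the `ngram_size` setting used in this notebook, is one way to narrow down candidate positions first.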

In [5]:
from fuzzy_context_searcher import FuzzyContextSearcher
from fuzzy_person_name_searcher import FuzzyPersonNameSearcher
from republic_phrase_model import resolution_phrases, participant_list_phrases, spelling_variants

config = {
    "char_match_threshold": 0.8,
    "ngram_threshold": 0.6,
    "levenshtein_threshold": 0.8,
    "ignorecase": False,
    "ngram_size": 2,
    "skip_size": 2,
}


fuzzy_searcher = FuzzyContextSearcher(config)
fuzzy_person_searcher = FuzzyPersonNameSearcher(config)

fuzzy_searcher.index_keywords(resolution_phrases)
fuzzy_searcher.index_spelling_variants(spelling_variants)
#fuzzy_searcher.index_distractor_terms(distractor_terms)



In [1]:
from fuzzy_context_searcher import FuzzyContextSearcher
import republic_elasticsearch as rep_es

config = {
    "char_match_threshold": 0.8,
    "ngram_threshold": 0.6,
    "levenshtein_threshold": 0.8,
    "ignorecase": False,
    "ngram_size": 2,
    "skip_size": 2,
    "paragraph_index": "republic_paragraphs",
    "paragraph_doc_type": "paragraph"
}

missing_dates = [
    {"date_string": "Veneris den 5. Januarii 1725.", "page_start": 11, "page_end": 14},
    {"date_string": "Mercuri den 10. Januarii 1725.", "page_start": 21, "page_end": 28},
]

fuzzy_searcher = FuzzyContextSearcher(config)

for missing_date in missing_dates:
    fuzzy_searcher.index_keywords([missing_date["date_string"]])
    for page_num in range(missing_date["page_start"], missing_date["page_end"] + 1):
        paragraphs = rep_es.retrieve_paragraph_by_type_page_number(page_num, config)
        for paragraph in paragraphs:
            matches = fuzzy_searcher.find_candidates(paragraph["text"])
            for match in matches:
                print(match)
                print("page: {}\tDate: {}\tText string: {}\n".format(page_num, match["match_keyword"], match["match_string"]))
                print(paragraph["text"])


{'match_keyword': 'Veneris den 5. Januarii 1725.', 'match_term': 'Veneris den 5. Januarii 1725.', 'match_string': 'Veucris den 5. Januaris 1725', 'match_offset': 9, 'char_match': 0.8620689655172413, 'ngram_match': 0.7666666666666667, 'levenshtein_distance': 0.8620689655172413}
page: 13	Date: Veneris den 5. Januarii 1725.	Text string: Veucris den 5. Januaris 1725

1735. ie Veucris den 5. Januaris 1725. PR&ASIDE, Den Heere Bentinck. PRASENTIEBUS, De Heeren Jan Welderen van Dam, Torck met een extraordinaris Gedeputeerde uyt de Provincie van Gelderlandt. Van Maasdam vanden Boeizelaar Raadtpenfionaris van Hoornbeeck met een extraordinaris Gedeputeerde uyt de Provincie van Hollandt ende Welt-Vrieslandt. Velters, Ockere Noey; van Hoorn , met een extraordinaris Gedeputeerde uyt de Provincie van Zeelandt. Van Renswoude van Voor{t. Van Schwartzenbergh, vander Waayen, Vegilin Van I{elmuden. Van Iddekinge van Tamminga. 
{'match_keyword': 'Mercuri den 10. Januarii 1725.', 'match_term': 'Mercuri den

In [14]:
import datetime
from elasticsearch import Elasticsearch

import republic_paragraph_parser as para_parser
from republic_phrase_model import category_index
import republic_elasticsearch as rep_es


page_index = "republic_hocr_pages"
page_doc_type = "page"

es = Elasticsearch()

def delete_es_index(index):
    if es.indices.exists(index=index):
        print("exists, deleting")
        es.indices.delete(index=index)

# Define the year first: base_config refers to it.
year = 1725

base_config = {
    "year": year,
    "base_dir": "../../../Data/Projects/REPUBLIC/hocr/",
    "page_index": "republic_hocr_pages",
    "page_doc_type": "page",
    "paragraph_index": "republic_paragraphs",
    "paragraph_doc_type": "paragraph",
    "tiny_word_width": 15,  # pixel width
    "avg_char_width": 20,
    "remove_tiny_words": True,
    "remove_line_numbers": False,
}

year_config = set_config_year(base_config, year)
pages_info = get_pages_info(year_config)

#delete_es_index(year_config["paragraph_index"])

# start on January first

current_date = {
    "month_day": 1,
    "month_name": "Januarii",
    "month": 1,
    "week_day_name": None,
    "year": year
}
    
for page_id in pages_info:
    start_scan = 1
    end_scan = 600
    if pages_info[page_id]["scan_num"] < start_scan or pages_info[page_id]["scan_num"] > end_scan:
        continue
    page_doc = rep_es.retrieve_page_doc(page_id, year_config)
    #if pages_info[page_id]["page_type"] != "resolution_page":
    if page_doc["page_type"] != "resolution_page":
        continue
    paragraphs, header = para_parser.get_resolution_page_paragraphs(page_doc)
    print("page_id:", page_id, "\ttype:", page_doc["page_type"], "\tnum columns:", len(page_doc["columns"]))
    #print("num columns:", len(page_doc["columns"]), "\theader lines:", [line["line_text"] for line in header])
    for paragraph_order, paragraph in enumerate(paragraphs):
        paragraph_text = para_parser.merge_paragraph_lines(paragraph)
        paragraph["metadata"]["categories"] = set()
        paragraph["text"] = paragraph_text
        paragraph["metadata"]["paragraph_num_on_page"] = paragraph_order
        paragraph["metadata"]["paragraph_id"] = "{}-para-{}".format(page_id, paragraph_order)
        #print(paragraph_text, "\n\n")
        matches = fuzzy_searcher.find_candidates(paragraph_text, include_variants=True)
        if len(matches) == 0 and para_parser.paragraph_starts_with_centered_date(paragraph):
            print("DATE LINE:", paragraph_text)
            current_date = para_parser.extract_meeting_date(paragraph, year, current_date)
            #print("paragraph_text:", paragraph_text)
        if para_parser.matches_participant_list(matches):
            print("DAY START:", paragraph_text)
            #context_match = fuzzy_searcher.get_term_context(paragraph_text, match, context_size=200)
            #print(context_match)
            #person_matches = fuzzy_person_searcher.find_person_names_in_text(context_match["match_term_in_context"])
            if para_parser.paragraph_starts_with_centered_date(paragraph):
                current_date = para_parser.extract_meeting_date(paragraph, year, current_date)
                paragraph["metadata"]["categories"].add("meeting_date")
            paragraph["metadata"]["type"] = "participant_list"
            #print("\n\tCurrent date: {}\n".format(current_date))
            #person_matches = fuzzy_person_searcher.find_person_names_in_context(context_match)
            #for person_match in person_matches:
            #    print("\t", person_match)
        if para_parser.matches_resolution_phrase(matches):
            paragraph["metadata"]["type"] = "resolution"
        paragraph["metadata"]["meeting_date_info"] = current_date
        if current_date:
            paragraph["metadata"]["meeting_date"] = datetime.date(current_date["year"], current_date["month"], current_date["month_day"])
        paragraph["metadata"]["keyword_matches"] = matches
        for match in matches:
            #print("\t{}\t{}".format(match["match_keyword"], match["match_string"]))
            if match["match_keyword"] in category_index:
                category = category_index[match["match_keyword"]]
                match["match_category"] = category
                paragraph["metadata"]["categories"].add(category)
                
        #print(paragraph["metadata"]["categories"])
        #print("\n\n\n")
        paragraph["metadata"]["categories"] = list(paragraph["metadata"]["categories"])
        del paragraph["lines"]
        es.index(index=year_config["paragraph_index"], doc_type=year_config["paragraph_doc_type"], 
                 id=paragraph["metadata"]["paragraph_id"], body=paragraph)



Number of scan files: 2161
page_id: year-1725-scan-45-odd 	type: resolution_page 	num columns: 2
DAY START: Martis den 2. Jannarii 1725. PRASIDE, Den Heere Bentinck. PRASENTIBUS, De Heeren Van Dam Torck, Ham, met een extraordinarts Gedeputeerde uyt de Provincie van Gelderlandt. Van Maasdam, vanden Boetzelaar , Boon, Raadtpenfionaris van Hoornbeeck met een extraordinaris Gedeputeerde uyt de Provincie van Hollandt ende Weû-Fries= landt. Velters, Ockerfe 5 Noey van Hoorn, met een extraordinaris Gedeputeerde uyt de Provincie van Zeelandt. Van Renswoude van Voork. Van Schwartzenbergh vander Waayen Vegilin. Van Ifelmuden, Van Iddekinge, van Tamminga. 
page_id: year-1725-scan-46-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-46-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-47-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-47-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-48-even 	type: resolution_page 	num colu

page_id: year-1725-scan-65-even 	type: resolution_page 	num columns: 2
DATE LINE: Jovis den 18. Januariì 1725. PRASIDE;:4Den Heere Van Welderen. 
page_id: year-1725-scan-65-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-66-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-66-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-67-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-67-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-68-even 	type: resolution_page 	num columns: 2
DAY START: Veneris den 19. Januarii 1725. PRAESIDE, Den Heere an Welderen. PRAESENTIBUS, De Heeren Van Singendonck van Dam, van Hynbergen, Torck met een extraor= dinaris Gedeputeerde uyt de Provincie van Gelderlandt. Van Maasdam, Backer Zuylen van Ny= velt, Boon, Raadtpen[ionaris van Hoorn= beeck , miet twee extraordinaris Gedeputeerdenuyt de Provincie van Hollandt en Wel» Prieslandt. Velters, Ockere Noey, van Hoorn, met een ex

page_id: year-1725-scan-85-even 	type: resolution_page 	num columns: 2
DAY START: Mercuri den 31. Jannarit 1725. PRASIDE, Den Heere Velters. PR&ASENTIBUS, De Heeren Zan Welderen van Singendonck met een extraordinaris Gedeputeerde uyt de Provincie van Gelderlandt. Van Maasdam, Eelbo Zuylen van Nyvelt Boon Deym Raadtpenfionaris van Hoornbeeck met een extraordinaris Gedeputeerde uyt \de Provincie van Hollandt ende Wef-Vrieslandt. Ockerje, Noey, van Hoorn met een extraordinaris Gedeputeerde uyt de Provincie van Zeelandt. Taats van Amerongen van Renswoude. Van Schwartzenbergh vander Waayen , Verilin. Bentinck, van Haar[olte , van Ifelmuden. Van Iddekinge. 
page_id: year-1725-scan-85-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-86-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-86-odd 	type: resolution_page 	num columns: 2
DAY START: Fovis den 1, Februari 1725. PRASIDE, Den Heere Velters. PRASENTIBUS, De Heeren Van Welderen van Singendunck, van Dam v

page_id: year-1725-scan-107-odd 	type: resolution_page 	num columns: 2
DAY START: Martis den 20, Februarit 1725. PRASIDE, Den Heere Van Haar[òlte. PRESENTIBUS, De Heeren Van Jelderen van Siùgeidonck, van Wynbergen, met twet extraordinaris Gedeputeerden uyt de Provincie van Gelderlandt. Van Maasdam Eelbo Zuylen van Nyvelt, Boon, Raadtpen{ionzris van Hoorn= beeck. Velters Ockerfe, Noey , van Hoorn; met een extraordmaris Gedeputeerdé uyt de Provincie van Zeelandt, Van Renswoude. Van Schwartzenbsrob. Van IJelmuden. Tamminga. 
page_id: year-1725-scan-108-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-108-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-109-even 	type: resolution_page 	num columns: 2
DAY START: T° ter Vergaderinge gelefen de Requefte van Cornelis Cornkoper Heere van den Houte, Schout van Etten Leur erde Sprundel; houdende, dat haar Hoogh Mogende by der felver Refolutie van den vyfden Februarii feventien hondert vyf en twintigh, hebbende

DATE LINE: Dominica den 4. Maart 1725. Nibil aëtum eff. 
DAY START: Lune den 5. Maart 1725, PRASIDE, Den Heere an Helderen. PR &
page_id: year-1725-scan-125-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-125-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-126-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-126-odd 	type: resolution_page 	num columns: 2
DAY START: Mercuri: den 7. Maart 1725. PRAESIDE, Den Heere Van Welderen. 
page_id: year-1725-scan-127-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-127-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-128-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-128-odd 	type: resolution_page 	num columns: 2
DAY START: Jovis den 8. Maart 1725. PRASIDE, Den Heere an Heldere. PR&SENTIBUS, De Heeren Haa Heuckelom van Singêndonck wan Wynbergen met eén exiraordinaris Gedeputeerde uyt de Provincie van Gelderlandt. Pan Maasdam vanden Bo

DAY START: Mercuri; den 21. Maart 1725. PR. SIDE, Den Heere Velters. PRAESENTIBUS, De Heeren Jan Heuckelom, van Singen» donck van Wynbergen met ‘een extraordinaris Gedeputeerde uyt de Provincie van Gelderlandt. Eelbo, Steyn, Backer Kerckhove, Boon , Deym Raadtpenfionaris van Hoornbeeck, met cen extraordinaris Gedeputeerde uyt de Provincie van Hollandt, ende WehFrieslandt. Ockerje , Noey van Hoorn met een extraordinaris Gedeputeerde uyt de Provin cie van Zeelandt. Van Voorft. Vegilin. Van Haarfolte van IJelmuden. Tamminga. 
page_id: year-1725-scan-146-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-147-even 	type: resolution_page 	num columns: 2
DAY START: Jovis den 12, Maart 1725. PRASIDE, Den Heere Pelters. PRASENTIBUS, De Heeren Jan Welderen van Heuckelord, van Singendonck van Wynbergen met een extraordinaris Gedeputeerde uyt de Provincie van Gelderlandt. Eelbo, Backer , Boon Deym RaadtpenJionaris van Hoornbeeck met een extraordinaris Gedeputeerde uyt de Provincie 

page_id: year-1725-scan-171-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-171-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-172-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-172-odd 	type: resolution_page 	num columns: 2
DAY START: Touvis den 5. April 1724. PRASIDE, Den Heere Pander Waayen. 3Q PR # 
page_id: year-1725-scan-173-even 	type: resolution_page 	num columns: 2
DAY START: PRAESENTIBUS, De Heeren Zan Heuckelom van Singendonck , van Wynbergen, met cen extraordinaris Gedeputeerde uyt de Provincie van Gelderlandt. Van Maasdam Eelbo, Zuylen van Nyvelt, Boon Raadtpenfonaris van Hoornbeeck, met cen extraordinaris Gedeputeerde uyt de Provincie van Hollandt ende Weh-Vrieslandt. Velters Noey, van Hoorn. Taats van Amerongen. Vegilin. Van Haar[olte van Ifelmuden. 
page_id: year-1725-scan-173-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-174-even 	type: resolution_page 	num columns: 2
page_id: year-1725-s

page_id: year-1725-scan-195-even 	type: resolution_page 	num columns: 2
DAY START: Jovis den 19. April 1725. PRASIDE, Den Heere 2’an Tamminga. PR,ESENTIBUS, De Heeren Van Heuckelom , met een extraordinaris Gedeputeerde uyt de Provincie van Gelderland. Dan Maasdam, Zuylen van Nyvelt, Boon, Raadtpenjionaris van Hoornbeeck met een extraordinaris Gedeputeerde uyt de Provincie van Hollandt ende Weh-Vrieslandt. Velters Noey , van Hoorn. Taats van Amerongen met een extraordinaTis Gedeputeerde uyt de Provincie van Utrecht. Pander Waayen , Vegilin, Portz.  Eeckhont. 
page_id: year-1725-scan-195-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-196-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-196-odd 	type: resolution_page 	num columns: 2
DAY START: Vencris den 10, April 1725. PRASIDE, Den Heere Van Tamminga. PPRASENTIUS, De Heeren Van Henckelom met een êx
page_id: year-1725-scan-197-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-197-o

page_id: year-1725-scan-215-odd 	type: resolution_page 	num columns: 2
DAY START: Sabbathi den $. Mey 1745. PRASIDE, Den Heere Zan Maasdam. PRA&ASENTIBUS, De Heeren Van Welderen E(enius met een extraordinaris Gedeputierde uyt de Provincie van Gelderlandt. Eelbo , van Bleskensgrave ; van Mar[eveens Boon, Raadtpenfionaris van Hoornbeeck: Velters, Ockere, Noey van Hoorn: Een extraordmaris Gedeputeerde uyt de Pros vincie van Utrecht. Vander Waayen Vegilin: Van IJelmuden, Emmen. 
DERIVING MEETING DATE FROM PREVIOUS DATE
{'month_day': 5, 'month_name': 'Mey', 'month': 5, 'week_day_name': 'Sabbathi', 'year': 1725}
page_id: year-1725-scan-216-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-216-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-217-even 	type: resolution_page 	num columns: 2
DATE LINE: Dominita den 6. Mey 17254. Nibil aftum ef. 
DATE LINE: Lune den 7. Moy 1725. 
page_id: year-1725-scan-217-odd 	type: resolution_page 	num columns: 2
page_id: ye

page_id: year-1725-scan-240-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-241-even 	type: resolution_page 	num columns: 2
DAY START: Mercurii den 30, Mey iN PRASIDE, Den Heere Zan IJelmuden. PRASENTIBTUS, De Heeren Van Welderen van Heuckelom Torck met drie extraordinaris Gedeputeerden uyt de Provincie van Gelder= landt. Van Maasdam. Pelters, Ocker(fe van Hoorn. Van Renswoude, van Voorh. Vegilin. Roue. Emmen. 
page_id: year-1725-scan-241-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-242-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-242-odd 	type: resolution_page 	num columns: 2
DAY START: Jovis den 31, Mey 1725. PRASIDE, Den Heere Rou/e. PRASENTIBUS, De Heeren Van Woelderen van Heuckes lom van Heeckerens, Torck, met drit x= traordinaris Gedeputeerden uyt de Provins cie van Gelderlandt. Haack Raadtpenfionaris van Hoornbeeck:  Pelters  Ocker(Je, van Hoorn. Pan Renswoude van Voor[t. Vegilin. Steenbergen. 
page_id: year-1725-sc

page_id: year-1725-scan-262-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-263-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-263-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-264-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-264-odd 	type: resolution_page 	num columns: 2
DAY START: Martis den 19, Junit 1725. PRASIDE, Den Heere Van Maasdam. PRASENTIBUS. De Heeren Jan Welderen van Heucketom , Singendonck , ván Heeckeren, Umibgroeven met een extraordinaris Gedeputeerde uyt de Provincie van Gelderlandt. Vanden Boetzelaar Raadtpenfionaris van Hoornbeeck. Ocker(Je. Van Voor. Van Schwartzenbergty Rou[e , Vriefen. Emmen, Tamminga. 
page_id: year-1725-scan-265-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-265-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-266-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-266-odd 	type: resolution_page 	num columns:

page_id: year-1725-scan-289-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-289-odd 	type: resolution_page 	num columns: 2
DAY START: Venecris den 6, Juli 1725. PRASIDE, Den Heere Zaats van Amerongen. PRASENTIBUS, De Heeren Van Heuckelom, van Heeckeven, Umbgroeven zw zmet een extra0rdinae vis Gedeputeerde uyt de Provincie van Gelderlandt. Van Maasdam Raadtpenfionaris van Hoorn= beeck. Ocker (Je. Van Schwartzenbergh, de Kempenaar. Van IJelmuden, Roufe, Vrie[en. Emmen. 
page_id: year-1725-scan-290-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-290-odd 	type: resolution_page 	num columns: 2
DAY START: Sabbathi den 7. Julii 1725. PRASIDE, Den Heere Taats van Amerongen. PRA&SENTIBUS, De Heeren Zan Jelderen van Heecke: ven Umbgroeven met een extraordina115 Gedeputeerde uyt de Provincie van Gelderland. F: 7 rs) Raatltpenfionaris van Hoorns eeck. OckerfJe , van Hoorn met een extraordinas vis Gedeputeerde uyt de Provincie van Zeelandt. Van Schwartzenberg

page_id: year-1725-scan-314-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-314-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-315-even 	type: resolution_page 	num columns: 2
DAY START: Sabbathi den 28. Juli 1725, PRASIDE, Den Heere Emmen. PRASENTIBUS, De Heeren en Heuckelom van ITeeckte ren, van IV yubergen met ecn extraordis aarts Gedeputeerde uyt de Provincie va Gelderlandt Pan 
page_id: year-1725-scan-315-odd 	type: resolution_page 	num columns: 2
DATE LINE: Dominica den 29. Juli 1725. Nibil aflum ef. 
DAY START: Lune den 30 Julië 17:74. PRASIDE, Den Heere Pan Heuckelom. PRA&ASENTIBUS, De Heeren Jan Heeckeren van IW ynbergen. Van Maasdam. 
page_id: year-1725-scan-316-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-316-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-317-even 	type: resolution_page 	num columns: 2
DAY START: Martis den 31. Juli 1725. PRAESIDE, Den Heerc #an Heuckelom. PR Ä
page_id: year-17

page_id: year-1725-scan-353-odd 	type: resolution_page 	num columns: 2
DAY START: Vencris den 10, Augufti 1725. PRASIDE, Den Heere Zan Maasdam. PRAESENTIBUS, De Heeren Jan Heuckelom van Heeckeven, van Wynbergen L. van Eck met een extraordinarts Gedeputeerde uyt de Provincie van Gelderlandt. Vanden Boetzelaar , Eelbo Boon Deym, Raadtpenfionaris van Hoornbeeck. Velters Ockere. Taats van Amerongen, van Voort. Van Schwartzenbergh, de Kempenaar. 
page_id: year-1725-scan-354-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-354-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-355-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-355-odd 	type: resolution_page 	num columns: 2
DATE LINE:  Sabbatbi den 11. Augufti 1725. BRE 5 1D E, Den Heere van Maasdam. 
page_id: year-1725-scan-356-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-356-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-357-even 	type: resolu

page_id: year-1725-scan-380-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-380-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-381-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-381-odd 	type: resolution_page 	num columns: 2
DAY START: Jovis den 6, September 1725. P Ras: D-E, Den Heere Rou/e. PRAESENTIBUS, De Heeren Jan Singendonck, van Heeckeven van Hynbergen. Pan Maasdam, vanden Boetzelaar van Marfeveen, Deym Raadtpen/ionaris van Hoornbeeck. Pelters, Noey, van Hoorn met ecn extraordinaris Gedeputeerde uyt de Provincie van Zeelandt. Van Schwartzenbergh, de Kempenaar. Steenbergen. 
page_id: year-1725-scan-382-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-382-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-383-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-383-odd 	type: resolution_page 	num columns: 2
DAY START: Veneris den 7. September 1725. PR:E£-S1 DE, Den Heere

page_id: year-1725-scan-407-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-407-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-408-even 	type: resolution_page 	num columns: 2
DAY START: Veneris den 21. September 1725. PRESIDE, Den Heere Van Singendonck. PR&SENTIBUS, ‚De Heeren Van Wynbergen. Van Maasdam Eelbo Steyn Boos. Raadtpenfionaris van Hoornbeeck. Velters, Noey , van Hoorn, Van Renswoude met een extraordinaris Gedeputeerde uyt de Provincie van Utrecht. Van Schwartzenbergh Portz met ce ex= traordinaris Gedeputeerde uyt de Provincie van Vrieslandt. ee (3 
page_id: year-1725-scan-408-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-409-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-409-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-410-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-410-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-411-even 	typ

page_id: year-1725-scan-434-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-434-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-435-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-435-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-436-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-436-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-437-even 	type: resolution_page 	num columns: 2
DAY START: Sabbathi den 13. Oltober 1725, PRE SIDE, Den Heere Zan Voorft. PR&ESENTIBUS, De Heeren Jan Sinzendonck van Wynberem. vó Maasdam vanden Boeizelaar Eelbo, Steyn van Marfeveen Boon Deym … Raadtpenfionaris van , Hoornbeeck met een extraordinaris Gedeputeerde uyt de Provincie van Hollandt ende Weft-Vrieslandt. Velters, Noey, van Hoorn. Eenextraordinaris Gedeputeerde nyt de Provincie van Utrecht. Portz.  Ronfe. De Drews. 
page_id: year-1725-scan-437-odd 	type: resolution_page 	num columns:

page_id: year-1725-scan-463-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-463-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-464-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-464-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-465-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-465-odd 	type: resolution_page 	num columns: 4
page_id: year-1725-scan-466-even 	type: resolution_page 	num columns: 4
page_id: year-1725-scan-466-odd 	type: resolution_page 	num columns: 3
page_id: year-1725-scan-467-even 	type: resolution_page 	num columns: 4
page_id: year-1725-scan-467-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-468-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-468-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-469-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-469-odd 	type: resolution_page 	num columns: 2

page_id: year-1725-scan-490-odd 	type: resolution_page 	num columns: 4
page_id: year-1725-scan-491-even 	type: resolution_page 	num columns: 4
page_id: year-1725-scan-491-odd 	type: resolution_page 	num columns: 4
page_id: year-1725-scan-492-even 	type: resolution_page 	num columns: 4
page_id: year-1725-scan-492-odd 	type: resolution_page 	num columns: 4
page_id: year-1725-scan-493-even 	type: resolution_page 	num columns: 3
page_id: year-1725-scan-493-odd 	type: resolution_page 	num columns: 4
page_id: year-1725-scan-494-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-494-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-495-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-495-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-496-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-496-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-497-even 	type: resolution_page 	num columns: 2

page_id: year-1725-scan-513-odd 	type: resolution_page 	num columns: 2
DAY START: Martis den 11, December 1725. PRASIDE, Den Heere an Haarfòlte. PRASENTIBUS. De Heeren Jan Welderen van Singenderek van Dam, Torck Umbgrove: ven. Van Maasdam, Zuylen van Nyvelt, Boon. Velters, Ocker(je, Noey, van Hoon. Taats van Amerongen van Renswoude , met cen extraordimaris Gedeputeerde uyt de Provincie van Utrecht. Vegilin. Van lelmuden ‚ Roue. De Drews. 
page_id: year-1725-scan-514-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-514-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-515-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-515-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-516-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-516-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-517-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-517-odd 	type: resolution_page 	

page_id: year-1725-scan-532-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-533-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-533-odd 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-534-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-534-odd 	type: resolution_page 	num columns: 2
DAY START: Veneris den 28. December 1725. PRASIDE, Den Heere Jan Dam. PRAESENTIBUS, De Heeren Zan -Wynbergen, Umbgroeven met een extraordinaris IE uyt de Provincie van Gelderlandt. Fan Maasdam Steyn, van. Mar[eveen Boon Deym  Raadtpen/ionatis - van Hoornbeeck, Velters , Ockere  Noey van Hoorn, met een extraordinaris Gedeputeerde uyt de Pro» vincie van Zeeland. Taats van Amerongen van Renswoude, met cen extraordinaris Gedeputeerde uyt de Provincie van Utrecht. Vegilin, Van Haarfolte van Ielmuden. 
page_id: year-1725-scan-535-even 	type: resolution_page 	num columns: 2
page_id: year-1725-scan-535-odd 	type: resolution_page 	num columns: 
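
The `DERIVING MEETING DATE FROM PREVIOUS DATE` message in the output above appears where OCR damage (`$.` for the day, `1745` for the year) makes the printed date unusable. How `para_parser.extract_meeting_date` recovers in that case is not shown in this notebook. The sketch below is one possible fallback, assuming the weekday name is still legible and that each meeting falls within a week after the previous one. The `WEEK_DAY_NUMBER` table and the `derive_next_date` function are hypothetical names introduced for this example.

```python
import datetime

# Hypothetical mapping from Latin weekday names to date.weekday() numbers (Monday = 0).
WEEK_DAY_NUMBER = {
    "Lunae": 0, "Martis": 1, "Mercurii": 2, "Jovis": 3,
    "Veneris": 4, "Sabbathi": 5, "Dominica": 6,
}


def derive_next_date(previous_date, week_day_name):
    """Return the first date after previous_date that falls on week_day_name.

    previous_date is a dict with "year", "month" and "month_day" keys,
    like the current_date structure used in this notebook.
    """
    previous = datetime.date(
        previous_date["year"], previous_date["month"], previous_date["month_day"]
    )
    target = WEEK_DAY_NUMBER[week_day_name]
    # Number of days to step forward, always between 1 and 7.
    offset = (target - previous.weekday() - 1) % 7 + 1
    return previous + datetime.timedelta(days=offset)


previous = {"year": 1725, "month": 5, "month_day": 4}
print(derive_next_date(previous, "Sabbathi"))
```

For the example in the output, the previous meeting date is assumed to be Friday 4 May 1725, which yields 5 May 1725 for `Sabbathi`, the same day and month as the derived date printed above.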

In [6]:
from collections import Counter

subphrase_searcher = FuzzyContextSearcher(config)
known_phrase_searcher = FuzzyContextSearcher(config)

keywords = [
    "WAAR op",
]

known_phrases = [
    "voor de genomen moeyte bedanckt",
    "waar by goedgevonden is",
    "gehouden voor gecommiteert",
    "WAAR op gedelibereert zijnde",
    "WAAR op geen refolutie is gevallen",
    "WAAR op geen refolutie voor alsnoch is gevallen",
    "WAAR op gedelibereert en in achtinge genomen zynde",
]

known_phrase_searcher.index_keywords(known_phrases)
subphrase_searcher.index_keywords(keywords)
#subphrase_searcher.index_spelling_variants(spelling_variants)

phrase_count = Counter()

for page_id in pages_info:
    current_date = None
    start_page = 44
    end_page = 680
    if pages_info[page_id]["scan_num"] < start_page or pages_info[page_id]["scan_num"] > end_page:
        continue
    page_doc = retrieve_page_doc(page_id)
    # Read the page type from the page document: looking it up in pages_info
    # raised KeyError: 'page_type' when this cell was last run.
    if page_doc["page_type"] != "resolution_page":
        continue
    paragraphs, header = get_resolution_page_paragraphs(page_doc)
    #print("num columns:", len(page_doc["columns"]), "\theader lines:", [line["line_text"] for line in header])
    #continue
    for paragraph in paragraphs:
        paragraph_text = merge_paragraph_lines(paragraph)
        #print(paragraph_text, "\n\n")
        matches = subphrase_searcher.find_candidates(paragraph_text, include_variants=True)
        for match in matches:
            context_match = subphrase_searcher.get_term_context(paragraph_text, match, context_size=50)
            match_string = context_match["match_string"]
            context = context_match["match_term_in_context"]
            known_phrase_matches = known_phrase_searcher.find_candidates(context)
            # Count known phrases; a separate loop variable avoids shadowing the outer match.
            for known_match in known_phrase_matches:
                phrase_count.update([known_match["match_keyword"]])
            if len(known_phrase_matches) > 0:
                continue
            # Print what follows the matched keyword when no known phrase covers it.
            parts = context.split(match_string)
            if len(parts) == 2:
                print(page_id, page_doc["page_type"])
                print(match_string + parts[1])
            else:
                print("UNEXPECTED NUMBER OF PARTS")
                print(context_match)


for phrase, freq in phrase_count.most_common():
    print(freq, phrase)

In [195]:
from collections import Counter

from fuzzy_context_searcher import FuzzyContextSearcher
from fuzzy_person_name_searcher import FuzzyPersonNameSearcher

config = {
    "char_match_threshold": 0.8,
    "ngram_threshold": 0.6,
    "levenshtein_threshold": 0.8,
    "ignorecase": False,
    "ngram_size": 2,
    "skip_size": 2,
    "max_lenght_variance": 3,
}


subphrase_searcher = FuzzyContextSearcher(config)
known_phrase_searcher = FuzzyContextSearcher(config)

keywords = [
    #"DE Refolutien, gifteren genomen",
    "BY refumptie gedelibereert zynde",
]

known_phrases = [
]

known_phrase_searcher.index_keywords(known_phrases)
subphrase_searcher.index_keywords(keywords)
#subphrase_searcher.index_spelling_variants(spelling_variants)

phrase_count = Counter()

for page_id in pages_info:
    current_date = None
    start_page = 60
    end_page = 265
    if pages_info[page_id]["page_type"] != "resolution_page":
        continue
    if pages_info[page_id]["scan_num"] < start_page or pages_info[page_id]["scan_num"] > end_page:
        continue
    #print(page_id, pages_info[page_id]["page_type"])
    page_doc = retrieve_page_doc(page_id)
    paragraphs, header = get_resolution_page_paragraphs(page_doc)
    #print("num columns:", len(page_doc["columns"]), "\theader lines:", [line["line_text"] for line in header])
    #continue
    for paragraph in paragraphs:
        paragraph_text = merge_paragraph_lines(paragraph)
        #print("PARAGRAPH TEXT:", paragraph_text, "\n\n")
        matches = subphrase_searcher.find_candidates(paragraph_text, include_variants=True)
        for match in matches:
            #print(match)
            context_match = subphrase_searcher.get_term_context(paragraph_text, match, context_size=50)
            match_string = context_match["match_string"]
            context = context_match["match_term_in_context"]
            known_phrase_matches = known_phrase_searcher.find_candidates(context)
            # A separate loop variable avoids shadowing the outer match.
            for known_match in known_phrase_matches:
                phrase_count.update([known_match["match_keyword"]])
            if len(known_phrase_matches) > 0:
                continue
            parts = context.split(match_string)
            if len(parts) == 2:
                print(page_id, pages_info[page_id]["page_type"])
                print("PARAGRAPH TEXT:", paragraph_text, "\n\n")
                print("\tMATCH_STRING:", match_string + parts[1])
            else:
                print("UNEXPECTED NUMBER OF PARTS")
                print(context_match)


for phrase, freq in phrase_count.most_common():
    print(freq, phrase)

scan-65-even resolution_page
PARAGRAPH TEXT: T° ter Vergaderinge gelefen de Requefte van het meerder gctal der Leden van de Magiftraat der Stadt ’s Hertogenbofch 5 verfoeckende permiflie om ter Griffie van haar Hoogh Mogende te mogen lichten Copie van het bericht door den Prz(ident en vier Scheepenen aân haar Hoogh Mogende overgegeeven op haare voorige Regquelte, raackende her aanftellen van een Marcktíchipper op Rotterdam, en oock op Haarlem. WAAR op gedelibereert zynde, als mede hy refumptie gedelibereert zynde op de voorige Requefte van de Supplianten en op het voorgemelde bericht ; 1s goedtgevonden en verftaan dat de quzftie tuflchen de Supplianten, ter eenre, en: gemelde Praefident ende Scheepenen ter andere zyde, over recht ván aanttellinge van Marckt{chippers, aan wien het felve competeert {al werden gerenvoyeert en gelaten aan den Raadt van Brabandt , om daar op te difponeeren foo als na verhoor van Partyen in goede juftitie fullen vinden te behooren Behouddijek dat Partyen ten

scan-128-odd resolution_page
PARAGRAPH TEXT: B refuinptie gedelibereert Zynde op het rapport van de Heeren de Ja Bafecour, Vegelin van Claarbergen cn Bentinck, haar Hoogh  


	MATCH_STRING: B refuinptie gedelibereert Zynde op het rapport van de Heeren de Ja Bafecour, Vegel
scan-134-odd resolution_page
PARAGRAPH TEXT: 1735 (17 B refumptie gedelibereert zynde op de Miflive van den Secretaris Rumpf en.  laatítelijck op die van den aghtienden der voorlede maandt waar by aan haar Hoogh Mogende heeft gereprz@fenteert, hoe dat hy by manquement van ander Caradter, niet alleen voor {ijn Perfoon qmam te mi(fen den vrydom van impofitien , maar oock dat niet in {taat was om nevens de Mini{ters van andere Princen en Staaten, het intereft van defen Staat en van de Ingezenen van dien door het bekomen van 4udientien en anderfints met de felve nadruck waar te nemen. IS goetgevonden en verítaan dat aan gemelden Secretaris Rumpf (al werden gegeven de caraêter van Refident, en dat ten dien eynde voor hem 

scan-197-odd resolution_page
PARAGRAPH TEXT: B refumptie gedelibereert zynde op de Miflive van de Magijtraat der Stadt Dordrecht, gefchreven aldaar den veeriienden defer loopende maandt , houdende dat door het overlijden van Mr. Johan Berck, in fijn leven Contrerolleur van de Convoyen en Licenten tot Hoorn, de Contrerolleursplaatfe aldaar is komen te vaceeren dat de voorfchreve Stadt Dordrecht competeert het recht, om tot vervullinge van dien te defpicieeren een bequaam Perfoon ; dat in achtingh genomen hebbende dc bes uaamheydt van Cornelis de Vries, haar De en Inwoonder, goedtgevonden hadden den felven tot Contrerolleur der Convoyen en Licenten tot Hoorn voornoemdt te defpicieeren ende aan haar Hoogh Mogende voor te draagen ; met verfoeck , dat haar Hoogh Mogendc gemelden Cornelis de Vries ‘als Contrerolleur aldaar gelieven te beëdigen en voor den {elven te doen de pefcheeren Commi{he in forma. Ende oock nagefien zynde haar Hoogh Mogende Refolutien van den twee en twintighften Januat
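
The loop above prints each matched phrase together with the text that follows it, by splitting the context on the match string and reporting "UNEXPECTED NUMBER OF PARTS" whenever the split does not yield exactly two parts. A minimal sketch of an alternative, assuming it is acceptable to cut at the first occurrence only: `str.partition` always returns three parts, so a match string that occurs twice in the context no longer needs a special case. The helper name `match_with_trailing_context` is hypothetical and not part of the notebook.

```python
def match_with_trailing_context(context, match_string):
    """Return the match string plus everything after its first occurrence.

    Returns None when the match string does not occur in the context.
    """
    before, found, after = context.partition(match_string)
    if not found:
        return None
    return match_string + after


# Sample context taken from the scan-65-even paragraph shown above.
context = "als mede hy refumptie gedelibereert zynde op de voorige Requefte"
print(match_with_trailing_context(context, "hy refumptie gedelibereert zynde"))
```

Unlike the `split` version, this keeps later repeats of the match string inside the trailing context instead of rejecting the match.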

## Experimental Corner

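A playground for trying out fuzzy search algorithms, starting with Levenshtein (edit) distance: first as a plain string-to-string score, then as an approximate search for a phrase inside a longer text.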

In [11]:
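def score_levenshtein_distance(s1, s2):
    """Return the Levenshtein (edit) distance between s1 and s2.

    Only the previous row of the distance matrix is kept in memory.
    """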
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    #distances = [0] * (len(s1) + 1)
    #print("distances initial:",distances)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        #print("distances_:", distances_)
        #print("i2:", i2, "c2:", c2)
        for i1, c1 in enumerate(s1):
            #print("i1:", i1, "c1:", c1)
            if c1 == c2:
                distances_.append(distances[i1])
                #print(distances_, "equal")
            else:
                #print("\tminimum of:", distances[i1], distances[i1 + 1], distances_[-1])
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
                #print(distances_, "unequal")
        distances = distances_
        #distances = distances_
        #print("distances:", distances, s2[i2:i2+len(s1)])
    return distances[-1]

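def get_match(matrix, T):
    """Trace back through the distance matrix to recover the matched part of T.

    matrix holds one row per character of T consumed so far. col_index walks
    those rows (positions in T); row_index walks the entries within a row
    (positions in the pattern). Returns (match_string, start_offset_in_T, distance).
    """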
    match = ""
    col_index = len(matrix) - 1
    distance = matrix[-1][-1]
    row_index = len(matrix[-1]) - 1
    while row_index > 0:
        if col_index < 0:
            print("BREAKING")
            break
        #print("col_index:", col_index, "row_index:", row_index, "distance:", distance)
        if matrix[col_index-1][row_index-1] < distance:
            #print("replacing")
            match = T[col_index] + match
            distance = matrix[col_index-1][row_index-1]
            col_index -= 1
            row_index -= 1
        elif matrix[col_index][row_index-1] < distance:
            #print("inserting")
            #match = T[col_index-1] + match
            distance = matrix[col_index][row_index-1]
            row_index -= 1
        elif matrix[col_index-1][row_index] < distance:
            #print("deleting")
            match = T[col_index] + match
            distance = matrix[col_index-1][row_index]
            col_index -= 1
        elif matrix[col_index-1][row_index-1] == distance:
            #print("copying")
            match = T[col_index] + match
            col_index -= 1
            row_index -= 1
        else:
            print("This should never be printed")
        #print("row_index:", row_index)
        #print("match:", match)
    #print("returning match:", match)
    return (match, col_index+1, matrix[-1][-1])

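def search_levenshtein_distance(P, T, threshold=2):
    """Find substrings of text T within edit distance `threshold` of pattern P.

    Every row starts at 0 rather than at its row number, so a match may begin
    at any position in T. Returns a list of (match_string, start_offset, distance).
    """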
    if len(P) > len(T):
        P, T = T, P
    distances = range(len(P) + 1)
    matches = []
    matrix = []
    #distances = [0] * (len(T) + 1)
    #print("distances initial:",distances)
    for t_j, T_j in enumerate(T):
        distances_ = [0]
        #print("distances_:", distances_)
        #print("t_j:", t_j, "T_j:", T_j)
        for p_i, P_i in enumerate(P):
            #print("p_i:", p_i, "P_i:", P_i)
            if P_i == T_j:
                distances_.append(distances[p_i])
                #print(distances_, "equal")
            else:
                #print("\tminimum of:", distances[p_i], distances[p_i + 1], distances_[-1])
                distances_.append(1 + min((distances[p_i], distances[p_i + 1], distances_[-1])))
                #print(distances_, "unequal")
        distances = distances_
        #distances = distances_
        #print("distances:", distances, T[t_j-len(P):t_j])
        matrix += [distances]
        distance = distances[-1]
        if distance <= threshold:
            match = get_match(matrix, T)
            #print(T[:match[1]])
            #print(T[match[1]:match[1]+len(match[0])])
            #print(T[match[1]+len(match[0]):])
            matches.append(match)
    #for row in matrix:
    #    print(row)
    #for col_index, distance in enumerate(distances):
    return matches

target = "survey"
context = "surgery"
matches = search_levenshtein_distance(target, context)
print(matches)

target = "survey"
context = "the doctor is specialised in surgery of the heart"
matches = search_levenshtein_distance(target, context)
print(matches)



[('surge', 0, 2), ('surger', 0, 2), ('surgery', 0, 2)]
[('surge', 29, 2), ('surger', 29, 2), ('surgery', 29, 2)]
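
The distances in the output above are absolute edit counts, so a distance of 2 means something different for a six-letter word than for a twenty-character phrase. A minimal sketch of turning a distance into a ratio between 0 and 1, assuming normalization by the length of the longer string; whether `FuzzyContextSearcher` computes its `levenshtein_threshold` this way is not confirmed by this notebook, and `levenshtein_similarity` is a hypothetical helper name.

```python
def levenshtein_distance(s1, s2):
    """Plain edit distance, computed row by row (same recurrence as above)."""
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s1) + 1))
    for i2, c2 in enumerate(s2):
        current = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                current.append(previous[i1])
            else:
                current.append(1 + min(previous[i1], previous[i1 + 1], current[-1]))
        previous = current
    return previous[-1]


def levenshtein_similarity(s1, s2):
    """Assumed normalization: 1.0 for identical strings, lower for more edits."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_distance(s1, s2) / longest


print(levenshtein_similarity("survey", "surgery"))
print(levenshtein_similarity("Oude Heere Logement", "Oude Heeten Logement"))
```

Both pairs are two edits apart, but the longer phrase scores noticeably higher, which is why a ratio is easier to compare against a single threshold than a fixed edit count.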


In [327]:
texts = [
    "B. dr Bofch Mikelaer, zal op den 11 May , in de Keyzers Kroon verkopen , een uvrmuntënde pirty Schilderyen van de voornaernfte M;efter«; ils van de Oude Pslma extra goed , A'exander Veronecs, Pluweele Breugel , Wouwerman, J. vander Heydc, Laireffe, Bril en ancVc. Nagelaten door een voon.aem Liefhcbb.T ; als mede een fraye parry Teekenmgen en Miniatuur fchoone D'ukke Prenten van de voorn-emfte Mcefters. De Citilogu-zal ir.tyds te bekomen zyn by Jicób Carpi Konft Schilder, en by de gsm. Mikelaer. B. SÜgtenborff Mikelaer, za' op Woensdag den 25- Miy, t'Amft. in 't Oude Heeren Logement verkopen, een pirty Engelfche Manuft-Uuren et Wmkclw ren , beftaende in Lakend diverfe fbrteering vin * en 9 quart breed , Biyen, Kirfayen, Drogetreu,.Sergies, gcfc»om-c en gj'lrcepre Kalamankcn, S.urynen, diverfe Grynen en Stoften, Ssycn, Chitzen , Catoenenen Neteldoekeu , divetfc Kouffen,-1 !___. witte gebleekte. Li.inens„cn andere Goederen meer ;, alle* dacgj ïoer <k Verkoping te zien.. '",
    "Plülippus van d:r Land Makelaer, za! op d:.i : 5 May verkopen , een uytmuntend ko1 fti- Kabinet Schilderyen; als vin Pb. Woaverman, A. vin O.tilc, Rottenhimcr, G. Metzu, G. dc LairefTc, D. van Deeien, |. Srccn, D. Teniers, de- Oude Griffier. ]. en 4. Bcth, M. ds Hond-koere-, J. do Heem, J. Lingelbigh.en andere Mecfterj meer; nagelaten door den £J. Heer Secretaris Lambert Witzen , wier van de Citalogus by den gen*. Mikelaer te bekomen zullen zyn. Alte de gee-.e die iers te prereideeren hebben of verfftiulcKgt zyn op de mgchten Boedel vin Pieter Engels , tot Warmer overleden, gelieven luc prerenficn binr.e.-yden tyd vaa 5 weeleen 1:1 te leveren leo Wccsvi-lers rn Armvoogden tot Wormer ; dew-'ke de Penningen vm de y;em. Boedel geprovenieert binnen korre , na verloop van de ge.eyde C weken , by preferentie en concurrentie zulten diftri- MSfctti foi.nig als zy zu'ten vmdc.i te be'nojren,",
    "Jac. Torner Junior Mikelaer, prefènteert tegens primo May.of wel eerder :e verbuurcn, een zeer vermakelyke en welgelege Herberg genaemt RUSTENBURG ; gelegen buyten dc Urregtfj Poort op 't Ruften!., rger par! , met zyn tnodicufe Huyzinge , fraye Thuyn en Tuynhuy», Kolfoaen.Troktafel &o; aiwaer de neering veel jiren met fueces il gcoaen en nog werd gecontinueert, laetft bewoont dexr wylen Dirk He Lange. N ider onderngtinge bv gem. Makelae-. Alle de eeene die eenig regt, aélie of pretenfie beeft, of iets fchal lig zyn a-.n den r.eabindonneerden Boedel .-an Dirk van Schaik en Ja -metje Theuniffe, geweezene Herbe-g er in de Oude Loosdregt, werden-verzoet zulks op en aen ce geven ter Secretary aldaer, uyterlyk voor primo April 1745 , op potne vau eeuwig ftilfwygen en verftek.",
    "ArnoHns van Sprang-Mikelaer, zal op Maendag d-n 27 Vnny.t'Amfteid. in 't Oude Heeten Logement vetkopen, een wel ter nerinc 4_endc Hoys, In de Crom elboo»ftee» 't rte huvs -an den\"Dim,»rnaemt het Mo huys, daer de Baftille uythangt ; breeJet byßiljettr Vermeld : lemant nadet ondetrigi begreiende of genegen zynde dit Perceel uyt de hand te-koopen , fpieeke met de gem. Makelaet, by van eygendom «dagen voet de Veikoopdag te zien tullen zyn. Ajnüld:. s- au Spiang.M.kc!aei, zal op Maendag den 2- lünv , t'Aml. in 't Ó.de tfeeien Logement vetku. en, drie hegre, fterkeen weC-doortt üsnefde Huvfcn, ftaende nieft mi'k.nrlrien in deGroore aen dl-Vtoord-vde by de dwirsftiact • bieede by 3i,;**rte**i vermeld: lefnanrna-e.'onderri.ti.ig begeerende of geneegen z.ynie degemPi'tcedriaTt ie.__.ad.teko.pen,fpi?eke mei yucru, Mak_liei, by wien de bewyzen .van cyi*j_d-_- S dagen voot de:yeikoop_ag.tc _iea zullen ü\", cv",
    "Job. Haverkamp Mikelaer, zal op Maendag den 11 January 1745, t Amft. in t Oude Heere Logement verkopen, No. 1. Een hegt en fterk Huys; ftaende op de Weesperftraet tuffchen de Hect:n en Keyzersgragt. No.1. Eén dito Huys.ftaende naeft 't voorgaende. No. 3. Een Sdepart in een we! ter nering ftaende Huys en Erve op de N. Z. van de Agterburgwal op de hoek van de Pottebakkersfteeg. Breder by Biüetten gefpecifkeert. De bewyzen van eygendom en de Vylconditien zullen 8 dagen voor en op de Verkoopdag voormiddag» tot J 2 uuren toe te zien zyn ten Compt. van den Notaris E. Haverkamp. Uyt dé huni te Koop een wel beklante Banket en Confiturier» Winkel met deszelfs Gereetfchappen , waer in die affaire» circa SS jaeren met c.\".cd fucces zyn gecontinueert: Te bevragen by Hendrik Bofch en Gomp., in de Kalverftract, tuflehen de Öjjes-flaysen Olyfiigers-fteeg\" t'Amfterdam. Als meede het Huys te hu*r.",
    "J. van Zutphen Mikelaer , zal op Maendag den 14 September , t'Amft. in 't Oude Heeren Logement verkopen , een extra wel geleegen 3-ftreeksLiken-Raem , met zyn Thuyn , Huyzinge en Droogfcheevders Winkel, eertyts genaemt het Varken, en nu Buyren-Ruft, ftaende en aeleegen buyten de Raem-Pooit ren eynde het eerfte Sceene-Pad , zynde Stads Grond , en geteekent met. No! 27, 28 en 29, Alles breder by Biljetten en dagelyks te zien , de bewyzen van Eygendom zullen 8 dagen voor en op de verkoopdag te zien zyn '. ten Compt. van den Nots. Arnoud Roermond in de Krom-Elleboogfteeg , en nader onderrigting by gem. Makelaer. ; Alle die iets te pretendeeren mogten hebben ten.laften van wylen Jean Fredrik Bernard , in zyn Leven Boekverkoper t'Amft. v of die eenige Engagementen met dezelve hebben lopende , uyr wat hoofde het o:)-_zoude mogen zyn, als mede die aen der Boedel fchuJdig mogten wezen , of ook eenige Goederen van hem onder zig hebben , werden verzogt zulks op voor tieu ïsOitcber 1744. .;ten Compt. vanden Notaris Jan Ardinois , op de Cingel t'Amfterdam.",    
]

#target = "Makelaar"
target = "Oude Heere Logement"
for text in texts:
    matches = search_levenshtein_distance(target, text)
    print(matches)



[('Oude Heeren Logemen', 568, 2), ('Oude Heeren Logement', 568, 1), ('Oude Heeren Logement ', 568, 2)]
[]
[]
[('Oude Heeten Logement', 73, 2)]
[('Oude Heere Logeme', 74, 2), ('Oude Heere Logemen', 74, 1), ('Oude Heere Logement', 74, 0), ('Oude Heere Logement ', 74, 1), ('Oude Heere Logement v', 74, 2)]
[('Oude Heeren Logemen', 74, 2), ('Oude Heeren Logement', 74, 1), ('Oude Heeren Logement ', 74, 2)]
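
Each hit in the output above is reported several times, once for every end position that stays within the threshold, all sharing the same start offset. A minimal sketch of collapsing those to one candidate per start offset, assuming the lowest distance is the preferred match; `best_match_per_offset` is a hypothetical helper, not part of the notebook.

```python
def best_match_per_offset(matches):
    """Keep the lowest-distance (match_string, offset, distance) tuple per offset.

    On ties the first tuple seen for an offset is kept.
    """
    best = {}
    for match_string, offset, distance in matches:
        current = best.get(offset)
        if current is None or distance < current[2]:
            best[offset] = (match_string, offset, distance)
    return [best[offset] for offset in sorted(best)]


# Matches copied from the fifth output line above.
matches = [
    ("Oude Heere Logeme", 74, 2),
    ("Oude Heere Logemen", 74, 1),
    ("Oude Heere Logement", 74, 0),
    ("Oude Heere Logement ", 74, 1),
    ("Oude Heere Logement v", 74, 2),
]
print(best_match_per_offset(matches))
```

Grouping by start offset works here because `get_match` returns the same offset for every variant of one hit; matches that overlap but start at different offsets would still be reported separately.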
